We self-host Splunk and it can plow through petabytes of high-cardinality data pretty dang fast. If the fields aren't indexed and the search is complex, it can take minutes or hours, but usually I can get live and historic data back in a few seconds.
As an example, we have a pipeline of services. I can compute the time spent in each service at multiple percentile levels and group that data by high-cardinality fields (as in, hundreds of thousands of distinct values or more). I just ran a search over 4 hours of data across thousands of nodes for half a dozen or so services, with multiple eval statements all piped to a timechart doing over a dozen stats operations. Half a billion events, done in under a minute.
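A rough sketch of what that kind of search looks like in SPL (the index, sourcetype, and field names here are made up for illustration; they are not the actual ones from my environment):

```
index=pipeline sourcetype=service_events earliest=-4h
| eval duration_ms = exit_time - enter_time
| eval slow = if(duration_ms > 1000, 1, 0)
| timechart span=5m
    perc50(duration_ms) perc90(duration_ms) perc99(duration_ms)
    avg(duration_ms) sum(slow) count
    by service
```

The nice part is that all of those stats functions run in a single pass over the events, split by the `by` field.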
Splunk charges so much because they are just so dang powerful.
This has not been true in my experience. I run a Splunk server in production and at my data volumes it has been very performant. It's also much easier to set up and maintain than ELK clusters.
In the early days Splunk pricing was exorbitant (we evaluated Splunk 7 years ago and dismissed it), but licensing has changed in recent years and it is now priced by volume ingested (the pricing is transparent and listed on their website now). At low volumes, the pricing is similar to Sumologic, and is pretty accessible now to smaller dev shops. Open-source collectors like fluentd also help to intelligently reduce the ingest volume.
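As one example of trimming ingest before it counts against the license, a fluentd `grep` filter can drop noisy events at the collector. This is just a minimal sketch, assuming the events carry a `level` field; the tag pattern is hypothetical:

```
# Drop debug-level events before they reach the Splunk forwarder.
# The "app.**" tag and the "level" field are assumptions for this sketch.
<filter app.**>
  @type grep
  <exclude>
    key level
    pattern /^debug$/i
  </exclude>
</filter>
```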
Does the speed matter? How exactly?
I am genuinely curious: at Scalyr we _can_ be very fast, but it's a balance against cost, and we want to pass the savings on as lower prices. Same with self-hosted Elastic: you can fine-tune it to be fast, but staying within cost constraints slows it down. WDYT?
Yes, same here. I actually monitor the throughput of the network interfaces on our forwarder with prometheus/statsd_exporter, and if outbound is smaller than inbound it sets off alerts!
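Something like this Prometheus alerting rule captures that check. It's a sketch: the metric, instance, and device names assume node_exporter-style interface counters rather than whatever the statsd mapping actually exports:

```
# Fires when the forwarder sends less than it receives for 10 minutes,
# which can indicate dropped or backlogged events.
# Metric/label names below are assumptions for this sketch.
groups:
  - name: forwarder
    rules:
      - alert: ForwarderOutboundBelowInbound
        expr: |
          rate(node_network_transmit_bytes_total{instance="forwarder", device="eth0"}[5m])
            < rate(node_network_receive_bytes_total{instance="forwarder", device="eth0"}[5m])
        for: 10m
        annotations:
          summary: "Forwarder outbound throughput is below inbound"
```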