
There are plenty of architectures that do exactly this: EMR-on-S3, Google Dataproc on GCS, Snowflake-on-S3, BigQuery-on-GCS, etc.

The bigger point in the article is that these exact "take processing to the data" architectures operate exceedingly well on S3, GCS, Azure.

And, as a biased observer, I'd say these architectures operate best on GCS, thanks to the strong performance measured in the article, quick VM standup times, low VM prices, and per-minute billing.



I'm still trying to parse the docs and Manta source code to see what it actually does, but it seems unique if the data storage nodes are also the data processing nodes and no data transfer happens from some storage service before the job begins. The other key factor is having neither startup time nor the cost of a perpetually running cluster. Per my comment below [1], we have used Lambda with S3 to get something like this, as well as our own architecture built on plain EC2/GCE nodes.

[1] https://news.ycombinator.com/item?id=10846514


Not only that, but the thing is built by guys who really know what they are doing, like Bryan Cantrill and other former top people from Sun.


Got it. Thanks!


Are you sure you understand what "take the processing to the data" means?

EMR-on-S3 is the "copy the data to the processing nodes" variety.



