
The current average size of documents in the index is 2kB, so we'll need ~3.5TB of storage. For 1 replica this could mean 5 i2.xlarge instances if we go for SSDs on AWS.
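For reference, the back-of-envelope math behind that sizing (a sketch; the 800 GB of local SSD per i2.xlarge is taken from AWS's published instance specs, not from the comment above):

    # Rough sizing check for the figures above.
    avg_doc_size = 2e3                 # ~2 kB per indexed document
    index_size = 3.5e12                # ~3.5 TB total index
    docs = index_size / avg_doc_size   # ~1.75 billion documents
    ssd_per_i2_xlarge = 800e9          # 1 x 800 GB local SSD per i2.xlarge
    instances = index_size / ssd_per_i2_xlarge   # ~4.4 -> 5 instances per copy of the index
    print(docs, instances)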


Thanks for the answer.

I suggest not using AWS if you know you'll need a server 24/7. Old-school hosting providers that offer dedicated servers are much cheaper for that use case.

There are several offers here in Europe where you can get an i7-6700, 64GB RAM, and 1TB SSD for less than €60/month. AWS would cost you at least 3-4x as much. You'll lose the flexibility of AWS, but save a ton of cash.


> AWS would cost you at least 3-4x as much. You'll lose the flexibility of AWS, but save a ton of cash.

Isn't there more to the analysis than just comparing CPU pricing before we can conclude it will save a lot of money?

It looks like their servers[1] use ~150TB of source data that's already hosted on AWS. The source .gz archives of the Common Crawl on AWS S3 are then imported onto Elasticsearch disks that are also hosted on AWS.

Pulling ~150TB of data at network speeds of ~30 megabytes/sec[2] would take about 60 days to transfer from AWS to another US datacenter like Rackspace.

(Copying data from AWS to AWS isn't instantaneous either but it won't take ~60 days. At 60 days, the next crawl archive would have been released before you finished importing the previous one!)
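The arithmetic behind the ~60 days, for anyone checking:

    # Transfer-time estimate for pulling the raw data off AWS.
    data_bytes = 150e12         # ~150 TB of source data
    rate = 30e6                 # ~30 megabytes/sec cross-provider throughput (from [2])
    days = data_bytes / rate / 86400
    print(days)                 # ~58 days, i.e. roughly two months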

Questions would be:

1) What are current 2016 network speeds between cloud providers?

2) What's the cost of ~150TB of network bandwidth?

3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?

[1]https://about.commonsearch.org/developer/operations

[2]http://www.networkworld.com/article/2187021/cloud-computing/...


> 1) What are current 2016 network speeds between cloud providers?

I'm pretty sure that if you need to ingest ~150TB you can pull it from AWS/S3 much faster than you think. To absorb ~150TB you'd need ~75 nodes (about 2TB of storage each). Since you can download partial slices of the Common Crawl, you can split the work across those 75 nodes downloading in parallel over 1Gbit/s ports, so you should be able to pull it down much more quickly than your estimate.

I'd bet you could pull ~150 megabytes/s in aggregate [~16 Mbit/s, i.e. ~2 MB/s, per node].
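Rough math for that guess (the node count and per-node rate are estimates, not measurements):

    # Aggregate ingest estimate for a 75-node cluster.
    nodes = 75
    per_node_rate = 2e6                  # ~2 MB/s (~16 Mbit/s) per node, a conservative guess
    aggregate = nodes * per_node_rate    # ~150 MB/s across the cluster
    days = 150e12 / aggregate / 86400
    print(aggregate, days)               # ~150 MB/s -> ~11.6 days for ~150 TB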

http://commoncrawl.org/the-data/get-started/

> The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3.
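Since the data is public on S3, splitting the download across nodes is simple. A minimal sketch (warc.paths.gz is the real per-crawl manifest file; the crawl ID, node count, and helper name here are illustrative assumptions):

    import gzip
    import urllib.request

    CRAWL_ID = "CC-MAIN-2016-07"   # example crawl; substitute the one you actually want
    MANIFEST = ("https://commoncrawl.s3.amazonaws.com/crawl-data/"
                + CRAWL_ID + "/warc.paths.gz")
    NUM_NODES = 75                 # node count discussed above
    NODE_ID = 0                    # 0..74, unique per worker node

    def warc_urls_for_node(node_id, num_nodes):
        """Yield the WARC URLs this node should download (simple modulo sharding)."""
        with urllib.request.urlopen(MANIFEST) as resp:
            paths = gzip.decompress(resp.read()).decode("utf-8").splitlines()
        for i, path in enumerate(paths):
            if i % num_nodes == node_id:
                yield "https://commoncrawl.s3.amazonaws.com/" + path

    if __name__ == "__main__":
        for url in warc_urls_for_node(NODE_ID, NUM_NODES):
            print(url)   # pipe into wget/curl or a download pool on this node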

https://www.hetzner.de/en/hosting/produkte_rootserver/ex41s

> 2) What's the cost of ~150TB of network bandwidth?

Effectively free, since incoming traffic isn't metered:

> There are no charges for overage. We will permanently restrict the connection speed if more than 30 TB/month are used (the basis for calculation is for outgoing traffic only. Incoming and internal traffic is not calculated). Optionally, the limit can be permanently cancelled by committing to pay € 1.39 per additional TB used. Please see here for information on how to proceed.
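In other words, this workload is almost all incoming traffic, which Hetzner doesn't count; even if it were outgoing, the overage would be modest (a sketch under the quoted terms):

    # Bandwidth cost under the quoted Hetzner terms.
    incoming_cost = 0.0                        # inbound and internal traffic are free
    # For comparison only: ~150 TB *outgoing* in one month, beyond the 30 TB cap:
    outgoing_overage_eur = (150 - 30) * 1.39   # ~167 EUR
    print(incoming_cost, outgoing_overage_eur)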

> 3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?

I suspect you are greatly overestimating the difficulties since most DCs basically let you ingest/download for free because of the asymmetry on their networks.


One thing to consider is that we can build the index on AWS and then only do replication with other datacenters at the Elasticsearch level, which is ~50x smaller than the raw data.
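A quick sanity check on the sizes involved (using the ~150TB raw and ~3.5TB index figures from earlier in the thread):

    # Only the Elasticsearch index would leave AWS, not the raw crawl data.
    raw = 150e12                   # ~150 TB of raw Common Crawl data
    index = 3.5e12                 # ~3.5 TB Elasticsearch index
    print(raw / index)             # ~43x smaller, in line with "~50x"
    print(index / 30e6 / 86400)    # ~1.4 days to replicate the index at 30 MB/s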



