
> 1) What are current 2016 network speeds between cloud providers?

I'm pretty sure that if you need to ingest ~150 TB, you can pull it from AWS/S3 much faster than you think. To absorb ~150 TB you'd need ~75 nodes (about 2 TB each). Since Common Crawl can be downloaded in parts, you can split the work across 75 nodes downloading in parallel over 1 Gbit/s ports and pull it down relatively quickly compared to your estimate.

I'd bet you could pull ~150 megabytes/s in aggregate (roughly 2 MB/s, or 16 Mbit/s, per node).
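Rough arithmetic on those figures (a quick sketch; the 150 MB/s aggregate rate is just my guess above, not a measurement):

    # Back-of-envelope check on the numbers above.
    # Assumptions: 150 TB total, 75 nodes, ~150 MB/s aggregate pull rate.
    TOTAL_BYTES = 150e12          # ~150 TB to ingest
    NODES = 75                    # one ~2 TB slice per node
    AGGREGATE_RATE = 150e6        # ~150 MB/s across all nodes

    per_node_rate = AGGREGATE_RATE / NODES   # bytes/s per node
    seconds = TOTAL_BYTES / AGGREGATE_RATE   # total wall-clock time

    print(f"per node: {per_node_rate / 1e6:.1f} MB/s "
          f"({per_node_rate * 8 / 1e6:.0f} Mbit/s)")  # ~2 MB/s, ~16 Mbit/s
    print(f"total time: {seconds / 86400:.1f} days")  # ~11.6 days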

http://commoncrawl.org/the-data/get-started/

> The Common Crawl dataset lives on Amazon S3 as part of the Amazon Public Datasets program. From Public Data Sets, you can download the files entirely free using HTTP or S3.
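As a concrete example, pulling a slice of the crawl anonymously from S3 looks roughly like this (a sketch; the "commoncrawl" bucket is the real public bucket, but the crawl-data key below is illustrative — get the current path listing from the get-started page):

    # Minimal sketch: anonymous S3 download of Common Crawl files with boto3.
    # The crawl-data prefix below is illustrative; fetch the real WARC path
    # listing (warc.paths.gz) for the crawl you actually want.
    import gzip
    import boto3
    from botocore import UNSIGNED
    from botocore.config import Config

    s3 = boto3.client("s3", config=Config(signature_version=UNSIGNED))
    BUCKET = "commoncrawl"

    # Fetch the list of WARC file keys for one crawl (hypothetical prefix).
    paths_key = "crawl-data/CC-MAIN-2016-07/warc.paths.gz"
    s3.download_file(BUCKET, paths_key, "warc.paths.gz")

    with gzip.open("warc.paths.gz", "rt") as f:
        warc_keys = [line.strip() for line in f]

    # Each node takes its own slice of the list, e.g. node i of 75:
    i, n = 0, 75
    for key in warc_keys[i::n]:
        s3.download_file(BUCKET, key, key.rsplit("/", 1)[-1])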

https://www.hetzner.de/en/hosting/produkte_rootserver/ex41s

> 2) What's the cost of ~150TB of network bandwidth?

Free.

> There are no charges for overage. We will permanently restrict the connection speed if more than 30 TB/month are used (the basis for calculation is for outgoing traffic only. Incoming and internal traffic is not calculated). Optionally, the limit can be permanently cancelled by committing to pay € 1.39 per additional TB used. Please see here for information on how to proceed.
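In other words, ingesting 150 TB costs nothing on that plan, since incoming traffic isn't counted at all; even the paid overage only applies to outgoing traffic. A quick sketch of the numbers under the quoted terms (the outgoing figure is purely hypothetical):

    # Cost math under the quoted Hetzner terms.
    INGEST_TB = 150
    incoming_cost = 0.0  # incoming/internal traffic is free per the quote

    # If the same volume were *outgoing* on the uncapped option instead:
    FREE_OUTGOING_TB = 30        # per month, before any overage applies
    EUR_PER_EXTRA_TB = 1.39
    outgoing_cost = max(0, INGEST_TB - FREE_OUTGOING_TB) * EUR_PER_EXTRA_TB

    print(f"ingest cost: {incoming_cost:.2f} EUR")                # 0.00 EUR
    print(f"hypothetical outgoing cost: {outgoing_cost:.2f} EUR") # 166.80 EUR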

> 3) From those datapoints, can we derive a rough rule-of-thumb where a certain amount of data exceeds the current capabilities (speed or economics) of the internet backbone available to projects like Common Search?

I suspect you are greatly overestimating the difficulties, since most data centers effectively let you ingest/download for free because of the asymmetry on their networks: outbound traffic dwarfs inbound, so incoming bandwidth costs them little to give away.


