Almost double that, as we didn't hit the tier for the lower $0.05 pricing. We also used some of their more expensive regions, like Singapore, for part of the storage - the 23TB was spread across different regions.
Yes. AWS is that awesome in the number of different and sophisticated services they offer. Azure is probably the only other cloud that comes close. DigitalOcean is cheap, unreliable, and the number of services it offers can be counted on the fingers of one hand.
I could talk about the issues with DO on a daily basis. Once you are past MVP, your business needs a better home.
I would love to hear more about the issues you’ve experienced with DO. We have been using them more and more lately and have had a much better experience there than on AWS, GCP, or Linode.
That's a bad analogy but yes, AWS is a vast portfolio of services to solve your problems. If all you need are some bare-metal servers and nothing else then you really don't need AWS in the first place and are paying a premium for stuff you're not using.
They have a vast portfolio of services to solve resume-driven-architecture problems. The scale at which you outgrow two modern, high-end, load-balanced boxes is something 99% of projects will never reach. There are legitimate use cases for cloud services, but what the majority of projects use them for is not it. Reminds me of a service at FB that receives and processes data from roughly several hundred million requests per day; it runs on a single box and is a regular Python app. 99% of startups would implement the same thing as a data-processing pipeline with Kafka, Spark, Hadoop, or whatever else is "absolutely critical" to have, to process 1/1000th the data.
And comparing AWS to GCP is like comparing Hubble telescope to James Webb telescope :)
Recently I moved my infra to a more hybrid architecture, where all the complex services run on major cloud providers while stateless compute services run on cheap Vultr VMs. The result is quite nice!
1. Spin up 50 AWS Lightsail instances[0] for parallelism
2. In each instance download from S3 and upload to B2.
S3 to any AWS service in the same region is free[1]. $5 Lightsail instances come with 2TB of data transfer each, so 50 of them can easily handle 23TB. The whole transfer can be done within a few hours, so the total compute cost is less than $10 ($5 / 30 * 50 ≈ $8.33). Total data retrieval cost from S3 is ($0.0007 per GB) * 23,000GB = $16.10 (rough math below).
[1] "Transfers between S3 buckets or from Amazon S3 to any service(s) within the same AWS Region are free." according to https://aws.amazon.com/s3/pricing/
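Rough math, if anyone wants to sanity-check it (the prices are the ones quoted above, not gospel):

    # Back-of-envelope math for the Lightsail fan-out idea above.
    # Prices are the ones quoted in this thread; check the current AWS price list.
    instances       = 50
    lightsail_month = 5.00      # $/month for the smallest Lightsail plan
    transfer_per_vm = 2_000     # GB of transfer allowance per $5 instance
    data_gb         = 23_000    # ~23TB to move
    retrieval_rate  = 0.0007    # $/GB quoted above for pulling data out of S3

    assert instances * transfer_per_vm >= data_gb   # 100TB of allowance >> 23TB

    compute_cost   = lightsail_month / 30 * instances   # ~1 day of 50 instances
    retrieval_cost = retrieval_rate * data_gb

    print(f"compute   ~ ${compute_cost:.2f}")    # ~ $8.33
    print(f"retrieval ~ ${retrieval_cost:.2f}")  # ~ $16.10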
LightSail has (or at least had) a hard limit of 20 instances. They also have a soft limit of 2 instances, after which you must request an upgrade to a higher limit. I had to submit a support request explaining my intended usage. It took a week to get approved.
The stated reason for these limits is to avoid unexpectedly large bills. But I suspect that it's also to prevent crazy-ass strategies for getting around bandwidth costs.
Lightsail instances have terrible bandwidth throughput. The transfer allowances are high because it's all overbooked, low-priority traffic that gets shunted off their network as soon as possible.
Great writeup on data migrations. I was wondering whether you did a comparison of this method vs. using AWS Snowball[1] to export the S3 data and B2 Fireball[2] to ingest it.
We looked briefly at Snowball and Fireball, but we wanted to do this as quickly as possible while keeping the process entirely transparent to our users. It was also an excuse for our team to get intimately familiar with the B2 API, since it's not compatible with S3.
If we were to consider another large migration like this, physical media would probably be the way to go.
I evaluated moving 2PB with Snowball vs. putting in 10G/100G links. The issue with Snowball (I started a company that did what Snowball does, and shut it down - it failed) and other FedEx-a-RAID solutions is that you have three transfers. You think the LAN transfer will be quick, but you're generally rate-limited by the systems more than by the bandwidth-delay product. If you're in a high-traffic DC area, it's pretty easy to get temporary bandwidth or install circuits to carry that. 10g for 2pB is 18 days of transfer - which sounds like a lot - but that's 5 days of transfer on each site, 1 day of setup, and 1 week of shipping (i.e., the shipping route adds up to roughly the same 18 days). Those numbers aren't real, but they're close.
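The wire-time math is easy to check, roughly, if you ignore protocol overhead and assume the link is actually saturated:

    # How long does 2PB take over a saturated 10Gbit/s link?
    data_bits = 2e15 * 8        # 2PB in bits (decimal petabytes)
    link_bps  = 10e9            # 10Gbit/s
    seconds   = data_bits / link_bps
    print(seconds / 86_400)     # ~18.5 days, matching the figure above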
So Snowball works in a lot of areas, but like so many AWS products, it works if you adapt to it.
pigz/scp/zstd works extremely fast in a pipeline.
In your case you're pulling from S3 to another object store.
I moved ~1PB from one S3 region to another. "Why not use replication," they asked. That only works if it's turned on when you upload the object - another fine-print 'gotcha' in the easy AWS service. Then you get into rate limits. In 2010 I asked AWS if I could spin up 1,000 servers to test something - nope - elasticity at that level is for the big boys.
Now I work for a large cloud company, and we still run into elasticity limits.
To move the 1PB from one S3 region to another, we spun up hundreds of spot instances (oh, and we were compressing it and moving it to Glacier too) and built a Perl/MySQL batch job around an "s3 get | zstd | s3 put" process, then parallelized it. One nice thing about S3 is that it keeps the MD5 hash as the ETag - unless the object was uploaded multipart, in which case it's the hash of the part hashes, oh yeah... So you should split the file the same way in advance if you want to verify the hash (more fine print).
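If you want to reproduce a multipart ETag yourself, in practice (not by any official guarantee) it's the MD5 of the concatenated binary part MD5s with a "-<part count>" suffix, so you also need to know the part size the uploader used. Something like:

    import hashlib

    def multipart_etag(path, part_size=8 * 1024 * 1024):
        """Reproduce an S3 multipart ETag: MD5 of the concatenated part MD5
        digests, suffixed with the part count. Only matches if part_size is
        what the uploader used (8MiB is a common default, not a guarantee)."""
        digests = []
        with open(path, "rb") as f:
            while chunk := f.read(part_size):
                digests.append(hashlib.md5(chunk).digest())
        if len(digests) == 1:               # small objects get a plain MD5 ETag
            return digests[0].hex()
        combined = hashlib.md5(b"".join(digests))
        return f"{combined.hexdigest()}-{len(digests)}"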
Worked great. Good for you for sharing this project, very cool.
"10g for 2pB is 18 days of transfer - which sounds like a lot - but that's 5 days of transfer on each site, 1 day of setup, and 1 week of shipping."
I can confirm this, to some degree.
We have larger customers with 20 or 40 or 80 TB of data to bring into rsync.net and everyone is always very interested in physical delivery, which we offer, but it's always easier to nurse along a 20-30 day transfer than ship JBODs around.
As long as you have a transfer mechanism that can be resumed efficiently (such as zfs send) and you don't have terribly bad bandwidth, we always counsel just running the very long transfer. It does help that we are inside he.net in one location and two hops from their core in two others, and that we can order 10Gb circuits on a day's notice ... because he.net rocks.
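For the curious, a rough sketch of how a resumable transfer loop can be driven; the host and dataset names are made up, but "zfs receive -s" (keep partial state) and "zfs send -t <resume token>" are the standard resume mechanics:

    # Sketch: resume an interrupted "zfs send | ssh ... zfs receive -s".
    import subprocess

    def get_resume_token(host, dataset):
        """Return the receive_resume_token on the destination, or None."""
        r = subprocess.run(
            ["ssh", host, "zfs", "get", "-H", "-o", "value",
             "receive_resume_token", dataset],
            capture_output=True, text=True,
        )
        token = r.stdout.strip()
        return token if r.returncode == 0 and token not in ("", "-") else None

    def send_with_resume(snapshot, host, dst_dataset):
        token = get_resume_token(host, dst_dataset)
        send = ["zfs", "send", "-t", token] if token else ["zfs", "send", snapshot]
        sender = subprocess.Popen(send, stdout=subprocess.PIPE)
        # -s on the receiving side saves partial state if the stream dies
        subprocess.run(["ssh", host, "zfs", "receive", "-s", dst_dataset],
                       stdin=sender.stdout, check=True)
        sender.stdout.close()
        sender.wait()

    send_with_resume("tank/data@migration", "storage.example.com", "tank/data")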
As a physicist I've always found the name "elastic scaling" funny. If it's elastic in the physical sense, it means that the energy required to grow to some size is quadratic (or higher) in the size. The marketing meaning is "easy scaling", but the physical meaning is "really hard scaling".
E.g. compare a soap bubble versus a bubble gum bubble. It's a lot easier to scale up the soap bubble, which is not elastic.
It's a very good observation, and I think it's more than just a funny aside. The word 'elastic' connotes increasing resistance as the cluster grows, but this is a false intuition: from AWS's point of view, the 'resistance' to adding a node is small, fixed, and independent of cluster size. I suspect this is what makes cloud computing in general, and EC2 in particular, such a cash cow.
Moreover, it turns out that elasticity is a very valuable property of a cluster for most workloads; we want that intuition to be true - that the cluster meets resistance as it grows - in the sense that it will shrink when the workload decreases. This matches our economic intuition, too. We want this so much that we build another software layer to make it happen, e.g. Kubernetes.
We did a Snowball transfer of 150TB (mostly media files) from our on-prem DC. Cost is one thing we really failed to plan for. You're charged per day you have the Snowball (in our case, three Snowballs across two separate DCs).
During the transfer, the AWS sync constantly failed due to random issues, which drove up the total time to transfer the files. Something like a tilde (~) in a filename will totally break the sync, and you really need to keep track of where it failed. We were constantly crafting additional rules into our sync logic to catch the 'gotchas'.
Another point you alluded to is the ETag/MD5 sum that's stored in AWS. Pretty useful if you know how to use it...
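A hypothetical pre-flight check along those lines: walk the tree before the sync and flag filenames you already know will cause trouble. The tilde is the one that bit us; the non-ASCII check is just an extra guess, so extend the set with whatever your tooling chokes on:

    # Flag filenames likely to break the sync before shipping the Snowball.
    import os

    SUSPECT = {"~"}   # characters that have broken our sync rules before

    def awkward_names(root):
        for dirpath, _dirs, files in os.walk(root):
            for name in files:
                if any(c in SUSPECT for c in name) or not name.isascii():
                    yield os.path.join(dirpath, name)

    for path in awkward_names("/data/to-migrate"):
        print("check before syncing:", path)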
It appears to be an S3 clone, competing on lower prices and fewer micro-charges. Highlights compared to DigitalOcean and Wasabi:
DO/Wasabi: minimum $5/month, but great deals compared to AWS/GCE otherwise.
B2: First 10GB storage free (probably not including bandwidth).
For a side project or a startup looking for its first storage option, B2 seems compelling. But something important: is it a drop-in replacement? Is the API accessible from your platform?
Their API is not S3-compatible. There are client libraries for it on various platforms, though.
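The native flow is simple enough, though: authorize, ask for an upload URL, then POST the file with its SHA1. A minimal sketch with requests (no retries, no large-file API, and you wouldn't really read multi-GB files fully into memory like this):

    # Minimal B2 native upload: b2_authorize_account -> b2_get_upload_url -> POST.
    import hashlib
    import requests

    def b2_upload(key_id, app_key, bucket_id, path, name):
        auth = requests.get(
            "https://api.backblazeb2.com/b2api/v2/b2_authorize_account",
            auth=(key_id, app_key),
        ).json()

        upload = requests.post(
            auth["apiUrl"] + "/b2api/v2/b2_get_upload_url",
            headers={"Authorization": auth["authorizationToken"]},
            json={"bucketId": bucket_id},
        ).json()

        data = open(path, "rb").read()
        return requests.post(
            upload["uploadUrl"],
            headers={
                "Authorization": upload["authorizationToken"],
                "X-Bz-File-Name": name,   # should be URL-encoded for real use
                "Content-Type": "b2/x-auto",
                "X-Bz-Content-Sha1": hashlib.sha1(data).hexdigest(),
            },
            data=data,
        ).json()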
They give you 1GB of outgoing bandwidth free per day. If you put them behind Cloudflare, you effectively get your bandwidth for free. So half a cent per GB per month is all you pay - plus your time building out the integration instead of using S3.
I'm guessing the payback time on this is fairly long once you factor in the cost of moving all the backups. S3 One Zone is pretty competitive against Backblaze.
Cloudflare is betting on most of their customers serving HTML, not large uncacheable blobs; that "Bandwidth Alliance" will disappear pretty quickly at any sort of scale.
I can't speak for Cloudflare, but we've talked to them pretty extensively about the whole project, and we even did a case study with them about our use of the Bandwidth Alliance as we switched cloud providers. Things may change in the future, of course, but they very much encouraged what we were doing. https://www.cloudflare.com/case-studies/nodecraft-bandwidth-...
Would you consider open sourcing the micro service you wrote to perform the migration? I could see it being helpful to others interested in migrating from S3 to B2.
This request can be disregarded. I’m going to explore extending s3proxy for the same purpose (migration and backfill of disparate object storage systems through an abstraction layer).
If this is purely backup data, wouldn't glacier be a better fit than either S3 or B2?
Glacier is already cheaper than B2 and has the advantage of storing data redundantly across data centers. And Glacier Deep Archive is four times cheaper than that.
They're "backups" in the sense of storing customer data, but realistically they're more like instance snapshots for various customer game servers. They're created and restored multiple times a day for every customer as they hop between games (accessed very frequently), so Glacier wouldn't be a good fit.
B2 didn't use to span multiple ADs, or at least not ADs spread over a wider geographic area like AWS's regions. I'm not sure whether that's changed. It used to be part of why B2 could be cheaper: you were paying for less durability and availability.
S3 storage is redundant over an entire region. Entire availability zones can go down (extremely improbable) without your data being put at risk.
It, therefore, struck me as odd to see:
> Due to S3 and B2 being at least nearly equally* accessible, reliable, available, as well as many other providers, our primary reason for moving our backups now became pricing.
but then ...
> * Science is hard, blue keys on calculators are tricky, and we don’t have years to study things before doing them
WTF? I hope they're trying to be joking here, but it comes across somewhat as "we don't give a crap about our customers' data".
I was trying to interject some humor into a dry process. When I asked a couple of peers from my storage background how they would do this project, they both said the same thing: 1) form a working group, 2) do a study, 3) test, 4) 90 days later, do an analysis. Epic process.
We care a lot, but that was beyond our scope. This was an all-hands-reviewed process, and like machines built to be ultimately reliable - hospital generators, human-rated spacecraft equipment, airplane engines for single-engine remote-location flying (the PT6) - our process was deliberately detuned for speed in favor of overall service reliability and minimal impact.
We tested, we analyzed, we saw weird failures (a Python core dump?) in simple code, and we followed up on all of them before proceeding. We agonized.
In the end we were satisfied, because our customers couldn't tell anything had changed and our CSRs got zero new tickets. Not bragging at all - relieved, of course - but we care about the CSRs' load and work experience too.
Thanks for checking the article out, and let me know if we can show you more about our service.
I was pleasantly surprised to see that the previous rather crazy retrieval pricing model (price based on the peak retrieval bandwidth you used at any time in the current month) was replaced by a straight price-per-GB model a couple years back. That made my life a lot easier when pricing up Glacier vs B2 for a client.
I understand your main concern for moving was pricing, but developer hours also cost money. It seems like you had to invest many more developer hours than you would have if you had moved gradually over the course of a month or so (probably a week?).
We had to pick something that was "good enough" in compression time/size, and also easy for our customers to download and view on any OS if they wish. Zip being supported by every popular operating system - the average Windows user can just right-click -> extract - was the primary reason for the choice.
There are of course significantly faster and more efficient compression formats, like LZ4, which would be ideal if we were only using the data internally in managed environments, but we offer these backups as downloads to our users, some of whom aren't very technically inclined and still need to be able to access the files easily.
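For what it's worth, the zip side is just the standard library; a minimal sketch (the paths here are made up, not our actual layout):

    # Build a user-downloadable backup as a plain zip so any OS can open it
    # with the built-in extractor.
    import os
    import zipfile

    def zip_directory(src_dir, dest_zip):
        with zipfile.ZipFile(dest_zip, "w", compression=zipfile.ZIP_DEFLATED) as zf:
            for dirpath, _dirs, files in os.walk(src_dir):
                for name in files:
                    full = os.path.join(dirpath, name)
                    # store paths relative to the server root so the archive is tidy
                    zf.write(full, arcname=os.path.relpath(full, src_dir))

    zip_directory("/srv/instances/example-server", "/tmp/example-server-backup.zip")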
Unfortunately... yes. They had Europe coming "really soon" when we were talking to them, which we're excited about, but currently it's a single location on the west coast of the US.
We do use Cloudflare, but a lot of the instance backups we store are multiple GBs, and Cloudflare doesn't cache those. Not to mention that uploads from regions like Singapore can be very slow all the way to the US.
It's not a deal-breaker for us, but we're very much looking forward to when they can support more regions.
https://aws.amazon.com/s3/pricing/
I'd be pretty nervous about hitting the delete button after the data transfer.