
I used to run a free webmail service back in '99-2000. It was the first system of that kind of scale I'd worked on: about 1.5 million user accounts, large by the standards of the day. (Today I have more storage in my home fileserver; heck, I've got almost as much storage in my laptop.) Though I wasn't in charge of ordering hardware, I was just as oblivious to this problem as the guy who was.

I learned in a way that ensures this advice is burned into my memory forever:

It was when IBM had one of their worst ever manufacturing problems for one of their drive ranges.

While the IBM distributor we dealt with was very fast at turning around replacement drives, we had some nerve-racking weeks after the second drive in one of our arrays failed, only something like 6-9 months after we went live, and we found out about the problem.

They all failed one after the other within a week or two of each other. Every drive in our main user mailbox storage array...

Thankfully for us, the gap between failures was long enough that the array had time to rebuild, and then some, before the next drive went. But we spent a disproportionate amount of time babysitting backups and working on contingency plans because we'd made that stupid mistake.

(And I'll never ever build out a single large array, or any number of other things. It scared the hell out of me, and made me spend a lot of time thinking about and reading up on redundancy and disaster recovery strategies; it was mostly luck that prevented us from losing a substantial amount of data.)


