Wear patterns are an issue with flash, although rotational drives fail too. There are several answers. When a flash drive fails, you can still read the data. With a clustered database and multiple copies of the data, you gain reliability – RAID at the server level. As drives fail, you replace them.
Unlike magnetic disks, SSDs have a tendency to fail at a really predictable rate. So predictably that if you've got two drives of the same model, put them into commission at the same time, and subject them to the same usage patterns, they will probably fail at about the same time. That's a real problem if you're using SSDs in a RAID array, since RAID's increased reliability relies on the assumption that it's very unlikely for two drives to fail at about the same time.
With an SSD, though, once one drive goes there's a decent (perhaps small, but far from negligible) chance that a second drive will go out before you've had a chance to replace the first one. Which makes things complicated, but is much better than the similarly likely scenario that a second SSD fails shortly after you replace the first one. Because then it's possibly happening during the rebuild, and if that happens then it really will bring down the whole RAID array.
That said, if you're careful then that predictability should be a good thing. A good SSD will keep track of wear for you. So all you've got to do is monitor the status of the drives, and replace them before they get too close to their rated lifespan. If you add that extra step you're probably actually improving your RAID's reliability. But if you treat your RAID as if SSDs are just fast HDDs, you're asking for trouble.
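To give a rough sketch of what that monitoring step could look like (assuming Linux, smartctl available, and an Intel-style Media_Wearout_Indicator attribute – other vendors report wear under different attribute names):

```python
# Rough sketch: flag SSDs approaching their rated wear. Assumes smartctl is
# installed and the drive exposes an Intel-style Media_Wearout_Indicator
# (normalized value counts down from 100). Other vendors use different names.
import subprocess

WEAR_FLOOR = 20  # replace drives once the indicator drops below this

def wearout_value(device):
    """Return the normalized Media_Wearout_Indicator value, or None."""
    out = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    ).stdout
    for line in out.splitlines():
        if "Media_Wearout_Indicator" in line:
            # fourth column of `smartctl -A` output is the normalized value
            return int(line.split()[3])
    return None

for dev in ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]:
    value = wearout_value(dev)
    if value is not None and value <= WEAR_FLOOR:
        print(f"{dev}: wear indicator at {value}, schedule a replacement")
```

In practice you'd drive this from your monitoring system rather than a one-off script, but the point is that the wear figure is sitting right there in the SMART data.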
> That said, if you're careful then that predictability should be a good thing.
Yes, it's a very good thing. In a high-end SSD storage system, you predict failures early enough based on how many drives there are, what their current wear is, what type they are (SLC, eMLC, cMLC), etc. Then you phone home and have a drive delivered before the user even sees a disk failure.
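The arithmetic behind that phone-home decision is simple; a toy sketch (all the numbers below are made up) looks roughly like this:

```python
# Toy version of the phone-home arithmetic (all numbers are made up):
# project days until a drive hits its rated wear, and order a spare while
# there is still more lead time than that remaining.
SHIPPING_LEAD_TIME_DAYS = 7

def days_until_worn_out(wear_pct_used, wear_pct_per_day):
    """Linear projection from current wear and the observed wear rate."""
    remaining = 100.0 - wear_pct_used
    return remaining / wear_pct_per_day

# (drive id, % of rated P/E cycles consumed, % consumed per day)
fleet = [("ssd0", 91.0, 0.30), ("ssd1", 42.0, 0.25), ("ssd2", 97.5, 0.40)]

for drive, used, rate in fleet:
    if days_until_worn_out(used, rate) <= SHIPPING_LEAD_TIME_DAYS * 2:
        print(f"{drive}: order replacement now, swap before it wears out")
```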
With HDDs, the failure rate is so random that the disk completely failing is the signal that gets a replacement drive into the enclosure. S.M.A.R.T.-type alert systems have been epic failures (too little info, too late). The difference is that, because the RAID rebuild has a lower probability of failure (let's set aside multiple UREs for a second), you can count on the MTTF of the next drive failure being longer than the time it takes to get a drive out there.
However, this is not much of a guarantee, so most people crazily over-provision their storage.
SSDs let you predict this, thus provision correctly, and choose how to replace the drives with the least impact on the customer. It's win-win to have predictable failure. I don't understand people who say otherwise.
It's hard to predict without SMART features to measure the current wear state. Write amplification from the file system, and from the drive itself if you're not using large block writes, means you can't just calculate - you have to measure.
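Concretely, the measurement is just the ratio of NAND writes to host writes over a sampling window – the catch is that the counters live in vendor-specific SMART attributes, so the names and units here are assumptions rather than a standard:

```python
# Observed write amplification over a sampling window: NAND writes divided
# by host writes. Counter names and units are vendor-specific (assumed here);
# many drives expose both as SMART raw values in 32 MiB units or LBAs written.
def write_amplification(host_writes_start, host_writes_end,
                        nand_writes_start, nand_writes_end):
    host = host_writes_end - host_writes_start
    nand = nand_writes_end - nand_writes_start
    if host == 0:
        return None  # nothing written by the host in this window
    return nand / host

# Example: host wrote 1,000 units while the flash absorbed 3,400 units
print(write_amplification(50_000, 51_000, 120_000, 123_400))  # -> 3.4
```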
To extend on your comment: It's not possible to predict individual drive failure with reasonable accuracy. It's disconcerting to see the parent comment suggesting this still sits at the top of the thread.
You can roughly predict the longest possible lifespan for an SSD under a given workload. Regardless, a significant percentage of drives will still die earlier than that.
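By "roughly predict" I mean an upper bound along these lines, where the capacity, P/E rating, workload and write amplification are all numbers you have to assume or measure:

```python
# Upper bound on SSD lifespan under a steady workload. Every input here is
# an assumption you plug in; real drives can (and do) die earlier than this.
def max_lifespan_years(capacity_gb, pe_cycles, host_gb_per_day,
                       write_amplification):
    total_endurance_gb = capacity_gb * pe_cycles
    nand_gb_per_day = host_gb_per_day * write_amplification
    return total_endurance_gb / nand_gb_per_day / 365.0

# e.g. 400 GB of MLC rated for 3,000 P/E cycles, 500 GB/day host writes, WA of 3
print(round(max_lifespan_years(400, 3000, 500, 3.0), 1))  # ~2.2 years
```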
I'm only saying this from experience developing storage systems that are yet unreleased. You can predict the lifespan of the SSD in the storage system if you give up many of the functions of the SSD controller and put them in software RAID.
If you're talking about most incredibly naive SSD storage systems available today (excluding violin memory and maybe xtreme/pure), then I agree with you.
I'm sorry, I didn't mean you'd calculate some value once for all the drives. It's definitely something you measure over the lifespan of the SSD itself with the rest of your QoS subsystem.
Reads, program/erases, controller ecc/ read disturb management, the g/p list mapping of the blocks... This all has to be taken into account in a dynamic way. And yes, some people are doing this at a higher level than the SSD controller.
I bet you could set up your drives to fail in a set pattern. Let's say you had 4 drives in a RAID-10. If they were all fresh, swap out the first 2 drives when they are at 50% wear... then from then on you could swap them out back and forth as they approach 100% wear.
Since the lifespan of some of the drives (Intel and Samsung, I believe) is reported in the SMART data, you could easily do this.
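A crude sketch of that staggering logic, assuming each drive reports a percentage-used wear figure via SMART:

```python
# Crude sketch of staggering wear in a 4-drive RAID-10: pre-emptively swap
# one drive of each mirror pair around 50% wear whenever the pair is wearing
# in lockstep, so its members never approach end-of-life together. Any drive
# near its rated limit gets flagged regardless. Wear figures are assumed to
# come from SMART (e.g. a percentage-used style attribute).
REPLACE_EARLY_AT = 50   # % wear to break up a pair wearing in lockstep
REPLACE_LATE_AT = 90    # % wear at which any drive gets replaced

def drives_to_swap(pairs):
    """pairs: list of ((name, wear%), (name, wear%)) mirror pairs."""
    swap = []
    for (a_name, a_wear), (b_name, b_wear) in pairs:
        older = (a_name, a_wear) if a_wear >= b_wear else (b_name, b_wear)
        if older[1] >= REPLACE_EARLY_AT and abs(a_wear - b_wear) < 10:
            swap.append(older[0])       # break the pair's lockstep wear
        for name, wear in ((a_name, a_wear), (b_name, b_wear)):
            if wear >= REPLACE_LATE_AT:
                swap.append(name)
    return sorted(set(swap))

print(drives_to_swap([(("sda", 52), ("sdb", 51)), (("sdc", 12), ("sdd", 93))]))
# -> ['sda', 'sdd']
```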
SSD drives with SLC memory (enterprise SSD) have 100,000 P/E cycles, so they should last a while unless you are writing just a massive amount of data. Anandtech had a nice little writeup about SLC vs MLC vs TLC memory a little bit ago:
Assuming this predictability is not a good idea, in my experience. SSDs fail in various ways; some may be predictable and some are completely unpredictable. It is also not true that an SSD failure means it simply goes into read-only mode. I've seen plenty of SSDs fail unexpectedly and become unreadable, returning sense key 0x4 (HARDWARE ERROR), with the only recourse being to ship them out.
The risk of correlated failures is indeed non-trivial in SSDs and plain RAID is riskier, be sure to keep a watchful eye on your arrays.
We've currently got about 3500 SSDs in production across our clusters. I worry about them deciding to all fail at once; so far the failures have been sporadic (about half of which leave the drive unusable).
Are they all the same model?
How long have they been running?
Is it a relatively similar load on all of them?
Can you share smart attributes for them? (in private if needed)
On the back burner I'm trying to create a disk survey project (http://disksurvey.org), and such information is of great interest to me.
Good point. I was only addressing the predictability that comes from the flash memory simply wearing out. There are plenty of other ways that drives can fail. But I don't think they introduce any new worries for RAID users the way that flash memory wearing out after so many writes does.
> So all you've got to do is monitor the status of the drives, and replace them before they get too close to their rated lifespan
Sorry, that is terrible advice. Do not do that.
In availability planning two is one and one is zero.
If you love your data you run your databases in pairs. If you really love your data you run them in triplets. This applies no matter what disk technology you're using.
Speculation about failure rates or failure prediction doesn't belong here. Your server can go up in flames at any time for a dozen reasons, the disks being only one of them.
Not only that - you make sure that critical components, like hard drives (rotational or SSD, doesn't matter), are from different manufacturers, or at least not from the same production run, or at the very least not put into use at the same time. Basic design flaws (intentional or not) that result in non-functional hardware tend to hit at the same time - so you really don't want all 12 drives, and the hard drives of the 3 replicas, to fail within the same week.
I used to run a free webmail service back in '99-2000. It was the first system of that kind of scale I'd worked on (about 1.5 million user accounts - large by the standards of the day - today I have more storage in my home fileserver; heck, I've got almost as much storage in my laptop), and though I wasn't in charge of ordering hardware, I was just as oblivious to this problem as the guy who was.
I learned in a way that ensures this advice is burned into my memory forever:
It was when IBM had one of their worst ever manufacturing problems for one of their drive ranges.
While the IBM distributor we dealt with was very fast at turning around replacement drives, we had some nerve-wracking weeks when the second drive in one of our arrays failed only something like 6-9 months after we went live, and we found out about the problem.
They all failed one after the other within a week or two of each other. Every drive in our main user mailbox storage array...
Thankfully for us, the gap was long enough between each failure that the array was rebuilt and then some in between each failure, but we spent a disproportionate amount of time babysitting backups and working on contingency plans because we'd made that stupid mistake.
(And I'll never ever build out a single large array, or any number of other thing - it made me spend a lot of time thinking about and reading up on redundancy and disaster recovery strategies, as it scared the hell out of me; it was mostly luck that prevented us from losing a substantial amount of data)
You're responding as if I had suggested that this is a replacement for all the other practices one should already be doing.
If I had, yes I would agree with you 100%. However, far from suggesting anything remotely like that, I made sure to work in the phrase "add that extra step." It's not a panacea, it's an additional thing that needs to be done to account for one new quirk that a particular technology throws into the mix.
> With an SSD, though, once one drive goes there's a decent (perhaps small, but far from negligible) chance that a second drive will go out before you've had a chance to replace the first one. Which makes things complicated, but is much better than the similarly likely scenario that a second SSD fails shortly after you replace the first one. Because then it's possibly happening during the rebuild, and if that happens then it really will bring down the whole RAID array.
I don't dispute that there's a slightly increased chance of concurrent disk failures with SSDs, but on what basis is a second failure before the rebuild any better than one during it?
Also, I'm guessing you're referring to RAID5, as RAID6 / RAID DP is immune to double-disk failure, and RAID 10 and 0+1 are more tolerant of it.
I wasn't aware of that... anyone know how that's handled when you're buying an SSD instance from say, Amazon? Do they predict failure and replace the drives before they go bad, or do you have to bake this into your deployment logic somehow?
But with the current SSD speed/size ratio this vulnerability window can be only a few minutes, and it can be minimized further by mixing batches and vendors of drives.
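For a sense of scale (the capacity and rebuild throughput here are assumptions; real controllers often throttle rebuilds under foreground load):

```python
# Ballpark rebuild window for an SSD RAID member. The capacity and sustained
# rebuild throughput are assumptions; real controllers often throttle
# rebuilds while still serving foreground I/O.
capacity_gb = 512
rebuild_mb_per_s = 400

minutes = capacity_gb * 1024 / rebuild_mb_per_s / 60
print(f"~{minutes:.0f} minutes to rebuild {capacity_gb} GB at {rebuild_mb_per_s} MB/s")
# -> ~22 minutes to rebuild 512 GB at 400 MB/s
```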
Hopefully... though speaking of speed, there's another pitfall to be aware of: at present, very few RAID controllers support the TRIM command. On one that doesn't, any SSDs plugged into it will slow down over time, perhaps becoming slower than magnetic disks.