Wear patterns are an issue with flash, although rotational drives fail too. There are several answers. When a flash drive fails, you can still read the data. With a clustered database and multiple copies of the data, you gain reliability – RAID at the server level. As drives fail, you replace them.
Unlike magnetic disks, SSDs have a tendency to fail at a really predictable rate. So predictably that if you've got two drives of the same model, put them into commission at the same time, and subject them to the same usage patterns, they will probably fail at about the same time. That's a real problem if you're using SSDs in a RAID array, since RAID's increased reliability relies on the assumption that it's very unlikely for two drives to fail at about the same time.
With an SSD, though, once one drive goes there's a decent (perhaps small, but far from negligible) chance that a second drive will go out before you've had a chance to replace the first one. Which makes things complicated, but is much better than the similarly likely scenario that a second SSD fails shortly after you replace the first one. Because then it's possibly happening during the rebuild, and if that happens then it really will bring down the whole RAID array.
That said, if you're careful then that predictability should be a good thing. A good SSD will keep track of wear for you. So all you've got to do is monitor the status of the drives, and replace them before they get too close to their rated lifespan. If you add that extra step you're probably actually improving your RAID's reliability. But if you treat your RAID as if SSDs are just fast HDDs, you're asking for trouble.
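To give a rough sketch of what that monitoring step could look like (assuming Linux, smartctl available, and an Intel-style Media_Wearout_Indicator attribute – other vendors report wear under different attribute names):

```python
# Rough sketch: flag SSDs approaching their rated wear. Assumes smartctl is
# installed and the drive exposes an Intel-style Media_Wearout_Indicator
# (normalized value counts down from 100). Other vendors use different names.
import subprocess

WEAR_FLOOR = 20  # replace drives once the indicator drops below this

def wearout_value(device):
    """Return the normalized Media_Wearout_Indicator value, or None."""
    out = subprocess.run(
        ["smartctl", "-A", device], capture_output=True, text=True
    ).stdout
    for line in out.splitlines():
        if "Media_Wearout_Indicator" in line:
            # fourth column of `smartctl -A` output is the normalized value
            return int(line.split()[3])
    return None

for dev in ["/dev/sda", "/dev/sdb", "/dev/sdc", "/dev/sdd"]:
    value = wearout_value(dev)
    if value is not None and value <= WEAR_FLOOR:
        print(f"{dev}: wear indicator at {value}, schedule a replacement")
```

In practice you'd drive this from your monitoring system rather than a one-off script, but the point is that the wear figure is sitting right there in the SMART data.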
> That said, if you're careful then that predictability should be a good thing.
Yes, it's a very good thing. In a high-end SSD storage system, you predict failures early enough based on how many drives there are, what their current wear is, what type they are (SLC, eMLC, cMLC), etc. Then you phone home and have a drive delivered before the user even sees a disk failure.
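The arithmetic behind that phone-home decision is simple; a toy sketch (all the numbers below are made up) looks roughly like this:

```python
# Toy version of the phone-home arithmetic (all numbers are made up):
# project days until a drive hits its rated wear, and order a spare while
# there is still more lead time than that remaining.
SHIPPING_LEAD_TIME_DAYS = 7

def days_until_worn_out(wear_pct_used, wear_pct_per_day):
    """Linear projection from current wear and the observed wear rate."""
    remaining = 100.0 - wear_pct_used
    return remaining / wear_pct_per_day

# (drive id, % of rated P/E cycles consumed, % consumed per day)
fleet = [("ssd0", 91.0, 0.30), ("ssd1", 42.0, 0.25), ("ssd2", 97.5, 0.40)]

for drive, used, rate in fleet:
    if days_until_worn_out(used, rate) <= SHIPPING_LEAD_TIME_DAYS * 2:
        print(f"{drive}: order replacement now, swap before it wears out")
```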
With HDDs, the failure rate is so random that the disk completely failing is the signal that gets a replacement drive into the enclosure. S.M.A.R.T.-type alert systems have been epic failures (too little info, too late). The difference is that, because the RAID rebuild has a lower probability of failure (let's set aside multiple UREs for a second), you can count on the MTTF of the next drive failure being longer than the time it takes to get a drive out there.
However, this is not much of a guarantee, so most people crazily over-provision their storage.
SSDs let you predict this, thus provision correctly, and choose how to replace the drives with the least impact on the customer. It's win-win to have predictable failure. I don't understand people who say otherwise.
It's hard to predict without SMART features to measure the current wear state. Write amplification from the file system, and from the drive itself if you're not using large block writes, means you can't just calculate - you have to measure.
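Concretely, the measurement is just the ratio of NAND writes to host writes over a sampling window – the catch is that the counters live in vendor-specific SMART attributes, so the names and units here are assumptions rather than a standard:

```python
# Observed write amplification over a sampling window: NAND writes divided
# by host writes. Counter names and units are vendor-specific (assumed here);
# many drives expose both as SMART raw values in 32 MiB units or LBAs written.
def write_amplification(host_writes_start, host_writes_end,
                        nand_writes_start, nand_writes_end):
    host = host_writes_end - host_writes_start
    nand = nand_writes_end - nand_writes_start
    if host == 0:
        return None  # nothing written by the host in this window
    return nand / host

# Example: host wrote 1,000 units while the flash absorbed 3,400 units
print(write_amplification(50_000, 51_000, 120_000, 123_400))  # -> 3.4
```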
To extend on your comment: It's not possible to predict individual drive failure with reasonable accuracy. It's disconcerting to see the parent comment suggesting this still sits at the top of the thread.
You can roughly predict the longest possible lifespan for an SSD under a given workload. Regardless, a significant percentage of drives will still die earlier than that.
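By "roughly predict" I mean an upper bound along these lines, where the capacity, P/E rating, workload and write amplification are all numbers you have to assume or measure:

```python
# Upper bound on SSD lifespan under a steady workload. Every input here is
# an assumption you plug in; real drives can (and do) die earlier than this.
def max_lifespan_years(capacity_gb, pe_cycles, host_gb_per_day,
                       write_amplification):
    total_endurance_gb = capacity_gb * pe_cycles
    nand_gb_per_day = host_gb_per_day * write_amplification
    return total_endurance_gb / nand_gb_per_day / 365.0

# e.g. 400 GB of MLC rated for 3,000 P/E cycles, 500 GB/day host writes, WA of 3
print(round(max_lifespan_years(400, 3000, 500, 3.0), 1))  # ~2.2 years
```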
I'm only saying this from experience developing storage systems that are yet unreleased. You can predict the lifespan of the SSD in the storage system if you give up many of the functions of the SSD controller and put them in software RAID.
If you're talking about most incredibly naive SSD storage systems available today (excluding violin memory and maybe xtreme/pure), then I agree with you.
I'm sorry, I didn't mean you'd calculate some value once for all the drives. It's definitely something you measure over the lifespan of the SSD itself with the rest of your QoS subsystem.
Reads, program/erases, controller ecc/ read disturb management, the g/p list mapping of the blocks... This all has to be taken into account in a dynamic way. And yes, some people are doing this at a higher level than the SSD controller.
I bet you could set up your drives to fail in a set pattern. Let's say you had 4 drives in a RAID-10. If they were all fresh, swap out the first 2 drives when they are at 50% wear... then from then on you could swap them out back and forth as they approach 100% wear.
Since the lifespan of some of the drives (Intel and Samsung, I believe) is reported in the SMART data, you could easily do this.
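A crude sketch of that staggering logic, assuming each drive reports a percentage-used wear figure via SMART:

```python
# Crude sketch of staggering wear in a 4-drive RAID-10: pre-emptively swap
# one drive of each mirror pair around 50% wear whenever the pair is wearing
# in lockstep, so its members never approach end-of-life together. Any drive
# near its rated limit gets flagged regardless. Wear figures are assumed to
# come from SMART (e.g. a percentage-used style attribute).
REPLACE_EARLY_AT = 50   # % wear to break up a pair wearing in lockstep
REPLACE_LATE_AT = 90    # % wear at which any drive gets replaced

def drives_to_swap(pairs):
    """pairs: list of ((name, wear%), (name, wear%)) mirror pairs."""
    swap = []
    for (a_name, a_wear), (b_name, b_wear) in pairs:
        older = (a_name, a_wear) if a_wear >= b_wear else (b_name, b_wear)
        if older[1] >= REPLACE_EARLY_AT and abs(a_wear - b_wear) < 10:
            swap.append(older[0])       # break the pair's lockstep wear
        for name, wear in ((a_name, a_wear), (b_name, b_wear)):
            if wear >= REPLACE_LATE_AT:
                swap.append(name)
    return sorted(set(swap))

print(drives_to_swap([(("sda", 52), ("sdb", 51)), (("sdc", 12), ("sdd", 93))]))
# -> ['sda', 'sdd']
```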
SSD drives with SLC memory (enterprise SSD) have 100,000 P/E cycles, so they should last a while unless you are writing just a massive amount of data. Anandtech had a nice little writeup about SLC vs MLC vs TLC memory a little bit ago:
Assuming this predictability is not a good idea, in my experience. SSDs fail in various ways; some may be predictable and some are completely unpredictable. It is also not true that an SSD failure means it simply goes into read-only mode. I've seen plenty of SSDs fail unexpectedly and become unreadable, returning sense key 0x4 (HARDWARE ERROR), with the only recourse being to ship them out.
The risk of correlated failures is indeed non-trivial in SSDs and plain RAID is riskier, be sure to keep a watchful eye on your arrays.
We've currently got about 3500 SSDs in production across our clusters. I worry about them deciding to all fail at once; so far the failures have been sporadic (about half of which leave the drive unusable).
Are they all the same model?
How long have they been running?
Is it a relatively similar load on all of them?
Can you share smart attributes for them? (in private if needed)
On the back burner I'm trying to create a disk survey project (http://disksurvey.org), and such information is of great interest to me.
Good point. I was only addressing the predictability that comes from the flash memory simply wearing out. There are plenty of other ways that drives can fail. But I don't think they introduce any new worries for RAID users the way that flash memory wearing out after so many writes does.
> So all you've got to do is monitor the status of the drives, and replace them before they get too close to their rated lifespan
Sorry, that is terrible advice. Do not do that.
In availability planning two is one and one is zero.
If you love your data you run your databases in pairs. If you really love your data you run them in triplets. This applies no matter what disk technology you're using.
Speculation about failure rates or failure prediction doesn't belong here. Your server can go up in flames at any time for a dozen reasons, the disks being only one of them.
Not only that - you make sure that critical components, like hard drives (rotational or SSD, doesn't matter), are from different manufacturers, or at least not from the same production run, or at the very least not put into use at the same time. Basic design flaws (intentional or not) that result in non-functional hardware tend to hit at the same time - so you really don't want all 12 drives, and the hard drives of the 3 replicas, to fail within the same week.
I used to run a free webmail service back in '99-2000. It was the first system of that kind of scale I'd worked on (about 1.5 million user accounts - large by the standards of the day - today I have more storage in my home fileserver; heck, I've got almost as much storage in my laptop), and though I wasn't in charge of ordering hardware, I was just as oblivious to this problem as the guy who was.
I learned in a way that ensures this advice is burned into my memory forever:
It was when IBM had one of their worst ever manufacturing problems for one of their drive ranges.
While the IBM distributor we dealt with was very fast at turning around replacement drives, we had some nerve-wracking weeks when the second drive in one of our arrays failed only something like 6-9 months after we went live, and we found out about the problem.
They all failed one after the other within a week or two of each other. Every drive in our main user mailbox storage array...
Thankfully for us, the gap was long enough between each failure that the array was rebuilt and then some in between each failure, but we spent a disproportionate amount of time babysitting backups and working on contingency plans because we'd made that stupid mistake.
(And I'll never ever build out a single large array, or any number of other thing - it made me spend a lot of time thinking about and reading up on redundancy and disaster recovery strategies, as it scared the hell out of me; it was mostly luck that prevented us from losing a substantial amount of data)
You're responding as if I had suggested that this is a replacement for all the other practices one should already be doing.
If I had, yes I would agree with you 100%. However, far from suggesting anything remotely like that, I made sure to work in the phrase "add that extra step." It's not a panacea, it's an additional thing that needs to be done to account for one new quirk that a particular technology throws into the mix.
> With an SSD, though, once one drive goes there's a decent (perhaps small, but far from negligible) chance that a second drive will go out before you've had a chance to replace the first one. Which makes things complicated, but is much better than the similarly likely scenario that a second SSD fails shortly after you replace the first one. Because then it's possibly happening during the rebuild, and if that happens then it really will bring down the whole RAID array.
I don't dispute that there's a slightly increased chance of concurrent disk failures with SSDs, but on what basis is a second failure before the rebuild any better than one during it?
Also, I'm guessing you're referring to RAID5, as RAID6 / RAID DP is immune to double-disk failure, and RAID 10 and 0+1 are more tolerant of it.
I wasn't aware of that... anyone know how that's handled when you're buying an SSD instance from say, Amazon? Do they predict failure and replace the drives before they go bad, or do you have to bake this into your deployment logic somehow?
But with the current SSD speed/size ratio this vulnerability window can be only a few minutes, and it can be minimized further by mixing batches and vendors of drives.
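For a sense of scale (the capacity and rebuild throughput here are assumptions; real controllers often throttle rebuilds under foreground load):

```python
# Ballpark rebuild window for an SSD RAID member. The capacity and sustained
# rebuild throughput are assumptions; real controllers often throttle
# rebuilds while still serving foreground I/O.
capacity_gb = 512
rebuild_mb_per_s = 400

minutes = capacity_gb * 1024 / rebuild_mb_per_s / 60
print(f"~{minutes:.0f} minutes to rebuild {capacity_gb} GB at {rebuild_mb_per_s} MB/s")
# -> ~22 minutes to rebuild 512 GB at 400 MB/s
```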
Hopefully... though speaking of speed, there's another pitfall to be aware of: at present, very few RAID controllers support the TRIM command. On one that doesn't, any SSDs plugged into it will slow down over time, perhaps becoming slower than magnetic disks.