NAND Flash: Dealing with a Flawed Medium (cushychicken.github.io)
50 points by cushychicken on Aug 23, 2015 | 15 comments


I personally hate NAND. I do embedded stuff all day long, and you spend more time dealing with NAND issues than anything else in the development of a product.

Not only does it ship to you with bad blocks and fail if you write to it, it will also fail /by itself/ if left alone, so even a read-only filesystem is not safe until you implement really paranoid duplication of pretty much everything. You can't JUST rely on ECC; you need duplication to allow the system to continue working. ECC will just tell you 'He's dead, Jim', and that doesn't help if you're a production device.
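A minimal sketch of the "duplicate everything" idea, assuming each copy is stored alongside a CRC32 of its payload. The function name and the (payload, stored_crc) layout are hypothetical and only illustrate falling back to a second copy when the first one is bad; a real device would do this on top of the NAND controller's ECC, not instead of it.

```python
import zlib


def read_with_fallback(copies):
    """Return the first payload whose CRC32 matches its stored checksum.

    `copies` is an iterable of (payload, stored_crc) pairs, one per
    redundant copy. This layout is an assumption for illustration.
    """
    for payload, stored_crc in copies:
        if zlib.crc32(payload) == stored_crc:
            return payload
    # Every copy failed verification: nothing left to fall back to.
    raise IOError("all redundant copies failed verification")


# Usage: the first copy has a corrupted byte, the second is intact.
good = b"bootloader image"
crc = zlib.crc32(good)
copies = [(b"bootloaderXimage", crc), (good, crc)]
recovered = read_with_fallback(copies)
```

The point of the design is that a detected failure in one copy becomes a recoverable event rather than a dead device.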

Also, from my experience, newer NAND seems to nuke erase blocks in /bunches/, not just one at a time, so if you lose one, more often than not you lose 8+ in a row. Also, erase blocks are getting bigger, so you lose a hell of a lot more every time you lose a block.

Add the bloated file systems in Linux (the JFFS2 scan at mount time can easily take 30s!), and you end up wondering if it's really worth the trouble.

These days I always propose using a microSD card (of quality) for any large storage need, as it's a lot easier to replace, and a SPI NOR flash for the system if at all possible (it's horrendously slow to erase/write, but at least it's stable).


Having seen a lot of similar issues, I can say I feel your pain. It can be really frustrating to get to the root of these issues, and when you do, management is rarely willing to accept the answer of "We've done all we can, the root of the problem is a device flaw we can't change."

How recent a JFFS2 version are you using? They've implemented some block tables in newer versions that speed up mount time a lot, if my understanding is correct. Also, I would highly recommend trying out UBIFS if you get the chance - it's JFFS2's successor, and well implemented.


Until very recently I wrote firmware for SSD controllers. In addition to the error correction mentioned in the article, we also used RAIN (Redundant Array of Independent NAND) as yet another data protection measure. You can read more about it here: https://www.micron.com/~/media/documents/products/technical-...
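For readers unfamiliar with the idea, here is a rough sketch of parity-based redundancy across NAND pages, assuming a simple RAID-5-style XOR scheme. The thread does not describe how any actual controller implements RAIN, so the function names and the single-parity layout below are illustrative assumptions only.

```python
def xor_pages(pages):
    """XOR a list of equal-length byte strings together."""
    if not pages:
        raise ValueError("need at least one page")
    length = len(pages[0])
    if any(len(p) != length for p in pages):
        raise ValueError("all pages must be the same length")
    out = bytearray(length)
    for page in pages:
        for i, byte in enumerate(page):
            out[i] ^= byte
    return bytes(out)


def make_parity(data_pages):
    """Parity page for a stripe of data pages (one page per die, assumed)."""
    return xor_pages(data_pages)


def rebuild_page(surviving_pages, parity):
    """Recover the single missing page from the survivors plus parity."""
    return xor_pages(list(surviving_pages) + [parity])


# Usage: lose the middle page of a three-page stripe, then rebuild it.
stripe = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]
parity = make_parity(stripe)
rebuilt = rebuild_page([stripe[0], stripe[2]], parity)
```

With one parity page per stripe, this can only recover from the loss of a single page in that stripe; it complements per-page ECC rather than replacing it.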


I'd love to learn more about SSD firmware. Do any companies publish their SSD NAND management algorithms?

Unrelated: did you work at Micron? I was a Micron DRAM PE for a year.


Companies generally do not publish their algorithms, unless they are filing for a patent. I came across this great blog post a while back that will give you a pretty solid foundation for NAND management: http://codecapsule.com/2014/02/12/coding-for-ssds-part-1-int.... You can google more if you find any of the sections interesting.

Yes, I did work for Micron! I'm doing data science now, though, so I'm not currently in the storage industry.


Cool, I'll definitely read this when I get home! Would love to compare notes against what I know regarding embedded NAND techniques.

The semiconductor industry really isn't a bad jumping off point for data scientists. Most of the work I did in PE was statistical analysis of DRAM module tests and trying to hunt down trends in test data to weed out failures earlier. Having an environment where you can set up experiments with a million data points is pretty sweet in its own way.


My uncle started and sold a few companies in the NAND Flash testing market for manufacturers in the mid-90s and early 2000s. I'm having dinner with him this evening - any questions I can query him with about the state of the art?

The statement: "Since most NAND manufacturers allow themselves to ship Flash chips with a certain number of bad blocks, it is important to scan a chip for bad blocks before the initial programming occurs." makes me wonder how much variation different manufacturers allow, and how much better a piece of hardware that costs significantly more for being "brand-name" really is than the almost useless junk you can pick up for a few pennies a GB in China.


Cool! Was he building embedded testers, or commercial testers for semiconductor vendors? I would imagine that a lot has changed since then, especially with the shrinking process size and corresponding increase in retention failures.

In my experience, most of the reputable vendors like Micron, Samsung, and Spansion are pretty up front with how many bad blocks they allow their devices to ship with. The Micron device I've worked with most in designs allows 80 bad blocks per unit at shipment - that's about 2% of total blocks. I've never seen a brand new chip with that many bad blocks straight off the bat. (Two bad blocks in a brand new chip is unusual.)
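As a rough illustration of the initial bad-block scan and the numbers above: the sketch below assumes the common convention that a factory-marked bad block has a non-0xFF marker byte in its spare area, and assumes a 4096-block device (which is what "80 blocks is about 2%" implies, not something stated in the thread). `read_marker` is a hypothetical callback standing in for whatever the real driver provides; the marker location varies by part, so check the datasheet.

```python
def scan_bad_blocks(read_marker, num_blocks):
    """Return block numbers whose factory marker is not 0xFF.

    `read_marker(block)` is a hypothetical callback that returns the
    bad-block marker byte for the given block.
    """
    return [b for b in range(num_blocks) if read_marker(b) != 0xFF]


def within_factory_limit(bad_blocks, max_bad=80):
    """True if the part is within the vendor's shipped bad-block budget."""
    return len(bad_blocks) <= max_bad


def bad_block_fraction(num_bad, num_blocks):
    """Fraction of the device's blocks that are bad."""
    return num_bad / num_blocks


# Usage with a fake device: blocks 5 and 100 are marked bad.
fake_markers = {5: 0x00, 100: 0x00}
bad = scan_bad_blocks(lambda b: fake_markers.get(b, 0xFF), 4096)
worst_case = bad_block_fraction(80, 4096)  # 80 / 4096, just under 2%
```

The scan has to happen before the first erase or program, because erasing a block can wipe the factory marker.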

Most of the big vendors moving to support the ONFI standard has actually had a pretty interesting effect - since NAND is in such demand and has a common interface, it's become pretty commoditized. If you want to sell chips, you have to abide by ONFI. If you do that, and your prices are competitive, your products have been just about guaranteed to sell. As a rule of thumb, however, I would agree with your sentiment. Vendors who won't share raw bit error rates or uncorrectable bit error rates are generally not worth the time of day. That's part of the reason I'm so fond of Micron and have cited them a bunch in these articles - they are very forthcoming with data about how their parts work, and how to put them into systems in such a way that they'll work over your device lifetime.


Completely agree, the industry is very commoditized and, like all semis, very cyclical. Build new capacity, buy test, dry spell, repeat. Had to fill out the bottom line with CMOS testing, etc.

He was building commercial testers, Sytest and then Nextest, and given the cyclical nature, the consolidation in the industry circa 2008 when Teradyne swallowed Eagle & Nextest makes sense. I'll read up on ONFI - any specifics you're curious about I can ask him directly...


Does he have any sort of insight on the early washout rates of the chips they ship? I dunno if you saw the recent Carnegie Mellon/Facebook paper about NAND retention, but a major finding of that is that there's definitely a "second bathtub" in NAND devices that ship, but don't live for very long in the field. Other than that, though, most of my interest in NAND starts long after they pass out of your uncle's machines and their descendants. :)

Unrelated - I work with a bunch of former Teradyne employees.


The bad blocks are isolated from each other, so the chips are perfectly good. The NAND manufacturers bin the chips by number of defects, so you can buy ones with more defects and, if your firmware allows, run them at a lower reported capacity. For example, an 8GB chip with 1GB of defects could be used for a 4GB flash drive with 3GB of spare blocks. This is actually how many of the binned parts are utilized.
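The arithmetic in that example can be written out as a small helper. This is only a sketch of the bookkeeping; the function name is made up, and real firmware tracks spares per block rather than in whole gigabytes.

```python
def spare_capacity_gb(raw_gb, defective_gb, advertised_gb):
    """Spare capacity left after removing defects and the reported size."""
    usable = raw_gb - defective_gb
    if advertised_gb > usable:
        raise ValueError("advertised capacity exceeds usable capacity")
    return usable - advertised_gb


# The example above: 8GB raw, 1GB defective, sold as a 4GB drive.
spare = spare_capacity_gb(8, 1, 4)
```

Those spare blocks are what the firmware draws on when blocks wear out in the field.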


Shenzhen is awash with this stuff... you can buy a 16GB keychain for $1... however, you're lucky if it works.


I know enough folks working at semiconductor vendors to say with certainty that their freebie Flash drives (and whatever data you stick in them) are not long for this world. Most of the chips are infant mortalities that will live just long enough to make you think good things about getting free stuff.


There are only a few fabrication plants in the world that make NAND at the current node sizes.

I do wonder if the major manufacturers dump chips that don't meet QC onto no-name companies. Intel does it in house under the Celeron brand. That is common in other industries too; Milwaukee's Best beer is just the bad batches of Miller.


> There are only a few fabrication plants in the world that make NAND at the current node sizes.

That comment kind of assumes that making NAND as small as possible at the silicon feature level is desirable. That's not necessarily the case - in some cases, data retention is preferable to storage density. Larger feature sizes make NAND much less susceptible to retention failures and read/program disturb failures that are no big deal in SSDs, but are a huge deal in embedded applications.



