
Does anyone find value in SMART metrics?

In my experience, the drives report "healthy" until they fail, then they report "failed".

I've personally never tracked the detailed metrics to see if anything is predictive of impending failure, but I've never seen the overall status be anything but "healthy" unless the drive had already failed.



The SMART metrics aren't binary, and any application that presents them as binary (either HEALTHY or FAILED) is doing you a disservice.

> I've personally never tracked the detailed metrics to see if anything is predictive of impending failure

Backblaze has!

https://www.backblaze.com/blog/hard-drive-smart-stats/


From that link:

From experience, we have found the following five SMART metrics indicate impending disk drive failure:

    SMART 5: Reallocated_Sector_Count.
    SMART 187: Reported_Uncorrectable_Errors.
    SMART 188: Command_Timeout.
    SMART 197: Current_Pending_Sector_Count.
    SMART 198: Offline_Uncorrectable.

That's good to know, I might start tracking that. I manage several clusters of servers, and hard drive failures just seem pretty random.
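
For anyone who wants to start tracking those five, here is a minimal sketch in Python that scans the attribute table printed by `smartctl -A` and reports which of the five IDs have a non-zero raw value. The sample table is made up for illustration, the helper name `nonzero_watched` is hypothetical, and treating "raw value above zero" as the warning condition is an assumption for this sketch, not a rule taken from the Backblaze post.

```python
# IDs of the five attributes listed above.
WATCHED_IDS = {5, 187, 188, 197, 198}


def nonzero_watched(smartctl_output: str) -> dict[int, int]:
    """Return {attribute_id: raw_value} for watched attributes with raw > 0."""
    flagged = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns;
        # the header row ("ID# ATTRIBUTE_NAME ...") is skipped by this check.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        attr_id = int(fields[0])
        if attr_id not in WATCHED_IDS:
            continue
        raw = fields[9]  # first token of the RAW_VALUE column
        if raw.isdigit() and int(raw) > 0:
            flagged[attr_id] = int(raw)
    return flagged


# Made-up sample in the column layout of `smartctl -A` (assumption: ATA drive).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  1 Raw_Read_Error_Rate     0x000f   117   099   006    Pre-fail  Always       -       123456
  5 Reallocated_Sector_Ct   0x0033   100   100   010    Pre-fail  Always       -       8
187 Reported_Uncorrect      0x0032   100   100   000    Old_age   Always       -       0
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
"""

print(nonzero_watched(SAMPLE))
```

In practice you would feed it the real output of `smartctl -A` for each drive. Not every drive reports all five attributes, so a missing ID just means "not reported", not "healthy".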


I've had several hard drives that started with a gradually increasing reallocated sector count, then began reporting uncorrectable errors, then eventually just gave up the ghost. Usually, once reallocated sectors start climbing, a drive is nearing death and should be replaced as soon as possible. You might not have had corruption yet, but it's coming. Once you get UREs, you've lost some data.

However, one time a drive got a burst of reallocated sectors, then stabilized and didn't have any problems for a long time. Years later, it simply wouldn't power on.
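
The difference between "climbing" and "one burst, then stable" can be checked with a very small sketch, assuming you log the raw reallocated sector count at regular intervals. The function name `is_climbing` and the three-reading window are arbitrary choices for illustration, not an established rule.

```python
def is_climbing(readings: list[int], window: int = 3) -> bool:
    """True if the count rose within the last `window` readings."""
    recent = readings[-window:]
    # Need at least two readings to compare; a flat tail means "stabilized".
    return len(recent) >= 2 and recent[-1] > recent[0]


# Hypothetical logs of raw reallocated sector counts, oldest first.
burst_then_stable = [0, 0, 8, 8, 8]
steady_climb = [0, 2, 5, 9]

print(is_climbing(burst_then_stable))  # False
print(is_climbing(steady_climb))       # True
```

A drive in the first category can still die later, as the one above did, so this only separates "replace now" from "keep watching".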


Absolutely. I've looked at the SMART data of easily over 1000 drives. Many of them ok, many of them with questionable health, many failing and many failed. The SMART data has always been a valuable indicator as to what's going on. You need to look at the actual values given by tools like smartctl or CrystalDiskInfo. Everything you need to evaluate the state of your drives is there.

I've never seen an HDD fail overnight without any indication at all.


I've had an M.2 NVMe drive start reporting bad blocks via SMART. I kept using it for non-critical storage, but replaced it as my boot drive. Obviously not the same failure pattern as spinning rust, but I was glad for the early warning anyway.



