As a quick refresher before I make some recommendations to a customer, I got to revisit the discussion of hard drives and how they like to fail.
Three articles on the StorageMojo blog are brilliant and worth a read or a revisit:
- NetApp Weighs In On Disks – Val Bercovici of NetApp weighs in on the disk failure surveys from Google and CMU’s Parallel Data Lab. Beware some marketing spin mixed in with some real gems of knowledge.
- Google’s Disk Failure Experience: Google’s analysis of failure rates among 100,000 drives. It’s worth noting that Google uses cheap SATA drives almost exclusively in the cloud.
- Everything You Know About Disks Is Wrong: The aforementioned CMU study, which has a nice comparison between cheapo SATA disks and fancy FC disks.
There are some critically important takeaways in these articles and the surrounding commentary that we must never forget when safeguarding our data. The most important points:
- MTBF is a nearly irrelevant number. Storage experts contend that Mean Time Between Failures is actually much closer between “consumer” SATA drives and “enterprise” FC drives than the marketing people want you to believe.
- There is a huge amount of magic and complexity happening inside of every hard drive: Almost all of it is masked from users. Hard drive controllers and their firmware have gotten insanely complex to keep up with the growing number of failure and error scenarios as disks get denser. The difference between two disk firmware revisions or code branches can have a considerable impact on drive reliability.
- The REAL difference between the low and high end is RAID controller and drive firmware smarts: Why do disks and RAID controllers from storage vendors cost so much more? Because those storage vendors are on the hook for a complete product, and put engineering time into changing the disk behaviour by customizing drive firmware and pre-qualifying drives. They also know that SMART is a sham, predicting only a small fraction of disk failures. The secret sauce is performing advanced failure profiling on the RAID controller, and coordinating it with fully understood and tuned drive firmware. This is the true difference between your cheapo Promise-variety RAID setups and those from NetApp, Hitachi, EMC and HP.
- Drive failure rates do go up with age: Heavily used drives tend to fail either in the first three months or with steadily increasing risk beyond the three-year mark. We’ve seen this in the field as well: even early $1000/drive 300GB FC disks had a 10% failure rate in the first three months once you started working them hard. Drives older than three years have typically been spinning, without stopping, for that entire period. Bad Things are known to happen when you spin them down, let them cool, and then try to spin them up again.
- Background media scanning is the best way to detect drive failures: Does your RAID controller perform background media scanning, or at least a full array consistency check on a regular basis? Great! Because THIS is the real way to predict disk failure: by monitoring and trending subtle disk errors (bad sectors, etc.) rather than waiting for SMART to (not) predict a major mechanical failure. As noted above, SMART is mostly useless. Hopefully you’re not feeling quite so good about that “S.M.A.R.T. Status OK” message anymore.
- RAID 5 is more harmful than you think: Numbers are starting to come out about how often double failures happen in RAID 5 arrays. It’s quite disturbing: you run a HUGE risk of data loss during the high-activity array rebuild after your first disk has failed. And, as pointed out, too many people rely on RAID 5 as a backup solution. So let’s say it again: RAID 5 is NOT a backup solution, and never will be.
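To put the MTBF point in perspective, a datasheet MTBF claim can be converted into an expected annualized failure rate (AFR) and compared against what the surveys actually observed. A minimal sketch in Python, assuming 24x7 operation and the standard exponential failure model; the 1.2-million-hour figure is a typical vendor claim of that era, not a number from the articles:

```python
import math

# Assumption: drives run 24x7, so ~8760 power-on hours per year.
HOURS_PER_YEAR = 24 * 365

def afr_from_mtbf(mtbf_hours):
    """Expected fraction of drives failing per year,
    under the exponential (constant-hazard) failure model."""
    return 1 - math.exp(-HOURS_PER_YEAR / mtbf_hours)

# A typical "enterprise" datasheet claim: 1.2 million hours MTBF.
claimed = afr_from_mtbf(1_200_000)
print(f"claimed AFR: {claimed:.1%}")  # under 1% per year
```

The CMU and Google surveys observed real-world annual replacement rates of roughly 2-4% on average, several times the datasheet figure, which is why MTBF on a spec sheet tells you so little.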
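If you don’t have a fancy controller doing background media scanning for you, Linux software RAID (md) can be told to scrub by hand. A sketch, assuming an array named md0 (adjust the device name for your system); many distros already schedule exactly this via a monthly cron job:

```shell
# Trigger a full background media scan of /dev/md0 (Linux md RAID).
# The kernel reads every sector on every member disk, forcing the
# drives to surface latent bad sectors while redundancy still exists.
echo check > /sys/block/md0/md/sync_action

# Watch scrub progress:
cat /proc/mdstat

# After it finishes, a nonzero mismatch count is worth investigating:
cat /sys/block/md0/md/mismatch_cnt
```

The point of scrubbing while the array is healthy is that a bad sector found now can be rebuilt from parity; the same sector found during a degraded rebuild means lost data.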
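The RAID 5 rebuild risk is easy to ballpark from the unrecoverable read error (URE) rate on a drive datasheet. A back-of-the-envelope sketch in Python; the 10^-14-per-bit URE rate (common on consumer SATA spec sheets) and the 7+1 array of 500GB disks are illustrative assumptions, not figures from the studies:

```python
# Datasheet URE rate for consumer SATA: ~1 unrecoverable error per 10^14 bits read.
URE_PER_BIT = 1e-14

def p_rebuild_failure(surviving_disks, disk_bytes):
    """Rough probability of hitting at least one URE during a RAID 5 rebuild,
    assuming the rebuild must read every surviving disk in full and
    errors are independent per bit."""
    bits_read = surviving_disks * disk_bytes * 8
    return 1 - (1 - URE_PER_BIT) ** bits_read

# Example: a 7+1 RAID 5 of 500 GB disks loses one drive;
# the rebuild reads the 7 survivors end to end.
p = p_rebuild_failure(7, 500e9)
print(f"chance of hitting a URE during rebuild: {p:.0%}")
```

Even with these rough assumptions the odds come out well over one in five, and they only get worse as disks get bigger, which is exactly why the double-failure numbers are so disturbing.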