On Apr 24, 2013, at 2:39 PM, Konstantin Olchanski wrote:
> On Wed, Apr 24, 2013 at 01:27:19PM -0400, Jeff Siddall wrote:
>> On 04/23/2013 07:20 PM, Konstantin Olchanski wrote:
>>>> disk utility show ... SMART [is] fine.
>>>>>
>>> SMART "health report" is useless. I had dead disks report "SMART OK" and perfectly functional disks report "SMART Failure, replace your disk now".
>>
>> Agreed. SMART doesn't diagnose everything.
>>
>
> Raw data reported by SMART seems solid enough - hours of use, temperatures, bad sector counts, etc.
>
> But the "SMART overall-health self-assessment test result" is useless and
> for the purpose of predicting disk failure, all data reported by SMART is useless.
>
> Maybe one exception: when the number of bad sectors starts incrementing rapidly,
> the disk often fails soon thereafter.
>
> But more typically I see this scenario:
> in the morning - reading the email reports:
> smartctl reports increase of bad sectors
> disk is dropped from the raid array
> smartctl reports that the disk does not support SMART (its way of telling us that the disk died)
> cat /proc/mdstat shows [U_]: we are now running on the spare disk
>
> In other words:
> - all disks will fail eventually
> - there is no reliable predictor for "your disk will fail in 7 days, rush to newegg now!"
> - to prevent complete data loss, implement rsync to some other disk
> - to ensure uninterrupted operation, RAID all disks.
>
> This is all in my experience. Your experience may be different, and if you know a source
> for "this disk will never fail" disks, please let me know.
>
> --
> Konstantin Olchanski
> Data Acquisition Systems: The Bytes Must Flow!
> Email: olchansk-at-triumf-dot-ca
> Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada
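The [U_] in the morning scenario quoted above is easy to check for from a script rather than by eye. Here is a minimal sketch, assuming the usual /proc/mdstat layout where each array's status line ends in a bracketed string of U (member up) and _ (member missing). The function name and the sample text are made up for illustration, not taken from any real machine.

```python
import re

def degraded_arrays(mdstat_text):
    """Return names of md arrays whose status string contains a missing member."""
    degraded = []
    current = None
    for line in mdstat_text.splitlines():
        # Array header line, e.g. "md0 : active raid1 sdb1[1] sda1[0](F)"
        header = re.match(r"^(md\S+)\s*:", line)
        if header:
            current = header.group(1)
            continue
        # Status line ends in something like "[UU]" or "[U_]"
        status = re.search(r"\[([U_]+)\]\s*$", line)
        if current and status and "_" in status.group(1):
            degraded.append(current)
    return degraded

# Hypothetical sample in the style of /proc/mdstat (not real output).
SAMPLE_MDSTAT = """\
Personalities : [raid1]
md0 : active raid1 sdb1[1] sda1[0](F)
      976630336 blocks super 1.2 [2/1] [U_]

md1 : active raid1 sdb2[1] sda2[0]
      10485760 blocks super 1.2 [2/2] [UU]

unused devices: <none>
"""

if __name__ == "__main__":
    print(degraded_arrays(SAMPLE_MDSTAT))
```

On a live system one would pass in the contents of /proc/mdstat instead of the sample string.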
There is a well-known paper regarding Google's experience with SMART data: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf
They find a number of SMART parameters that are reasonably indicative of failure, including "Reallocated Sector", "Current Pending Sector", and "Offline Uncorrectable" counts. That said, IIRC, SMART only predicted failures around 30% of the time.
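As a rough sketch of how one might watch those three counters: the code below parses the attribute table printed by `smartctl -A` and reports which of the watched raw counts grew since the last check. It assumes the usual ten-column table layout and the attribute names smartctl normally uses (Reallocated_Sector_Ct, Current_Pending_Sector, Offline_Uncorrectable). The function names and sample values are hypothetical, and a growing count is only a hint, not a prediction.

```python
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

def watched_raw_values(smartctl_output):
    """Return {attribute_name: raw_value} for the watched attributes."""
    values = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows: ID, name, flag, value, worst, thresh, type,
        # updated, when_failed, raw value (assumed layout).
        if len(fields) >= 10 and fields[0].isdigit() and fields[1] in WATCHED:
            try:
                values[fields[1]] = int(fields[9])
            except ValueError:
                pass  # raw value not a plain integer; skip it
    return values

def grew(previous, current):
    """Names of watched attributes whose raw count increased."""
    return sorted(
        name for name, raw in current.items() if raw > previous.get(name, 0)
    )

# Hypothetical sample in the style of `smartctl -A` (not real output).
SAMPLE_SMART = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       0
  9 Power_On_Hours          0x0032   085   085   000    Old_age   Always       -       13210
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       8
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       8
"""

if __name__ == "__main__":
    yesterday = {"Reallocated_Sector_Ct": 0,
                 "Current_Pending_Sector": 2,
                 "Offline_Uncorrectable": 8}
    today = watched_raw_values(SAMPLE_SMART)
    print(grew(yesterday, today))
```

Comparing against the previous day's values, rather than against a fixed threshold, matches the "starts incrementing rapidly" observation quoted above.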
--Lincoln