SCIENTIFIC-LINUX-USERS Archives

April 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Lincoln Bryant <[log in to unmask]>
Reply To:
Lincoln Bryant <[log in to unmask]>
Date:
Wed, 24 Apr 2013 16:03:38 -0500
Content-Type:
text/plain
Parts/Attachments:
text/plain (46 lines)
On Apr 24, 2013, at 2:39 PM, Konstantin Olchanski wrote:

> On Wed, Apr 24, 2013 at 01:27:19PM -0400, Jeff Siddall wrote:
>> On 04/23/2013 07:20 PM, Konstantin Olchanski wrote:
>>>> disk utility show ... SMART [is] fine.
>>>>> 
>>> SMART "health report" is useless. I had dead disks report "SMART OK" and perfectly functional disks report "SMART Failure, replace your disk now".
>> 
>> Agreed.  SMART doesn't diagnose everything.
>> 
> 
> Raw data reported by SMART seems solid enough - hours of use, temperatures, bad sector counts, etc.
> 
> But the "SMART overall-health self-assessment test result" is useless and
> for the purpose of predicting disk failure, all data reported by SMART is useless.
> 
> Maybe one exception: when the number of bad sectors starts incrementing rapidly,
> the disk often fails soon thereafter.
> 
> But more typically I see this scenario:
> in the morning - reading the email reports:
> smartctl reports increase of bad sectors
> disk is dropped from the RAID array
> smartctl reports that the disk does not support SMART (its way of telling us that the disk died)
> cat mdstat shows [U_] we are now running on the spare disk
> 
> In other words:
> - all disks will fail eventually
> - there is no reliable predictor for "your disk will fail in 7 days, rush to newegg now!", 
> - to prevent complete data loss, implement rsync to some other disk
> - to ensure uninterrupted operation, RAID all disks.
> 
> This is all in my experience. Your experience may be different and if you know a source
> for "this disk will never fail" disks, please let me know.
> 
> -- 
> Konstantin Olchanski
> Data Acquisition Systems: The Bytes Must Flow!
> Email: olchansk-at-triumf-dot-ca
> Snail mail: 4004 Wesbrook Mall, TRIUMF, Vancouver, B.C., V6T 2A3, Canada

There is a well-known paper regarding Google's experience with SMART data: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/archive/disk_failures.pdf

They find a number of SMART parameters that are reasonably indicative of failure, including "Reallocated Sector", "Current Pending Sector", and "Offline Uncorrectable" counts. That said, IIRC, SMART only predicted failures around 30% of the time.
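
For anyone who wants to watch those three counters rather than the overall health flag, here is a minimal sketch in Python. It assumes the usual ten-column attribute table printed by `smartctl -A`, with the raw value in the tenth column. The function names, the SAMPLE text, and the "alert on any increase" rule are illustrative assumptions, not something taken from the paper or from a real disk. Given the low prediction rate mentioned above, treat it as a hint at best, not a substitute for backups and RAID.

```python
# Attribute names as they commonly appear in `smartctl -A` output.
WATCHED = (
    "Reallocated_Sector_Ct",
    "Current_Pending_Sector",
    "Offline_Uncorrectable",
)

# Illustrative sample only (assumed layout, made-up values).
SAMPLE = """\
ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
  5 Reallocated_Sector_Ct   0x0033   100   100   036    Pre-fail  Always       -       8
194 Temperature_Celsius     0x0022   034   045   000    Old_age   Always       -       34 (Min/Max 20/45)
197 Current_Pending_Sector  0x0012   100   100   000    Old_age   Always       -       2
198 Offline_Uncorrectable   0x0010   100   100   000    Old_age   Offline      -       0
"""


def parse_watched_counts(smartctl_output):
    """Return {attribute_name: raw_count} for the watched attributes."""
    counts = {}
    for line in smartctl_output.splitlines():
        fields = line.split()
        # Attribute rows start with a numeric ID and have at least 10 columns.
        if len(fields) < 10 or not fields[0].isdigit():
            continue
        name, raw = fields[1], fields[9]
        if name in WATCHED and raw.isdigit():
            counts[name] = int(raw)
    return counts


def increased(previous, current):
    """Return {name: (old, new)} for every watched count that went up."""
    return {
        name: (previous.get(name, 0), value)
        for name, value in current.items()
        if value > previous.get(name, 0)
    }


if __name__ == "__main__":
    yesterday = {
        "Reallocated_Sector_Ct": 3,
        "Current_Pending_Sector": 2,
        "Offline_Uncorrectable": 0,
    }
    today = parse_watched_counts(SAMPLE)
    for name, (old, new) in increased(yesterday, today).items():
        print(f"{name}: {old} -> {new}")
```

In practice the text would come from running `smartctl -A` against the device (for example from a daily cron job), with the previous day's counts saved somewhere for comparison; how to store them is left open here.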

--Lincoln
