On Tue, Oct 16, 2018 at 8:20 PM Radha Mohan wrote:
>
> On Tue, Oct 16, 2018 at 6:09 PM Paul Robert Marino wrote:
> >
> > To be clear, I wasn't saying SMART is useless, just that smartctl doesn't always tell you everything, so you shouldn't rely on it as a definitive answer for all issues on all disks.
> >
> > As for RAID controllers, that's a very long conversation. There are good reasons the enterprise ones do not expose SMART data, at least not directly in a way you can extract using the smartctl command; instead they have more advanced checks available through the drivers and additional monitoring tools provided by the manufacturer of the RAID controller.
> >
> > As for the predictive nature of SMART, that's actually in its specification: it predicts errors based on indicators.
> >
> > On Tue, Oct 16, 2018 at 7:55 PM Konstantin Olchanski wrote:
> >>
> >> On Tue, Oct 16, 2018 at 04:20:03PM -0400, Paul Robert Marino wrote:
> >> >
> >> > SMART is predictive and doesn't catch all errors. It's also not compatible
> >> > with all disks and controllers, especially RAID-capable controllers.
> >> >
> >>
> >>
> >> Do not reject SMART as useless, it correctly reports many actual disk failures:
> >>
> >> a) overheating (actual disk temperature is reported in degrees Centigrade)
> >> b) unreadable sectors (data on these sectors is already lost) - disk model dependent
> >> c) "hard to read" sectors (WD specific - "raw read error rate")
> >> d) SATA link communication errors ("CRC error count")
> >>
> >> even more useful actual (*not* predictive) stuff is reported for SSDs (again, model dependent)
> >>
> >> it is true that much of this information is disk model dependent and
> >> one has to have some experience with the SMART data to be able
> >> to read it in a meaningful way.
> >>
> >> as for RAID controllers that prevent access to disk SMART data,
> >> they are as safe to use as a car with a blank dashboard (no fuel level,
> >> no engine temperature, no speedometer, etc).
> >>
>
> Posting "smartctl -a" output below.
> Also just wanted to mention that I have only a single disk on my
> machine. So the disk has not failed. I was able to restart the machine
> a lot of times and the OS came up fine.
>
> # smartctl -a /dev/sda
> smartctl 6.2 2017-02-27 r4394 [x86_64-linux-3.10.0-862.14.4.el7.x86_64] (local build)
> Copyright (C) 2002-13, Bruce Allen, Christian Franke, https://urldefense.proofpoint.com/v2/url?u=http-3A__www.smartmontools.org&d=DwIFaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=gd8BzeSQcySVxr0gDWSEbN-P-pgDXkdyCtaMqdCgPPdW1cyL5RIpaIYrCn8C5x2A&m=kQ8vvayrVpln1ARGxS9sNz5F4E2AypuC4yVAsVT_nO4&s=994EuiJp86AYjKROB14_SvOCF1tiERWKXFElbEjMDvo&e=
>
> === START OF INFORMATION SECTION ===
> Model Family:     Toshiba 3.5" MG03ACAxxx(Y) Enterprise HDD
> Device Model:     TOSHIBA MG03ACA100
> Serial Number:    46SIKCQFF
> LU WWN Device Id: 5 000039 6fbf81f8b
> Add. Product Id:  DELL(tm)
> Firmware Version: FL2H
> User Capacity:    1,000,204,886,016 bytes [1.00 TB]
> Sector Size:      512 bytes logical/physical
> Rotation Rate:    7200 rpm
> Device is:        In smartctl database [for details use: -P show]
> ATA Version is:   ATA8-ACS (minor revision not indicated)
> SATA Version is:  SATA 3.0, 6.0 Gb/s (current: 6.0 Gb/s)
> Local Time is:    Tue Oct 16 20:17:41 2018 PDT
> SMART support is: Available - device has SMART capability.
> SMART support is: Enabled
>
> === START OF READ SMART DATA SECTION ===
> SMART overall-health self-assessment test result: PASSED
> Warning: This result is based on an Attribute check.
>
> General SMART Values:
> Offline data collection status:  (0x85) Offline data collection activity
>                                         was aborted by an interrupting command from host.
>                                         Auto Offline Data Collection: Enabled.
> Self-test execution status:      (   0) The previous self-test routine completed
>                                         without error or no self-test has ever
>                                         been run.
> Total time to complete Offline
> data collection:                 (  90) seconds.
> Offline data collection
> capabilities:                    (0x5b) SMART execute Offline immediate.
>                                         Auto Offline data collection on/off support.
>                                         Suspend Offline collection upon new
>                                         command.
>                                         Offline surface scan supported.
>                                         Self-test supported.
>                                         No Conveyance Self-test supported.
>                                         Selective Self-test supported.
> SMART capabilities:            (0x0003) Saves SMART data before entering
>                                         power-saving mode.
>                                         Supports SMART auto save timer.
> Error logging capability:        (0x01) Error logging supported.
>                                         General Purpose Logging supported.
> Short self-test routine
> recommended polling time:        (   2) minutes.
> Extended self-test routine
> recommended polling time:        ( 164) minutes.
> SCT capabilities:              (0x003d) SCT Status supported.
>                                         SCT Error Recovery Control supported.
>                                         SCT Feature Control supported.
>                                         SCT Data Table supported.
>
> SMART Attributes Data Structure revision number: 16
> Vendor Specific SMART Attributes with Thresholds:
> ID# ATTRIBUTE_NAME          FLAG     VALUE WORST THRESH TYPE      UPDATED  WHEN_FAILED RAW_VALUE
>   1 Raw_Read_Error_Rate     0x000b   100   100   050    Pre-fail  Always       -       0
>   2 Throughput_Performance  0x0004   100   100   000    Old_age   Offline      -       0
>   3 Spin_Up_Time            0x0027   100   100   001    Pre-fail  Always       -       4211
>   4 Start_Stop_Count        0x0032   100   100   000    Old_age   Always       -       26
>   5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
>   7 Seek_Error_Rate         0x000a   100   100   000    Old_age   Always       -       0
>   8 Seek_Time_Performance   0x0004   100   100   000    Old_age   Offline      -       0
>   9 Power_On_Hours          0x0032   051   051   000    Old_age   Always       -       19725
>  10 Spin_Retry_Count        0x0032   100   100   000    Old_age   Always       -       0
>  12 Power_Cycle_Count       0x0032   100   100   000    Old_age   Always       -       26
> 192 Power-Off_Retract_Count 0x0032   100   100   000    Old_age   Always       -       25
> 193 Load_Cycle_Count        0x0032   100   100   000    Old_age   Always       -       26
> 194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 20/37)
> 196 Reallocated_Event_Count 0x0032   100   100   000    Old_age   Always       -       0
> 198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
> 199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
> 241 Total_LBAs_Written      0x0032   100   100   000    Old_age   Always       -       2347506755
> 242 Total_LBAs_Read         0x0032   100   100   000    Old_age   Always       -       125819370
>
> SMART Error Log Version: 1
> No Errors Logged
>
> SMART Self-test log structure revision number 1
> Num  Test_Description    Status                  Remaining  LifeTime(hours)  LBA_of_first_error
> # 1  Short offline       Completed without error       00%     19240         -
> # 2  Extended offline    Completed without error       00%     17252         -
> # 3  Short offline       Completed without error       00%     17248         -
> # 4  Short offline       Completed without error       00%         2         -
> # 5  Vendor (0xdf)       Completed without error       00%         2         -
>
> SMART Selective self-test log data structure revision number 1
>  SPAN  MIN_LBA  MAX_LBA  CURRENT_TEST_STATUS
>     1        0        0  Not_testing
>     2        0        0  Not_testing
>     3        0        0  Not_testing
>     4        0        0  Not_testing
>     5        0        0  Not_testing
> Selective self-test flags (0x0):
>   After scanning selected spans, do NOT read-scan remainder of disk.
> If Selective self-test is pending on power-up, resume after 0 minute delay.
> #
>
> >>

After digging around the internet some more I found one more command:

# smartctl -r ioctl,2 -q noserial -d sat -H /dev/sda

This gave a very long output, but the end of it was this:

=== START OF READ SMART DATA SECTION ===
REPORT-IOCTL: Device=/dev/sda Command=SMART STATUS CHECK
 Input: FR=0xda, SC=...., LL=...., LM=0x4f, LH=0xc2, DEV=...., CMD=0xb0
 [ata pass-through(16): 85 06 2c 00 da 00 00 00 00 00 4f 00 c2 00 b0 00 ]
 scsi_status=0x0, host_status=0x4, driver_status=0x0
 info=0x1 duration=23 milliseconds resid=0
sat_device::ata_pass_through: scsi_pass_through() failed, errno=5 [Input/output error]
 [Duration: 0.023s]
REPORT-IOCTL: Device=/dev/sda Command=SMART STATUS CHECK returned -1 errno=5 [Input/output error]
SMART overall-health self-assessment test result: PASSED
Warning: This result is based on an Attribute check.

-------------------------

Based on the above and the discussion at
https://urldefense.proofpoint.com/v2/url?u=https-3A__sourceforge.net_p_smartmontools_mailman_message_31516819_&d=DwIFaQ&c=gRgGjJ3BkIsb5y6s49QqsA&r=gd8BzeSQcySVxr0gDWSEbN-P-pgDXkdyCtaMqdCgPPdW1cyL5RIpaIYrCn8C5x2A&m=kQ8vvayrVpln1ARGxS9sNz5F4E2AypuC4yVAsVT_nO4&s=Wehwe-jGCdLk6vKXPB0_Li2KnkcYiIKLvkmPSutdYqs&e=
I came to the conclusion that the disk isn't bad at all. It's just that one particular ATA pass-through command (SMART STATUS CHECK) fails because the controller, or something else in the path, doesn't support it, so smartctl falls back to an attribute check. A failure that can be ignored :)

- RM
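
For anyone who wants to run the same attribute check by script rather than by eye, here is a rough Python sketch. It only parses text in the attribute-table layout that "smartctl -a" printed in this thread. The names parse_attributes, suspicious and WATCHED are made up for illustration, and treating any non-zero raw value of the watched attributes as suspicious is an assumption, not a rule from smartmontools or from any drive vendor.

```python
import re

# Attributes to watch, loosely following the list earlier in the thread.
# This selection is an assumption made for illustration only.
WATCHED = {
    "Reallocated_Sector_Ct",
    "Offline_Uncorrectable",
    "UDMA_CRC_Error_Count",
}

# One row of the attribute table:
# ID NAME FLAG VALUE WORST THRESH TYPE UPDATED WHEN_FAILED RAW_VALUE
ROW = re.compile(
    r"^\s*(\d+)\s+(\S+)\s+(0x[0-9a-fA-F]+)\s+(\d+)\s+(\d+)\s+(\d+)\s+"
    r"(\S+)\s+(\S+)\s+(\S+)\s+(\d+)"
)


def parse_attributes(text):
    """Return {attribute_name: raw_value} for every attribute row in text."""
    attrs = {}
    for line in text.splitlines():
        m = ROW.match(line)
        if m:
            # Only the leading integer of RAW_VALUE is kept, so
            # "32 (Min/Max 20/37)" becomes 32.
            attrs[m.group(2)] = int(m.group(10))
    return attrs


def suspicious(attrs):
    """Return the watched attributes whose raw value is above zero."""
    return sorted(name for name in WATCHED if attrs.get(name, 0) > 0)


# A few rows copied from the output posted in this thread.
SAMPLE = """\
  5 Reallocated_Sector_Ct   0x0033   100   100   050    Pre-fail  Always       -       0
194 Temperature_Celsius     0x0022   100   100   000    Old_age   Always       -       32 (Min/Max 20/37)
198 Offline_Uncorrectable   0x0030   100   100   000    Old_age   Offline      -       0
199 UDMA_CRC_Error_Count    0x0032   200   200   000    Old_age   Always       -       0
"""

if __name__ == "__main__":
    attrs = parse_attributes(SAMPLE)
    print("temperature:", attrs["Temperature_Celsius"])
    print("suspicious attributes:", suspicious(attrs))
```

On the rows above nothing is flagged, which matches the conclusion that the disk itself looks healthy. As noted earlier in the thread, the meaning of raw values depends on the disk model, so a check like this is a first filter at best.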