SCIENTIFIC-LINUX-USERS Archives

April 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Jeff Siddall <[log in to unmask]>
Date: Wed, 24 Apr 2013 13:20:12 -0400

On 04/24/2013 11:03 AM, Joseph Areeda wrote:
> Thanks for the tips Konstantin,
>
> I assume that your recommendation for 24 hrs of memtest is cumulative
> and I can probably see the same results starting it each night when I
> quit for the day.
>
> When I mentioned SMART I was talking about the self tests not the status
> that comes up.  I've also copied large files around and checked their
> md5sum's.
>
> I played with LiveCD for 4 or 5 hours today, much of it was trying to
> install it on a different spinning hard drive.
>
> I did see one time when the SSD was shown in the disk utility but all
> the partitions were zero length.  That's where my root directory used to be.

I recently discovered that a flaky disk can really mess a system up.  I 
had an old CentOS5 machine that I recently reinstalled as SL6 because it 
was hanging frequently and eventually, after a reboot from a frozen 
state, had so many fsck errors that it would not boot.

After the reinstall with SL the hangs continued.  Nothing showed up in
the logs, and whenever I went to the machine after a hang the monitor
was asleep and the system was entirely unresponsive.

Ran memtest for 24+ hours, no errors.  But recently it threw these 
errors on the console while the monitor was _not_ asleep:

kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
kernel: ata4: irq_stat 0x00400000, PHY RDY changed
kernel: ata4: SError: { Persist PHYRdyChg 10B8B }
kernel: ata4: hard resetting link

Swapped out the drive and now everything runs smoothly.
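
For anyone who wants to read the SErr value in the first line of that
output, here is a rough Python sketch.  It assumes the three flag names
printed in the SError line correspond to the three bits set in 0x90200
(bits 9, 16 and 19), which is what the log itself suggests.  The
function names are made up for illustration, and any other bit is just
reported as unknown rather than guessed at.

```python
import re

# Only the three bits whose names appear in the log above are listed.
# Assumption: bit 9 = Persist, bit 16 = PHYRdyChg, bit 19 = 10B8B,
# inferred from "SErr 0x90200" being printed next to those three names.
SERR_BITS = {9: "Persist", 16: "PHYRdyChg", 19: "10B8B"}


def decode_serr(value):
    """Return (known flag names, mask of bits this sketch does not name)."""
    names = []
    unknown = 0
    for bit in range(32):
        if value & (1 << bit):
            if bit in SERR_BITS:
                names.append(SERR_BITS[bit])
            else:
                unknown |= 1 << bit
    return names, unknown


def serr_from_line(line):
    """Pull the hex SErr value out of a kernel log line, or None if absent."""
    match = re.search(r"SErr (0x[0-9a-fA-F]+)", line)
    return int(match.group(1), 16) if match else None


line = "kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen"
names, unknown = decode_serr(serr_from_line(line))
print(names, hex(unknown))
```

All three flags point at the SATA link itself (PHY state change, 10b/8b
decode error) rather than at a bad sector, so a drive, cable or port
problem is the thing to suspect.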

When I ran pvmove with the disk installed in another machine, I found a
number of similar errors in that machine's logs.  Because the disk did
not hold that machine's root or swap partitions, the kernel could reset
the link and continue moving data.
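
If you want to check your own logs for the same kind of messages, a
small Python sketch along these lines might help.  It only looks for
the message shapes shown above ("exception Emask", "SError:", "hard
resetting link") and counts them per ATA port.  The function name and
the idea of feeding it lines from a syslog file are my own assumptions,
and other kernels may word these messages differently.

```python
import re
from collections import Counter

# Matches only the message shapes quoted in this thread.
ATA_EVENT = re.compile(
    r"kernel: (ata\d+(?:\.\d+)?): (exception Emask|SError:|hard resetting link)"
)


def count_ata_events(lines):
    """Count link-error messages per (port, message type)."""
    counts = Counter()
    for line in lines:
        match = ATA_EVENT.search(line)
        if match:
            counts[(match.group(1), match.group(2))] += 1
    return counts


# Sample input: the four console lines from this message.
sample = [
    "kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen",
    "kernel: ata4: irq_stat 0x00400000, PHY RDY changed",
    "kernel: ata4: SError: { Persist PHYRdyChg 10B8B }",
    "kernel: ata4: hard resetting link",
]

for (port, event), n in sorted(count_ata_events(sample).items()):
    print(port, event, n)
```

To use it on a real system you would pass it the lines of whatever file
your syslog writes kernel messages to (that path varies by setup, so it
is left out here).  A port that keeps showing up with hard resets is a
good candidate for a drive or cable swap.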

Jeff
