LISTSERV - SCIENTIFIC-LINUX-USERS Archives

April 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	Re: Help finding a hardware problem (I think)
From:	Jeff Siddall <[log in to unmask]>
Reply To:	Jeff Siddall <[log in to unmask]>
Date:	Wed, 24 Apr 2013 13:20:12 -0400
Content-Type:	text/plain
Parts/Attachments:	text/plain (43 lines)

On 04/24/2013 11:03 AM, Joseph Areeda wrote:
> Thanks for the tips Konstantin,
>
> I assume that your recommendation for 24 hrs of memtest is cumulative
> and I can probably see the same results starting it each night when I
> quit for the day.
>
> When I mentioned SMART I was talking about the self tests not the status
> that comes up.  I've also copied large files around and checked their
> md5sums.
>
> I played with LiveCD for 4 or 5 hours today, much of it was trying to
> install it on a different spinning hard drive.
>
> I did see one time when the SSD was shown in the disk utility but all
> the partitions were zero length.  That's where my root directory used to be.

I recently discovered that a flaky disk can really mess a system up.  I 
had an old CentOS5 machine that I recently reinstalled as SL6 because it 
was hanging frequently and eventually, after a reboot from a frozen 
state, had so many fsck errors that it would not boot.

After the upgrade to SL the hangs continued.  There was nothing in the
logs, and whenever I went to the machine after a hang the monitor was
asleep and the machine was otherwise entirely unresponsive.

Ran memtest for 24+ hours, no errors.  But recently it threw these 
errors on the console while the monitor was _not_ asleep:

kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen
kernel: ata4: irq_stat 0x00400000, PHY RDY changed
kernel: ata4: SError: { Persist PHYRdyChg 10B8B }
kernel: ata4: hard resetting link

Swapped out the drive and now everything runs smoothly.
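
For anyone who wants to check their own logs for the same kind of
messages, here is a rough sketch in Python 3 (not what I actually ran,
just an illustration).  The function name and the pattern list are made
up for this example and are based only on the console lines quoted
above; other kernels or controllers may word these messages differently.

```python
import re
from collections import Counter

# Patterns based only on the console messages quoted in this thread.
# This list is an assumption, not a complete catalogue of libata errors.
ATA_ERROR = re.compile(
    r"kernel: (ata\d+(?:\.\d+)?): "
    r"(exception Emask|SError:|hard resetting link)"
)

def count_ata_errors(lines):
    """Return a Counter mapping ATA port name to number of matching lines."""
    counts = Counter()
    for line in lines:
        match = ATA_ERROR.search(line)
        if match:
            counts[match.group(1)] += 1
    return counts

# The four console lines from the failing machine, used as sample input.
sample = [
    "kernel: ata4: exception Emask 0x10 SAct 0x0 SErr 0x90200 action 0xe frozen",
    "kernel: ata4: irq_stat 0x00400000, PHY RDY changed",
    "kernel: ata4: SError: { Persist PHYRdyChg 10B8B }",
    "kernel: ata4: hard resetting link",
]

for port, n in sorted(count_ata_errors(sample).items()):
    print(f"{port}: {n} suspicious line(s)")
```

To scan a real log, pass an open file instead of the sample list, for
example the syslog file (typically /var/log/messages on SL6, but that
path is an assumption about your setup).  A port that shows up
repeatedly is a hint to look at that drive, cable, or controller port.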

When running pvmove with the disk installed in another machine I found a 
number of similar errors in that machine's logs but because the disk was 
not the root/swap partition drive on that machine it could reset the 
link and continue moving data.

Jeff
