SCIENTIFIC-LINUX-USERS Archives

April 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Joseph Areeda <[log in to unmask]>
Reply To:
Joseph Areeda <[log in to unmask]>
Date:
Wed, 24 Apr 2013 10:21:55 -0700
Content-Type:
text/plain
Parts/Attachments:
text/plain (127 lines)
I can't thank you all enough for bearing with me as I stumble my way 
through this.

I now understand the logic behind running memtest uninterrupted for a 
long period (>24hr) and will do that.
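
The temperature argument quoted below can be illustrated with a toy model. This is only a sketch under made-up assumptions: the heating rate, cooling constant, critical temperature and the `peak_temperature` helper are all hypothetical and do not describe any real hardware or memtest itself. It compares 24 hours of load in one run against the same 24 hours split into nightly 3-hour sessions.

```python
def peak_temperature(schedule, ambient=25.0, heat_per_step=1.0, cooling=0.05):
    """Return the highest temperature reached under a load schedule.

    schedule: iterable of booleans, one per time step (True = under load).
    Each step the part gains heat_per_step while loaded, then loses a
    fraction `cooling` of its excess over ambient (Newtonian cooling).
    All constants are illustrative assumptions, not measurements.
    """
    temp = ambient
    peak = temp
    for loaded in schedule:
        if loaded:
            temp += heat_per_step
        temp -= cooling * (temp - ambient)
        peak = max(peak, temp)
    return peak


STEPS_PER_HOUR = 10
CRITICAL = 42.0  # hypothetical temperature at which the fault appears

# 24 hours of load in a single uninterrupted run.
continuous = [True] * (24 * STEPS_PER_HOUR)

# The same 24 hours of load, split into 8 sessions of 3 hours,
# each followed by 16 hours of idle time in which the part cools down.
interrupted = ([True] * (3 * STEPS_PER_HOUR) + [False] * (16 * STEPS_PER_HOUR)) * 8

for name, schedule in (("continuous", continuous), ("interrupted", interrupted)):
    peak = peak_temperature(schedule)
    verdict = "reaches" if peak >= CRITICAL else "never reaches"
    print(f"{name}: peak {peak:.1f}, {verdict} the critical temperature")
```

With these numbers the continuous run crosses the hypothetical threshold and the interrupted one does not, because each idle period lets the part cool back to nearly ambient before the next session starts.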

I have to take back my comment about kmod-nvidia.  I repeatedly messed 
up /etc/selinux/config trying to disable it and that is what was causing 
the kernel panics.  I suppose that's a sign I'm not paying enough attention.
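
For anyone else editing that file by hand, a small sanity check before rebooting might look like the sketch below. It assumes the usual KEY=value layout with SELINUX set to enforcing, permissive or disabled. The function name `check_selinux_config` is made up for illustration, and the check is deliberately strict (spaces around the `=` are reported as a problem), which is an assumption and not a statement about what the real parser accepts.

```python
VALID_MODES = {"enforcing", "permissive", "disabled"}


def check_selinux_config(text):
    """Return a list of problems found in the text of an SELinux config file."""
    settings = {}
    problems = []
    for lineno, raw in enumerate(text.splitlines(), start=1):
        line = raw.strip()
        if not line or line.startswith("#"):
            continue  # blank lines and comments are fine
        key, sep, value = line.partition("=")
        if not sep:
            problems.append(f"line {lineno}: not a KEY=value line: {raw!r}")
            continue
        settings[key] = value

    mode = settings.get("SELINUX")
    if mode is None:
        problems.append("no SELINUX= line found")
    elif mode not in VALID_MODES:
        problems.append(f"SELINUX={mode!r} is not one of {sorted(VALID_MODES)}")

    # A mode name in SELINUXTYPE usually means the wrong line was edited.
    if settings.get("SELINUXTYPE") in VALID_MODES:
        problems.append("SELINUXTYPE holds a mode name; was SELINUX meant?")
    return problems


if __name__ == "__main__":
    sample = "# sample file\nSELINUX=enforcing\nSELINUXTYPE=disabled\n"
    for problem in check_selinux_config(sample):
        print(problem)
```

To check the real file, pass in its contents, e.g. `check_selinux_config(open("/etc/selinux/config").read())`.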

The purpose of running from LiveCD is not necessarily to find a hardware 
problem but to remove the hard disks and the installed software from the 
equation.  The idea is that IF I got one of these rare and random failures 
while running that way, I could rule out insidious package conflicts, 
mangled configurations and the system disk as the cause.

As far as finding a computer repair professional whom I would go to for 
a problem like this, well all I can say is I've been living in this town 
for 32 years working in computing, I do have an outstanding doctor, a 
great car mechanic, an exceptional plumber... but I haven't found a 
computer guy better than me at this.  That is not to imply that I am any 
good at it.

I am now up and running with SL6.4 on a spinning disk (to remove the SSD 
and a bunch of useful and needed packages from the equation). I'll try to 
get some work done today and see if it crashes.

My next step is to swap memory and GPU with another box and see if the 
problem follows.

I hope I'm not posting too much useless (to others) information to the list.

Joe

On 04/24/2013 09:10 AM, Yasha Karant wrote:
> A small comment:  stress testing is cumulative only if the underlying 
> system has no recovery mechanism. (An understanding of this in detail 
> requires non-equilibrium statistical mechanics but can be summarized 
> with non-equilibrium "thermodynamics").  My experience with failing 
> electronics and magnetics -- depending upon the exact failure mode -- 
> is that non-interrupted stress testing is better than interrupted in 
> terms of finding failures.  A simple example: suppose a failure mode 
> is temperature dependent, and temperature depends upon the amount of 
> work being done.  An interrupted but cumulative stress test might 
> never reach the "critical" temperature, whereas a continued stress 
> test might.
>
> Yasha Karant
>
> On 04/24/2013 08:03 AM, Joseph Areeda wrote:
>> Thanks for the tips Konstantin,
>>
>> I assume that your recommendation for 24 hrs of memtest is cumulative
>> and I can probably see the same results starting it each night when I
>> quit for the day.
>>
>> When I mentioned SMART I was talking about the self tests, not the status
>> that comes up.  I've also copied large files around and checked their
>> md5sums.
>>
>> I played with LiveCD for 4 or 5 hours today, much of it was trying to
>> install it on a different spinning hard drive.
>>
>> I did see one time when the SSD was shown in the disk utility but all
>> the partitions were zero length.  That's where my root directory used
>> to be.
>>
>> I also found that the nvidia drivers in ELREPO don't seem to work with
>> 6.4.  I seem to be able to run fine (at least for a while) unless I
>> install kmod-nvidia; then I get a kernel panic on the next reboot (3
>> times until I tracked it down).  It says something like "not syncing
>> attempt xxx(can't read my writing) PID 1 comm init not tainted
>> 2.6.32.258.2.1".  That's another problem I think.
>>
>> Right now I suspect not necessarily in order:
>>
>>   * Bad SSD.  Run time is reported as 1.8 years.  I did have /usr
>>     /usr/local /tmp swap and /home on spinning media but...
>>   * Bad memory:  still a good possibility
>>   * Some insidious incompatibility with all packages from multiple
>>     repos.  I really hope it's not that, I don't load much I don't need.
>>
>> And as for finding a real computer repairman, let me know if you have
>> one in Los Angeles.  This is similar to a problem I had with an iMac.
>> It took three trips to the store to convince the geniuses that something
>> was wrong, and that was after about an hour each time with the phone
>> support people.  That one turned out to be a flaky memory DIMM that
>> passed all the quick diagnostics.
>>
>> Oh well the saga continues.  It's nice to have a group to go to for ideas.
>> Thank you all.
>>
>> Joe
>>
>>
>> On 04/23/2013 04:20 PM, Konstantin Olchanski wrote:
>>> On Tue, Apr 23, 2013 at 11:44:22AM -0700, Joseph Areeda wrote:
>>>> I'm having this strange behavior that I think is a hardware problem 
>>>> ...
>>>> * System freezes, mouse and keyboard dead, sshd unresponsive sometimes
>>>>
>>> First action is to run memtest86 (Q: which one? google finds several.
>>> A: all of them).
>>>
>>> Run memtest86 for 24 hours at least - if it reports memory errors,
>>> hangs, freezes or machine turns off, you definitely have a hardware
>>> problem. Suspect parts are in this order: RAM, power supply, CPU socket
>>> (bent pins), mobo, CPU.
>>>
>>> If memtest86 runs fine for 24 hours and more, there *still* could be
>>> a hardware problem. (memtest86 does not test the video, the disk, the
>>> network and the usb interfaces).
>>>
>>>> disk utility show ... SMART [is] fine.
>>>>
>>> SMART "health report" is useless. I had dead disks report "SMART OK" 
>>> and perfectly functional disks report "SMART Failure, replace your 
>>> disk now".
>>>
>>> This is free advice. For advice that would actually get your computer
>>> working again, you would want to hire a proper computer repairman.
>>>
>>
