SCIENTIFIC-LINUX-USERS Archives

April 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Yasha Karant <[log in to unmask]>
Reply To:
Yasha Karant <[log in to unmask]>
Date:
Wed, 24 Apr 2013 09:10:25 -0700
A small comment: stress testing is cumulative only if the underlying
system has no recovery mechanism. (Understanding this in detail
requires non-equilibrium statistical mechanics, but it can be summarized
with non-equilibrium "thermodynamics".)  My experience with failing
electronics and magnetics -- depending upon the exact failure mode -- is
that uninterrupted stress testing finds failures better than interrupted
testing.  A simple example: suppose a failure mode is temperature
dependent, and temperature depends upon the amount of work being done.
An interrupted but cumulative stress test might never reach the
"critical" temperature, whereas a continuous stress test might.
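
The temperature example can be made concrete with a toy model. What follows
is only a minimal sketch, assuming a first-order thermal model in which the
part relaxes exponentially toward a load-dependent steady temperature. The
ambient temperature, temperature rise at full load, time constant, critical
temperature and duty cycle are all hypothetical numbers chosen for
illustration, not measurements of any real hardware.

```python
def peak_temperature(schedule, ambient=25.0, rise_at_full_load=50.0,
                     tau=600.0, dt=1.0):
    """Return the highest temperature reached while following a schedule.

    schedule is a list of (duration_seconds, load_fraction) pairs.
    All default parameters are hypothetical illustration values.
    """
    temp = ambient
    peak = temp
    for duration, load in schedule:
        # Steady-state temperature the part would settle at under this load.
        target = ambient + rise_at_full_load * load
        for _ in range(int(duration / dt)):
            # Explicit Euler step of dT/dt = (target - T) / tau.
            temp += (target - temp) * dt / tau
            peak = max(peak, temp)
    return peak


CRITICAL = 70.0  # hypothetical temperature at which the failure mode appears

# One uninterrupted hour at full load.
continuous = [(3600.0, 1.0)]

# The same 3600 s of load, split into 5-minute bursts with 15-minute rests.
interrupted = [(300.0, 1.0), (900.0, 0.0)] * 12

peak_continuous = peak_temperature(continuous)
peak_interrupted = peak_temperature(interrupted)

print("continuous run reaches critical temperature:", peak_continuous >= CRITICAL)
print("interrupted run reaches critical temperature:", peak_interrupted >= CRITICAL)
```

With these assumed numbers both schedules accumulate the same load time, but
only the continuous one gets hot enough to trigger the failure, because the
rests let the part cool (the "recovery mechanism"). If each burst is long
compared with the thermal time constant, the two schedules reach nearly the
same peak and the difference disappears.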

Yasha Karant

On 04/24/2013 08:03 AM, Joseph Areeda wrote:
> Thanks for the tips Konstantin,
>
> I assume that your recommendation for 24 hrs of memtest is cumulative
> and I can probably see the same results starting it each night when I
> quit for the day.
>
> When I mentioned SMART I was talking about the self tests, not the status
> that comes up.  I've also copied large files around and checked their
> md5sums.
>
> I played with LiveCD for 4 or 5 hours today, much of it spent trying to
> install it on a different spinning hard drive.
>
> I did see one time when the SSD was shown in the disk utility but all
> the partitions were zero length.  That's where my root directory used to be.
>
> I also found that the nvidia drivers in ELREPO don't seem to work with
> 6.4.  I seem to be able to run fine (at least for a while) unless I
> install kmod-nvidia; then I get a kernel panic on the next reboot (3
> times until I tracked it down).  It says something like "not syncing
> attempt xxx (can't read my writing) PID 1 comm init not tainted
> 2.6.32.258.2.1".  That's another problem, I think.
>
> Right now I suspect, not necessarily in order:
>
>   * Bad SSD.  Run time is reported as 1.8 years.  I did have /usr,
>     /usr/local, /tmp, swap and /home on spinning media but...
>   * Bad memory:  still a good possibility
>   * Some insidious incompatibility with all packages from multiple
>     repos.  I really hope it's not that, I don't load much I don't need.
>
> And as for finding a real computer repairman, let me know if you have
> one in Los Angeles.  This is similar to a problem I had with an iMac.
> It took three trips to the store to convince the geniuses that something
> was wrong, and that was after about an hour each time with the phone
> support people.  That one turned out to be a flaky memory DIMM that
> passed all the quick diagnostics.
>
> Oh well, the saga continues.  It's nice to have a group to go to for ideas.
> Thank you all.
>
> Joe
>
>
> On 04/23/2013 04:20 PM, Konstantin Olchanski wrote:
>> On Tue, Apr 23, 2013 at 11:44:22AM -0700, Joseph Areeda wrote:
>>> I'm having this strange behavior that I think is a hardware problem ...
>>> * System freezes, mouse and keyboard dead, sshd unresponsive sometimes
>>>
>> First action is to run memtest86 (Q: which one? google finds several. A: all of them).
>>
>> Run memtest86 for 24 hours at least - if it reports memory errors, hangs, freezes or
>> machine turns off, you definitely have a hardware problem. Suspect parts
>> are in this order: RAM, power supply, CPU socket (bent pins), mobo, CPU.
>>
>> If memtest86 runs fine for 24 hours and more, there *still* could be a hardware
>> problem. (memtest86 does not test the video, the disk, the network
>> and the usb interfaces).
>>
>>> disk utility show ... SMART [is] fine.
>>>
>> SMART "health report" is useless. I have had dead disks report "SMART OK" and perfectly functional disks report "SMART Failure, replace your disk now".
>>
>> This is free advice. For advice that would actually get your computer
>> working again, you would want to hire a proper computer repairman.
>>
>
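
The copy-and-checksum check Joseph describes in the quoted message could be
scripted roughly as follows. This is only a sketch: the function names
md5_of and copy_and_verify are made up for illustration, and the file paths
are whatever large files happen to be at hand. Note that on a machine with
plenty of free RAM, reading a file back right after copying it may be
served from the page cache rather than from the disk, so a match does not
by itself prove the disk is healthy.

```python
import hashlib
import shutil


def md5_of(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to bound memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def copy_and_verify(source, destination):
    """Copy source to destination and report whether the two digests match."""
    shutil.copyfile(source, destination)
    return md5_of(source) == md5_of(destination)
```

Repeating the copy many times, across different disks, makes an
intermittent fault more likely to show up as a mismatch.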
