A small comment: stress testing is cumulative only if the underlying
system has no recovery mechanism. (Understanding this in detail
requires non-equilibrium statistical mechanics, but it can be summarized
with non-equilibrium "thermodynamics".) My experience with failing
electronics and magnetics -- depending upon the exact failure mode -- is
that uninterrupted stress testing is better than interrupted testing at
finding failures. A simple example: suppose a failure mode is
temperature dependent, and temperature depends upon the amount of work
being done. An interrupted but cumulative stress test might never reach
the "critical" temperature, whereas a continuous stress test might.
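
To make the temperature example concrete, here is a minimal sketch in
Python. It assumes a first-order thermal model (the part relaxes
exponentially toward a load-dependent steady-state temperature), and
every number in it (ambient temperature, temperature rise under load,
time constant, critical temperature, test schedule) is a hypothetical
value chosen for illustration, not a measurement of any real hardware.

```python
import math

# All constants below are assumed, illustrative values.
T_AMBIENT = 25.0        # deg C, idle steady-state temperature
RISE_UNDER_LOAD = 60.0  # deg C above ambient at steady state under load
TAU_MIN = 30.0          # thermal time constant, in minutes
T_CRITICAL = 80.0       # deg C at which the hypothetical failure appears


def peak_temperature(segments):
    """Return the highest temperature reached over a test schedule.

    segments is a list of (duration_minutes, under_load) tuples.
    Within each segment the temperature relaxes exponentially toward
    the steady-state value for that load, so it is monotonic and the
    extreme value occurs at a segment boundary.
    """
    temp = T_AMBIENT
    peak = temp
    for duration, under_load in segments:
        target = T_AMBIENT + (RISE_UNDER_LOAD if under_load else 0.0)
        temp = target + (temp - target) * math.exp(-duration / TAU_MIN)
        peak = max(peak, temp)
    return peak


# Both schedules accumulate the same 480 minutes under load.
continuous = [(480, True)]
interrupted = [(30, True), (30, False)] * 16

for name, schedule in (("continuous", continuous), ("interrupted", interrupted)):
    peak = peak_temperature(schedule)
    print(f"{name}: peak {peak:.1f} C, reaches critical: {peak >= T_CRITICAL}")
```

Under these assumptions the interrupted schedule cools during every
pause and its peak stays below the critical value, even though the
total time under load matches the continuous run.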
Yasha Karant
On 04/24/2013 08:03 AM, Joseph Areeda wrote:
> Thanks for the tips Konstantin,
>
> I assume that your recommendation for 24 hrs of memtest is cumulative
> and I can probably see the same results starting it each night when I
> quit for the day.
>
> When I mentioned SMART I was talking about the self tests not the status
> that comes up. I've also copied large files around and checked their
> md5sums.
>
> I played with LiveCD for 4 or 5 hours today, much of it was trying to
> install it on a different spinning hard drive.
>
> I did see one time when the SSD was shown in the disk utility but all
> the partitions were zero length. That's where my root directory used to be.
>
> I also found that the nvidia drivers in ELREPO don't seem to work with
> 6.4. I seem to be able to run fine (at least for a while) unless I
> install kmod-nvidia; then I get a kernel panic on the next reboot (3
> times until I tracked it down). It says something like "not syncing
> attempt xxx (can't read my writing) PID 1 comm init not tainted
> 2.6.32.258.2.1". That's another problem, I think.
>
> Right now I suspect, not necessarily in order:
>
> * Bad SSD. Run time is reported as 1.8 years. I did have /usr
> /usr/local /tmp swap and /home on spinning media but...
> * Bad memory: still a good possibility
> * Some insidious incompatibility with all packages from multiple
> repos. I really hope it's not that, I don't load much I don't need.
>
> And as for finding a real computer repairman, let me know if you have
> one in Los Angeles. This is similar to a problem I had with an iMac.
> It took three trips to the store to convince the geniuses that something
> was wrong, and that was after about an hour each time with the phone
> support people. That one turned out to be a flaky memory DIMM that
> passed all the quick diagnostics.
>
> Oh well, the saga continues. It's nice to have a group to go to for ideas.
> Thank you all.
>
> Joe
>
>
> On 04/23/2013 04:20 PM, Konstantin Olchanski wrote:
>> On Tue, Apr 23, 2013 at 11:44:22AM -0700, Joseph Areeda wrote:
>>> I'm having this strange behavior that I think is a hardware problem ...
>>> * System freezes, mouse and keyboard dead, sshd unresponsive sometimes
>>>
>> First action is to run memtest86 (Q: which one? google finds several. A: all of them).
>>
>> Run memtest86 for 24 hours at least - if it reports memory errors, hangs, freezes or
>> machine turns off, you definitely have a hardware problem. Suspect parts
>> are in this order: RAM, power supply, CPU socket (bent pins), mobo, CPU.
>>
>> If memtest86 runs fine for 24 hours and more, there *still* could be a hardware
>> problem. (memtest86 does not test the video, the disk, the network
>> and the usb interfaces).
>>
>>> disk utility show ... SMART [is] fine.
>>>
>> SMART "health report" is useless. I had dead disks report "SMART OK" and perfectly functional disks report "SMART Failure, replace your disk now".
>>
>> This is free advice. For advice that would actually get your computer
>> working again, you would want to hire a proper computer repairman.
>>
>