Reply To: John A. Goebel
Date: Thu, 10 Feb 2005 14:54:31 -0800
++ 10/02/05 16:32 -0600 - <Miles O'Neal>:
Hey Miles,
> Michael David Joy said...
>
> |Regardless of the fact that the system is probably new, you might try
> |checking a few hardware issues.
> |
> |Two that I'd suggest are the memory and memory config. I've had at least
> |one dual opteron system 'eat' memory (ie ecc memory errors on one of the
> |cpu memory banks). The replacement memory failed too, turns out it was a
> |bad cpu that was somehow frying the memory via the memory bus. AMD is
> |still investigating the failure. I'd guess that it's tied to the
> |integrated memory controller. The CPU finally failed after 3 months of
> |operation.
> |
> |Anyways, try putting all the memory in one cpu's memory bank if it's a
> |Numa model (separate bank of memory per CPU) to disable the Numa
> |abilities.
>
> Both banks are full. Commensurate with the
> OS upgrade we added more RAM. But...
>
> |Also, you might try memtest86 and see if one of the memory modules is
> |definitely bad. I've had more than a few memory modules test good and
> |finally fail after a month or two of running under a full load.
>
> ...the system worked fine before the RAM
> and OS upgrade (running SuSE LES 8), and
> works fine now, other than this. And the
> same thing happens regardless of other
> system activity. The first time, it
> happened on a freshly booted, quiescent
> system. It's since happened on a system
> where over half the RAM was in use by
> one process...
>
> Thanks, though. We will watch and see
> if there's anything else HW-wise. I almost
> ran memtest after the upgrade, but one of
> our engineers was in a bind for this system.
>
> Since it's a compute server and the only
> problem so far is that nautilus barfs, I
> don't plan to take the system down until
> after the crunch...
>
> Thanks,
> Miles
You can always look at the IPMI sensor data and the SEL (System Event Log),
either via the SP (service processor) or from the platform operating system
(try FreeIPMI to get hardware-level logging). No kernel module needed.
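A rough sketch of what that check could look like from the OS side. It assumes
the tool names used by current FreeIPMI releases (ipmi-sensors and ipmi-sel),
and the check_ipmi wrapper is just a made-up name for illustration. The tools
usually need root to talk to the BMC.

```shell
#!/bin/sh
# Sketch: dump IPMI sensor readings and the System Event Log (SEL)
# with FreeIPMI's command-line tools, if they are installed.
# Assumption: the tools are named ipmi-sensors and ipmi-sel, as in
# current FreeIPMI releases. check_ipmi is a hypothetical helper name.

check_ipmi() {
    for tool in ipmi-sensors ipmi-sel; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "== $tool =="
            # In-band access normally requires root privileges.
            "$tool" || echo "$tool failed (not root, or no BMC reachable?)"
        else
            echo "$tool not found; install FreeIPMI first"
        fi
    done
}

check_ipmi
```

Memory trouble, if the hardware logged any, would typically show up in the SEL
output as ECC or memory events.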
Also, if this system isn't in production and you have a little time, you can
get the Knoppix CD and boot into memtest86. It will report most problems with
RAM.
Just a helpful (hopefully) hint from Heloise,
John
##############################################
# John Goebel <jgoebel(at)slac.stanford.edu> #
# Stanford Linear Accelerator Center         #
# 2575 Sand Hill Road, Menlo Park, CA 94025  #
##############################################