SCIENTIFIC-LINUX-USERS Archives

February 2005

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
"John A. Goebel" <[log in to unmask]>
Reply To:
John A. Goebel
Date:
Thu, 10 Feb 2005 14:54:31 -0800
++ 10/02/05 16:32 -0600 - <Miles O'Neal>:

Hey Miles,

> Michael David Joy said...
> 
> |Regardless of the fact that the system is probably new, you might try
> |checking a few hardware issues.
> |
> |Two that I'd suggest are the memory and memory config. I've had at least
> |one dual opteron system 'eat' memory (ie ecc memory errors on one of the
> |cpu memory banks). The replacement memory failed too, turns out it was a
> |bad cpu that was somehow frying the memory via the memory bus. AMD is
> |still investigating the failure. I'd guess that it's tied to the
> |integrated memory controller. The CPU finally failed after 3 months of
> |operation.
> |
> |Anyways, try putting all the memory in one cpu's memory bank if it's a
> |Numa model (separate bank of memory per CPU) to disable the Numa
> |abilities.
> 
> Both banks are full.  Commensurate with the
> OS upgrade we added more RAM.  But...
> 
> |Also, you might try memtest86 and see if one of the memory modules is
> |definitely bad. I've had more than a few memory modules test good and
> |finally fail after a month or two of running under a full load.
> 
> ...the system worked fine before the RAM
> and OS upgrade (running SuSE LES 8), and
> works fine now, other than this.  And the
> same thing happens regardless of other
> system activity.  The first time, it
> happened on a freshly booted, quiescent
> system.  It's since happened on a system
> where over half the RAM was in use by
> one process...
> 
> Thanks, though.  We will watch and see
> if there's anything else HW-wise.  I almost
> ran memtest after the upgrade, but one of
> our engineers was in a bind for this system.
> 
> Since it's a compute server and the only
> problem so far is that nautilus barfs, I
> don't plan to take the system down until
> after the crunch...
> 
> Thanks,
> Miles

You can always check the IPMI sensor data and the SEL (System Event Log),
either via the SP (service processor) or from the platform operating system
(try FreeIPMI to get hardware-level logging). No kernel module needed.
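
If you want to skim the SEL for memory trouble without reading the whole log, something like the following might do as a rough sketch. It assumes a current FreeIPMI release that ships the `ipmi-sel` command, and that you run it with enough privileges for in-band access (usually root). The keyword list and the function names are my own illustration, not anything FreeIPMI defines, so adjust them to whatever your BMC actually logs.

```python
import shutil
import subprocess

# Illustrative keyword list (an assumption, not an official FreeIPMI list).
MEMORY_KEYWORDS = ("memory", "ecc", "dimm")


def memory_related(lines):
    """Return the lines that mention any memory keyword, case-insensitively."""
    return [ln for ln in lines if any(k in ln.lower() for k in MEMORY_KEYWORDS)]


def read_sel(command="ipmi-sel"):
    """Run the SEL tool with no arguments and return its output lines.

    Returns an empty list if the tool is not installed or exits with an error.
    """
    if shutil.which(command) is None:
        return []
    result = subprocess.run([command], capture_output=True, text=True, check=False)
    if result.returncode != 0:
        return []
    return result.stdout.splitlines()


if __name__ == "__main__":
    # Print only the SEL entries that look memory-related.
    for line in memory_related(read_sel()):
        print(line)
```

An empty result only means no entry matched the keywords, not that the RAM is healthy, so it is worth reading the full SEL as well.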

Also, if this system isn't in production and you have a little time, you can
get the Knoppix CD and boot into memtest86. It will report most problems with
RAM.

Just a helpful (hopefully) hint from Heloise,

John

##############################################
# John Goebel <jgoebel(at)slac.stanford.edu> #
# Stanford Linear Accelerator Center         #
# 2575 Sand Hill Road, Menlo Park, CA 94025  #
############################################ #
