SCIENTIFIC-LINUX-USERS Archives

February 2005

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Miles O'Neal <[log in to unmask]>
Reply To:
Miles O'Neal <[log in to unmask]>
Date:
Thu, 10 Feb 2005 16:32:35 -0600
Michael David Joy said...

|Regardless of the fact that the system is probably new, you might try
|checking a few hardware issues.
|
|Two that I'd suggest are the memory and the memory config. I've had at least
|one dual Opteron system 'eat' memory (i.e. ECC memory errors on one of the
|CPU memory banks). The replacement memory failed too; it turned out to be a
|bad CPU that was somehow frying the memory via the memory bus. AMD is
|still investigating the failure. I'd guess that it's tied to the
|integrated memory controller. The CPU finally failed after 3 months of
|operation.
|
|Anyway, try putting all the memory in one CPU's memory bank if it's a
|NUMA model (separate bank of memory per CPU) to disable the NUMA
|abilities.

Both banks are full.  We added more RAM at the
same time as the OS upgrade.  But...

|Also, you might try memtest86 and see if one of the memory modules is
|definitely bad. I've had more than a few memory modules test good and
|finally fail after a month or two of running under a full load.

...the system worked fine before the RAM
and OS upgrade (running SuSE LES 8), and
works fine now, other than this.  And the
same thing happens regardless of other
system activity.  The first time, it
happened on a freshly booted, quiescent
system.  It's since happened on a system
where over half the RAM was in use by
one process...

Thanks, though.  We will watch and see
if there's anything else HW-wise.  I almost
ran memtest after the upgrade, but one of
our engineers was in a bind for this system.

Since it's a compute server and the only
problem so far is that nautilus barfs, I
don't plan to take the system down until
after the crunch...

Thanks,
Miles
