SCIENTIFIC-LINUX-USERS Archives

March 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: S Senator <[log in to unmask]>
Date: Sat, 18 Mar 2006 14:19:52 -0700
Content-Type: text/plain

Hello,

I am sending this note to describe a problem we have had with Intel
Xeon processors when hyperthreading is enabled. Specifically, enabling
hyperthreading on the CPUs causes the machines to lock up under heavy load
when running highly parallelized code. We have had random node lockups since
October 2005, when we received parts to upgrade the majority of our nodes.
I am sending this summary so that anyone else who encounters a similar
problem will have a starting point for research.

Our older nodes are running Scientific Linux 3 and White Box Enterprise
Linux 3. The newer ones have run Scientific Linux 4.1 and 4.2. We have
also tried Scientific Linux 3 and White Box Enterprise Linux 3 on the
newer nodes, which did not prevent the problem. We have also tried:
1. reducing the room's temperature to ~60 degrees Fahrenheit, so that the
   exhaust air from the nodes is ~76 degrees
2. swapping out memory
3. repositioning the airflow into and out of the rack
4. updating the BIOS (see below for version numbers)
None of the above made any difference.

Our applications are highly parallelized, using MPICH over gigabit
Ethernet. They are limited by CPU and memory rather than by network or file
I/O. CPU usage was consistently above 95% on both processors whenever the
machines locked up. We never encountered any problems under lower CPU load
or with less parallelized code. Code of this kind causes a node lockup in
~3-6 hours.

We had some older nodes with the following configuration, which did not
exhibit this problem.
Older node configuration:
2.8 GHz Xeon (2 CPUs/node)
2 GB RAM
SuperMicro XPGA-GG motherboard
AMI BIOS v. 8.00.09
White Box Enterprise Linux 3 (liberation respin 2) or Scientific Linux 3
(3.05)
Differences between the factory BIOS settings and our configuration:
- enable "Legacy USB devices at boot"
- minor changes in boot ordering
  (network boot during installation, boot from hard drive after installation)
- disable hyperthreading

Newer node configuration:
2.8 GHz Xeon (2 CPUs/node)
4 GB RAM
Tyan Tiger i7320 S5350 motherboard
BIOS version 1.04 (as shipped), upgraded to version 1.07 or 1.08 (tried both)
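To check from a running node whether hyperthreading is actually off after changing the BIOS setting, one can compare the per-processor fields in /proc/cpuinfo. The following is a hypothetical helper, not something from our setup. It assumes the `siblings`, `cpu cores`, and `physical id` fields as exposed by 2.6-series kernels, and it treats a missing `cpu cores` field as one core per package, which holds for the single-core Xeons listed above but may not on other hardware.

```python
import os


def parse_cpuinfo(text):
    """Split /proc/cpuinfo text into one dict per logical processor."""
    procs, current = [], {}
    for line in text.splitlines():
        if not line.strip():
            # A blank line ends one processor's block.
            if current:
                procs.append(current)
                current = {}
            continue
        key, _, value = line.partition(":")
        current[key.strip()] = value.strip()
    if current:
        procs.append(current)
    return procs


def summarize(text):
    """Return (logical CPUs, physical packages) seen in the cpuinfo text."""
    procs = parse_cpuinfo(text)
    # Assumption: a missing 'physical id' means each entry is its own package.
    packages = {p.get("physical id", "cpu%d" % i) for i, p in enumerate(procs)}
    return len(procs), len(packages)


def hyperthreading_enabled(text):
    """True if any package reports more logical CPUs ('siblings') than cores.

    Assumption: a missing 'cpu cores' field is treated as 1 core per package
    (single-core Xeons); a missing 'siblings' field is treated as 1.
    """
    for proc in parse_cpuinfo(text):
        siblings = int(proc.get("siblings", "1"))
        cores = int(proc.get("cpu cores", "1"))
        if siblings > cores:
            return True
    return False


if __name__ == "__main__" and os.path.exists("/proc/cpuinfo"):
    with open("/proc/cpuinfo") as f:
        info = f.read()
    logical, packages = summarize(info)
    print("logical CPUs: %d, physical packages: %d" % (logical, packages))
    print("hyperthreading enabled:", hyperthreading_enabled(info))
```

On a dual-Xeon node of this type, one would expect four logical CPUs across two packages with hyperthreading enabled, and two logical CPUs across two packages with it disabled in the BIOS.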

Feel free to contact me if you have any questions,

-Steve Senator
 USAF Academy
 Modeling & Simulation Research Center
