SCIENTIFIC-LINUX-USERS Archives

August 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Pann McCuaig <[log in to unmask]>
Date:
Mon, 14 Aug 2006 16:13:42 -0400
Greetings!

I'm having an occasional problem with one host in a cluster of nine. The
eight hosts that are not giving me this headache are identical (except
for amount of RAM) Sun Fire V20z with dual Opteron 850 CPUs and either
4G or 8G of RAM. The problem host is our "big iron."

Platform Information
--------------------
Host: Sun Fire V40z Server with 4 * Opteron 850 CPU and 32GB RAM

OS: Scientific Linux 4.3, 2.6.9-34.0.1.ELsmp x86_64 kernel

Controller: Sun MegaRAID 320-2X Dual Ultra-320 SCSI Card (p/n X9269A)

Drives: 5 * 146GB 10K RPM Ultra320 SCSI Hard Drive (p/n X9257A)

Host: scsi1 Channel: 01 Id: 06 Lun: 00
  Vendor: SDR      Model: GEM318P          Rev: 1
  Type:   Processor                        ANSI SCSI revision: 02

Host: scsi1 Channel: 02 Id: 00 Lun: 00
  Vendor: MegaRAID Model: LD 0 RAID5  560G Rev: 413G
  Type:   Direct-Access                    ANSI SCSI revision: 02

Configuration is RAID 5 and I took all the defaults when setting it up.

/dev/sda1 on / type ext3 (rw)
/dev/sda4 on /tmp type ext3 (rw)
/dev/sda2 on /var type ext3 (rw)

Filesystem            Size  Used Avail Use% Mounted on
/dev/sda1             7.9G  2.0G  5.5G  27% /
/dev/sda4             437G   18G  398G   5% /tmp
/dev/sda2              63G  129M   60G   1% /var

The nature of the problem is an apparent "hang" for some small finite
period of time (somewhere between ten minutes and two hours according
to user accounts). When the host is "hung," response is very slow, load
average is quite high (above 10), and I/O wait (%wa in top) is high
(~40%). kjournald is always near the "top" of top, but doesn't appear
to be using many resources, either %CPU or %MEM. It's always there when
the host is "hung," though; when things are running "normally," it only
puts in the occasional appearance.
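
Since the hangs are only noticed by accident, one way to find out how often
they really occur might be to log I/O wait over time. The following is a
minimal sketch, not a tested tool for this host: it assumes Python 3 is
available (the stock Python on SL 4.x is much older, so a separately installed
interpreter would be needed), and the function names, the 10-second interval,
and the 30% threshold are all illustrative choices. It reads the aggregate
"cpu" line of /proc/stat, which on 2.6 kernels carries an iowait counter as
its fifth value.

```python
import time

# Counter order on the aggregate "cpu" line of /proc/stat (2.6 kernels).
# Newer kernels append more columns; they are ignored here.
FIELDS = ("user", "nice", "system", "idle", "iowait", "irq", "softirq")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu' line of /proc/stat into a dict of jiffies."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("not an aggregate cpu line: %r" % line)
    values = [int(v) for v in parts[1:1 + len(FIELDS)]]
    return dict(zip(FIELDS, values))


def iowait_percent(before, after):
    """Percentage of CPU time spent in iowait between two samples."""
    deltas = {key: after[key] - before[key] for key in before}
    total = sum(deltas.values())
    if total <= 0:
        return 0.0
    return 100.0 * deltas["iowait"] / total


def read_cpu_sample(path="/proc/stat"):
    """Read the current aggregate CPU counters."""
    with open(path) as stat_file:
        for line in stat_file:
            # "cpu " (with a space) matches the total, not cpu0, cpu1, ...
            if line.startswith("cpu "):
                return parse_cpu_line(line)
    raise RuntimeError("no aggregate cpu line found in %s" % path)


def watch(interval=10.0, threshold=30.0, samples=None):
    """Print a timestamped line whenever iowait meets the threshold.

    interval and threshold are hypothetical defaults; samples=None loops
    until interrupted.
    """
    previous = read_cpu_sample()
    taken = 0
    while samples is None or taken < samples:
        time.sleep(interval)
        current = read_cpu_sample()
        percent = iowait_percent(previous, current)
        if percent >= threshold:
            stamp = time.strftime("%Y-%m-%d %H:%M:%S")
            print("%s iowait %.1f%%" % (stamp, percent))
        previous = current
        taken += 1
```

Calling watch() from a background session and redirecting the output to a file
on a filesystem other than the RAID volume (so the logger itself is not stalled
by the hang) would give a rough timeline to compare against user reports and
against the syslog.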

I really don't know how often it happens. One user reported it happening
on two consecutive days, but this host is used primarily for big SAS or
Stata or Matlab jobs that run in the background (sometimes for days) so
that the "hangs" could happen fairly frequently without being noticed.
That very large /tmp partition is for the use of these programs; /home
is on NFS (gigabit ethernet) and our users have learned they get much
better response using the local drive.

The "hang" always "repairs itself" without human intervention.

My working theory is that there is some sort of negative interaction
between kjournald and the RAID driver; disk accesses (or at least
writes) are being inhibited while kjournald goes about its business.

I'm looking for any suggestions about how to troubleshoot the problem.
I'm reasonably knowledgeable about Linux, but this is my first
experience with RAID. Pointers to any FMs I should R are welcome.

Cheers,
 Pann
-- 
Pann McCuaig <[log in to unmask]>                212-854-8689
Systems Coordinator, Economics Department, Columbia University
Department Computing Resources:
               http://www.columbia.edu/cu/economics/computing/
