Date: Mon, 14 Aug 2006 15:20:46 -0500
Do you have free memory? Or maybe the question should be: do you have
too much free memory?
There was a patch put in by the upstream vendor around 2.6.9-22 that
changed how much free memory was "available". If you see more free memory
than you expect, then this might be the bug.
It is fixed in an errata kernel that will be pushed very shortly.
https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=188141
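
To put a number on "more free memory than you expect", something like
the following Python sketch could help. It assumes /proc/meminfo has
the usual MemTotal, MemFree, Buffers and Cached fields; the function
names are made up for illustration, and the script only reports
percentages, since what counts as "too much" depends on your workload.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into {field name: integer value}.

    Most values are in kB; a few fields (e.g. HugePages_Total) are
    plain counts and are kept as-is.
    """
    info = {}
    for line in text.splitlines():
        name, sep, rest = line.partition(":")
        if not sep:
            continue  # not a "Name: value" line
        fields = rest.split()
        if fields and fields[0].isdigit():
            info[name.strip()] = int(fields[0])
    return info


def free_summary(info):
    """Return MemFree and Buffers+Cached as percentages of MemTotal."""
    total = info["MemTotal"]
    cache = info.get("Buffers", 0) + info.get("Cached", 0)
    return {
        "free_pct": 100.0 * info["MemFree"] / total,
        "cache_pct": 100.0 * cache / total,
    }


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        summary = free_summary(parse_meminfo(f.read()))
    print("MemFree: %.1f%% of RAM, buffers+cache: %.1f%% of RAM"
          % (summary["free_pct"], summary["cache_pct"]))
```

Run it once while the host is "hung" and once while it is behaving
normally, then compare the two sets of numbers.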
-Connie Sieh
On Mon, 14 Aug 2006, Pann McCuaig wrote:
> Greetings!
>
> I'm having an occasional problem with one host in a cluster of nine. The
> eight hosts that are not giving me this headache are identical (except
> for amount of RAM) Sun Fire V20z with dual Opteron 850 CPUs and either
> 4G or 8G of RAM. The problem host is our "big iron."
>
> Platform Information
> --------------------
> Host: Sun Fire V40z Server with 4 * Opteron 850 CPU and 32GB RAM
>
> OS: Scientific Linux 4.3, 2.6.9-34.0.1.ELsmp x86_64 kernel
>
> Controller: Sun MegaRAID 320-2X Dual Ultra-320 SCSI Card (p/n X9269A)
>
> Drives: 5 * 146GB 10K RPM Ultra320 SCSI Hard Drive (p/n X9257A)
>
> Host: scsi1 Channel: 01 Id: 06 Lun: 00
>   Vendor: SDR      Model: GEM318P          Rev: 1
>   Type:   Processor                        ANSI SCSI revision: 02
>
> Host: scsi1 Channel: 02 Id: 00 Lun: 00
>   Vendor: MegaRAID Model: LD 0 RAID5  560G Rev: 413G
>   Type:   Direct-Access                    ANSI SCSI revision: 02
>
> Configuration is RAID 5 and I took all the defaults when setting it up.
>
> /dev/sda1 on / type ext3 (rw)
> /dev/sda4 on /tmp type ext3 (rw)
> /dev/sda2 on /var type ext3 (rw)
>
> Filesystem            Size  Used Avail Use% Mounted on
> /dev/sda1             7.9G  2.0G  5.5G  27% /
> /dev/sda4             437G   18G  398G   5% /tmp
> /dev/sda2              63G  129M   60G   1% /var
>
> The nature of the problem is an apparent "hang" for some small finite
> period of time (somewhere between ten minutes and two hours according
> to user accounts). When the host is "hung," response is very slow. Load
> average is quite high (above 10), and % wa is high (~40%). kjournald
> is always near the "top" of top, but doesn't appear to be using many
> resources, either %CPU or %MEM. But it's always there when the host is
> "hung," and when things are running "normally," it only puts in the
> occasional appearance.
>
> I really don't know how often it happens. One user reported it happening
> on two consecutive days, but this host is used primarily for big SAS or
> Stata or Matlab jobs that run in the background (sometimes for days) so
> that the "hangs" could happen fairly frequently without being noticed.
> That very large /tmp partition is for the use of these programs; /home
> is on NFS (gigabit ethernet) and our users have learned they get much
> better response using the local drive.
>
> The "hang" always "repairs itself" without human intervention.
>
> My working theory is that there is some sort of negative interaction
> between kjournald and the RAID driver; disk accesses (or at least
> writes) are being inhibited while kjournald goes about its business.
>
> I'm looking for any suggestions about how to troubleshoot the problem.
> I'm reasonably knowledgeable about Linux, but this is my first
> experience with RAID. Pointers to any FMs I should R are welcome.
>
> Cheers,
> Pann
>