SCIENTIFIC-LINUX-USERS Archives

May 2010

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Charles G Waldman
Reply To: Charles G Waldman
Date: Wed, 12 May 2010 17:32:28 -0500
Hello all Scientific Linux users and experts:

About a month ago we started seeing a large number of nodes go
into a state where they use 100% system CPU, the load average climbs
to about 100, and no useful work gets done.  Nodes do not recover
from this state without a reboot.  The log files show many messages
like:

uct2-c185/kern20100511:May 11 10:04:34 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
uct2-c185/kern20100511:May 11 12:06:36 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
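
To gauge how many nodes are affected, messages like these can be tallied
per host and per stuck task. The following is only a minimal sketch: the
regular expression is an assumption based solely on the two sample lines
above, and the name count_lockups is hypothetical, not part of any
existing tool.

```python
import re
from collections import Counter

# Pattern inferred from the two sample log lines only (an assumption):
#   <host>/<logfile>:<timestamp> ... BUG: soft lockup - CPU#<n> stuck for <s>s! [<task>]
LOCKUP_RE = re.compile(
    r"^(?P<host>[^/\s]+)/\S+?:.*BUG: soft lockup - CPU#(?P<cpu>\d+) "
    r"stuck for (?P<secs>\d+)s! \[(?P<task>[^\]]+)\]"
)


def count_lockups(lines):
    """Count soft-lockup messages, keyed by (host, task).

    `lines` is any iterable of log lines; lines that do not match
    the assumed format are ignored.
    """
    counts = Counter()
    for line in lines:
        match = LOCKUP_RE.search(line)
        if match:
            counts[(match.group("host"), match.group("task"))] += 1
    return counts
```

Passing an open file of collected log lines to count_lockups and printing
the result of most_common() would list the hosts and tasks that report
the lockup most often.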

Doing a little research led us to believe that we were seeing this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=547530

According to that page, the fix has been backported to
kernel-2.6.18-164.11.1.el5.

We upgraded all of our cluster hosts to this kernel version, but the error
is still occurring.  Any ideas or suggestions?

   Thanks,

	  - Charles
