LISTSERV - SCIENTIFIC-LINUX-USERS Archives

May 2010

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	kernel BUG: soft lockup in kernel 2.6.18-164.15.1.el5
From:	Charles G Waldman <[log in to unmask]>
Reply To:	Charles G Waldman <[log in to unmask]>
Date:	Wed, 12 May 2010 17:32:28 -0500

Hello all Scientific Linux users and experts:

About a month ago we started seeing a large number of nodes go
into a state where they would use 100% system CPU, the load would
climb to about 100, and no useful work would get done.  Nodes would
not recover from this state without a reboot.  The log files showed
many messages like:

uct2-c185/kern20100511:May 11 10:04:34 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
uct2-c185/kern20100511:May 11 12:06:36 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
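
To see how widespread the problem is, messages of this form can be tallied per host, CPU, and stuck task. The following is only a rough sketch: the regular expression is modelled on the two lines quoted above and assumes every log line follows the same syslog layout, and the function name `count_lockups` is made up for illustration.

```python
import re
from collections import Counter

# Pattern modelled on the two log lines quoted above. The leading
# "directory/file:" prefix (as left by grep) is simply skipped by search().
LOCKUP_RE = re.compile(
    r"(?P<host>\S+) kernel \d+ \[kern\.err\] kernel: BUG: soft lockup - "
    r"CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^:\]]+):(?P<pid>\d+)\]"
)

def count_lockups(lines):
    """Count soft-lockup messages keyed by (host, cpu, task name)."""
    counts = Counter()
    for line in lines:
        m = LOCKUP_RE.search(line)
        if m:
            counts[(m["host"], int(m["cpu"]), m["task"])] += 1
    return counts

if __name__ == "__main__":
    import sys
    # Usage (hypothetical): grep -r "soft lockup" logs/ | python count_lockups.py
    for (host, cpu, task), n in sorted(count_lockups(sys.stdin).items()):
        print(f"{host} CPU#{cpu} {task}: {n}")
```

Grouping by task name as well as host makes it easier to tell whether the lockups always involve the same kernel thread (here `events/0`) or vary from node to node.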

A little research led us to believe that we were seeing this bug:

https://bugzilla.redhat.com/show_bug.cgi?id=547530

According to that page, the fix has been backported to kernel-2.6.18-164.11.1.el5.

We upgraded all of our cluster hosts to this kernel version, but the error
is still occurring.  Any ideas or suggestions?
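
One sanity check is to confirm that each affected node is actually running a kernel at or above the release named in the bug report, since an upgraded package only takes effect after a reboot. The sketch below compares the numeric fields of the release string; this is a simplification of RPM's real version ordering, and the helper names `release_key` and `has_fix` are made up for illustration.

```python
import platform
import re

# Release that, according to the bug report, carries the backported fix.
FIXED_RELEASE = "2.6.18-164.11.1.el5"

def release_key(release):
    """Turn '2.6.18-164.15.1.el5' into (2, 6, 18, 164, 15, 1).

    Simplification: only the numeric fields before '.el' are compared;
    this is not RPM's full version-comparison algorithm.
    """
    base = release.split(".el")[0]
    return tuple(int(n) for n in re.findall(r"\d+", base))

def has_fix(running, fixed=FIXED_RELEASE):
    """True if the running release sorts at or after the fixed release."""
    return release_key(running) >= release_key(fixed)

if __name__ == "__main__":
    running = platform.release()  # same string as `uname -r`
    status = "includes" if has_fix(running) else "predates"
    print(f"{running} {status} {FIXED_RELEASE}")
```

If a node reports a release that predates the fixed one, it is probably still booted into the old kernel and needs a reboot before the fix can be judged.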

   Thanks,

	  - Charles
