Hello all Scientific Linux users and experts:
About a month ago we started seeing a large number of nodes going
into a state where they use 100% system CPU, the load goes to
about 100, and no useful work gets done. Nodes do not recover
from this state without a reboot. The log files show many messages
like:
uct2-c185/kern20100511:May 11 10:04:34 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
uct2-c185/kern20100511:May 11 12:06:36 uct2-c185 kernel 03 [kern.err] kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]
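
In case it helps anyone compare against their own logs, here is a minimal sketch of how such messages could be tallied per host and kernel thread. It assumes the exact line layout of the two samples above (file prefix, timestamp, hostname, then the kernel message); the names SOFT_LOCKUP, count_soft_lockups and SAMPLE are made up for illustration and are not part of any existing tool.

```python
import re
from collections import Counter

# Pattern for the log layout shown above (an assumption based on those two lines):
#   <file>:<Mon DD HH:MM:SS> <host> ... BUG: soft lockup - CPU#<n> stuck for <s>s! [<task>]
SOFT_LOCKUP = re.compile(
    r"^(?P<file>[^:]+):"
    r"(?P<stamp>\w{3}\s+\d+ \d\d:\d\d:\d\d) "
    r"(?P<host>\S+) .*"
    r"BUG: soft lockup - CPU#(?P<cpu>\d+) stuck for (?P<secs>\d+)s! "
    r"\[(?P<task>[^\]]+)\]"
)


def count_soft_lockups(lines):
    """Count soft-lockup messages, keyed by (host, stuck task)."""
    counts = Counter()
    for line in lines:
        match = SOFT_LOCKUP.search(line)
        if match:
            counts[(match.group("host"), match.group("task"))] += 1
    return counts


# The two sample lines quoted in this message.
SAMPLE = [
    "uct2-c185/kern20100511:May 11 10:04:34 uct2-c185 kernel 03 [kern.err] "
    "kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]",
    "uct2-c185/kern20100511:May 11 12:06:36 uct2-c185 kernel 03 [kern.err] "
    "kernel: BUG: soft lockup - CPU#0 stuck for 10s! [events/0:26]",
]

if __name__ == "__main__":
    for (host, task), n in sorted(count_soft_lockups(SAMPLE).items()):
        print(host, task, n)
```

Feeding it the full log files instead of SAMPLE would show whether the lockups always involve the same kernel thread.
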
Doing a little research led us to believe that we were seeing this bug:
https://bugzilla.redhat.com/show_bug.cgi?id=547530
According to that page, the fix has been backported to
kernel-2.6.18-164.11.1.el5. We upgraded all of our cluster hosts to
this kernel version, but the error is still occurring. Any ideas or
suggestions?
Thanks,
- Charles