LISTSERV - SCIENTIFIC-LINUX-USERS Archives

March 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	"NMI Watchdog detected LOCKUP" on newer SL kernels.
From:	John Hanks <[log in to unmask]>
Reply To:	[log in to unmask]
Date:	Sun, 26 Mar 2006 15:13:36 -0700
Content-Type:	text/plain
Parts/Attachments:	text/plain (56 lines)

Hello,

I started upgrading one of our clusters (dual dual-core Opterons, 4 GB
RAM, Myrinet) today, and with both 2.6.9-34 and 2.6.9-22.0.2 I see the
following when the nodes are pushed very hard:

NMI Watchdog detected LOCKUP, CPU=1, registers:
CPU 1
Modules linked in: gm(U) e1000(U) tg3(U) sd_mod(U) scsi_mod(U) ext3(U)
jbd(U) nfsd(U) exportfs(U) nfs(U) lockd(U) sunrpc(U)
Pid: 0, comm: swapper Tainted: PF     2.6.9-perfctr
RIP: 0010:[<ffffffff803046cb>] <ffffffff803046cb>{.text.lock.spinlock+2}
RSP: 0018:000001007ff8bf18  EFLAGS: 00000086
RAX: 0000000000000000 RBX: 00000100080095e0 RCX: 0000000000000004
RDX: 0000000000000008 RSI: 00000100080095e0 RDI: 00000100080095e0
RBP: 000001007ff8bf38 R08: 0000000000000008 R09: 00000100080095e0
R10: 00000000000000ff R11: 000000000000000c R12: 0000010081ee27e0
R13: 0000000000000008 R14: 0000010081ee27e0 R15: 0000000000000000
FS:  0000002a95aa66e0(0000) GS:ffffffff804d3c00(0000)
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000005e1730 CR3: 0000000082b32000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 000001007ff82000, task
000001000802b030)
Stack: 00000100080095e0 ffffffff801320d1 00000100080095e0
0000010081ee3b00
       000001007ff8bf98 ffffffff8013276c 0000000000000000
00000000801329ef
       00000000fffbedc2 0000000100000000
Call Trace:<IRQ> <ffffffff801320d1>{double_lock_balance+49}
<ffffffff8013276c>{rebalance_tick+313}
       <ffffffff8011c505>{smp_apic_timer_interrupt+49}
<ffffffff801109c5>{apic_timer_interrupt+133}
        <EOI> <ffffffff8010e609>{default_idle+0}
<ffffffff8010e629>{default_idle+32}
       <ffffffff8010e69c>{cpu_idle+26}

Code: 80 3b 00 7e f9 e9 68 fc ff ff f3 90 80 3b 00 7e f9 e9 d4 fc
Kernel panic - not syncing: nmi watchdog

What I've googled up seems to point at I/O on the PCI bus as the source
of the problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0410.2/0485.html

and I can cause nodes to drop out at will by doing things that put
load on the PCI bus, either through Myrinet or through the gig-E cards.

I'm working around it by staying at 2.6.9-11, but I recall reading here
that the newer kernels improve dual-core support, so if anyone recognizes
this and knows a fix, I'd appreciate pointers.

Thanks,

jbh
