Date: Sun, 26 Mar 2006 15:13:36 -0700
Hello,
I started upgrading one of our clusters (dual dual-core Opterons, 4 GB
RAM, Myrinet) today, and with both 2.6.9-34 and 2.6.9-22.0.2 I see the
following when my nodes are pushed very hard:
NMI Watchdog detected LOCKUP, CPU=1, registers:
CPU 1
Modules linked in: gm(U) e1000(U) tg3(U) sd_mod(U) scsi_mod(U) ext3(U)
jbd(U) nfsd(U) exportfs(U) nfs(U) lockd(U) sunrpc(U)
Pid: 0, comm: swapper Tainted: PF 2.6.9-perfctr
RIP: 0010:[<ffffffff803046cb>] <ffffffff803046cb>{.text.lock.spinlock+2}
RSP: 0018:000001007ff8bf18 EFLAGS: 00000086
RAX: 0000000000000000 RBX: 00000100080095e0 RCX: 0000000000000004
RDX: 0000000000000008 RSI: 00000100080095e0 RDI: 00000100080095e0
RBP: 000001007ff8bf38 R08: 0000000000000008 R09: 00000100080095e0
R10: 00000000000000ff R11: 000000000000000c R12: 0000010081ee27e0
R13: 0000000000000008 R14: 0000010081ee27e0 R15: 0000000000000000
FS: 0000002a95aa66e0(0000) GS:ffffffff804d3c00(0000)
knlGS:0000000000000000
CS: 0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000005e1730 CR3: 0000000082b32000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 000001007ff82000, task
000001000802b030)
Stack: 00000100080095e0 ffffffff801320d1 00000100080095e0
0000010081ee3b00
000001007ff8bf98 ffffffff8013276c 0000000000000000
00000000801329ef
00000000fffbedc2 0000000100000000
Call Trace:<IRQ> <ffffffff801320d1>{double_lock_balance+49}
<ffffffff8013276c>{rebalance_tick+313}
<ffffffff8011c505>{smp_apic_timer_interrupt+49}
<ffffffff801109c5>{apic_timer_interrupt+133}
<EOI> <ffffffff8010e609>{default_idle+0}
<ffffffff8010e629>{default_idle+32}
<ffffffff8010e69c>{cpu_idle+26}
Code: 80 3b 00 7e f9 e9 68 fc ff ff f3 90 80 3b 00 7e f9 e9 d4 fc
Kernel panic - not syncing: nmi watchdog
What I've found by searching seems to point to I/O on the PCI bus as the
source of the problem:
http://www.ussg.iu.edu/hypermail/linux/kernel/0410.2/0485.html
I can make nodes drop out at will by putting load on the PCI bus, through
either Myrinet or the GigE cards.
I'm working around it by staying on 2.6.9-11, but I recall reading here
that the newer kernels improve dual-core support, so if anyone recognizes
this and knows a fix, I'd appreciate pointers.
Thanks,
jbh