SCIENTIFIC-LINUX-USERS Archives

March 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

From: John Hanks <[log in to unmask]>
Date: Sun, 26 Mar 2006 15:13:36 -0700

Hello,

I started upgrading one of our clusters (dual dual-core Opterons, 4 GB
RAM, Myrinet) today, and with kernels 2.6.9-34 and 2.6.9-22.0.2 I see
the following when my nodes are pushed very hard:

NMI Watchdog detected LOCKUP, CPU=1, registers:
CPU 1
Modules linked in: gm(U) e1000(U) tg3(U) sd_mod(U) scsi_mod(U) ext3(U)
jbd(U) nfsd(U) exportfs(U) nfs(U) lockd(U) sunrpc(U)
Pid: 0, comm: swapper Tainted: PF     2.6.9-perfctr
RIP: 0010:[<ffffffff803046cb>] <ffffffff803046cb>{.text.lock.spinlock+2}
RSP: 0018:000001007ff8bf18  EFLAGS: 00000086
RAX: 0000000000000000 RBX: 00000100080095e0 RCX: 0000000000000004
RDX: 0000000000000008 RSI: 00000100080095e0 RDI: 00000100080095e0
RBP: 000001007ff8bf38 R08: 0000000000000008 R09: 00000100080095e0
R10: 00000000000000ff R11: 000000000000000c R12: 0000010081ee27e0
R13: 0000000000000008 R14: 0000010081ee27e0 R15: 0000000000000000
FS:  0000002a95aa66e0(0000) GS:ffffffff804d3c00(0000)
knlGS:0000000000000000
CS:  0010 DS: 0018 ES: 0018 CR0: 000000008005003b
CR2: 00000000005e1730 CR3: 0000000082b32000 CR4: 00000000000006e0
Process swapper (pid: 0, threadinfo 000001007ff82000, task
000001000802b030)
Stack: 00000100080095e0 ffffffff801320d1 00000100080095e0
0000010081ee3b00
       000001007ff8bf98 ffffffff8013276c 0000000000000000
00000000801329ef
       00000000fffbedc2 0000000100000000
Call Trace:<IRQ> <ffffffff801320d1>{double_lock_balance+49}
<ffffffff8013276c>{rebalance_tick+313}
       <ffffffff8011c505>{smp_apic_timer_interrupt+49}
<ffffffff801109c5>{apic_timer_interrupt+133}
        <EOI> <ffffffff8010e609>{default_idle+0}
<ffffffff8010e629>{default_idle+32}
       <ffffffff8010e69c>{cpu_idle+26}

Code: 80 3b 00 7e f9 e9 68 fc ff ff f3 90 80 3b 00 7e f9 e9 d4 fc
Kernel panic - not syncing: nmi watchdog
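
For comparing traces from several nodes, it can help to pull just the
symbol names out of the oops text. The following is only a minimal sketch:
the regular expression and the parse_frames helper are hypothetical names
made up for illustration, and it assumes frames always look like
<16-hex-digit address>{symbol+offset} as in the trace above, with the
offset read as a decimal number exactly as printed.

```python
import re

# Matches entries such as <ffffffff801320d1>{double_lock_balance+49}.
# Markers like <IRQ> and <EOI> are skipped because they are not 16 hex digits.
FRAME_RE = re.compile(r"<([0-9a-f]{16})>\{([^}+]+)\+(\d+)\}")


def parse_frames(trace_text):
    """Return (address, symbol, offset) tuples in order of appearance.

    Hypothetical helper: the offset is interpreted as decimal, as printed.
    """
    return [
        (int(addr, 16), sym, int(off))
        for addr, sym, off in FRAME_RE.findall(trace_text)
    ]


# First three frames copied from the call trace above, wrapping included.
SAMPLE = """\
Call Trace:<IRQ> <ffffffff801320d1>{double_lock_balance+49}
<ffffffff8013276c>{rebalance_tick+313}
       <ffffffff8011c505>{smp_apic_timer_interrupt+49}
"""

for addr, sym, off in parse_frames(SAMPLE):
    print(f"{addr:#018x} {sym}+{off}")
```

Because the pattern ignores line breaks, it should cope with traces that a
mail client has re-wrapped, as long as no single frame is split in two.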

What I've googled up seems to point at I/O on the PCI bus as the source
of the problem:

http://www.ussg.iu.edu/hypermail/linux/kernel/0410.2/0485.html

and I can cause nodes to drop out at will by doing things that put load
on the PCI bus, either through Myrinet or through the GigE cards.

I'm working around it by staying at 2.6.9-11, but I recall reading here
that the newer kernels improve dual-core support, so if anyone recognizes
this and knows a fix, I'd appreciate pointers.

Thanks,

jbh
