Date: | Fri, 27 Feb 2015 13:03:46 +0100 |
Content-Type: | text/plain |
Dear all,
I'm running SL 6.5. The last update installed kernel
2.6.32-504.1.3.el6.x86_64.
Some of our nodes are compute nodes in a Univa Grid Engine (UGE)
cluster, so they are used for running batch jobs. UGE supports cgroups,
and for each job that runs on a node the UGE daemon (sge_execd) creates
a cpuset and sets some memory limits.
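As a rough illustration of the per-job cgroup lifecycle described above, here is a minimal sketch. It is not UGE's actual code: the function names, the job directory name and the mount points in the usage note are assumptions, and only the cgroup v1 control file names (cpuset.cpus, cpuset.mems, memory.limit_in_bytes, tasks) are taken from the kernel interface.

```python
import os


def create_job_cgroup(controller_root, job_name, settings):
    """Create <controller_root>/<job_name> and write each control file.

    controller_root is wherever the controller is mounted (an assumption,
    e.g. /cgroup/memory); settings maps control file names to values.
    """
    path = os.path.join(controller_root, job_name)
    os.mkdir(path)
    for control_file, value in settings.items():
        with open(os.path.join(path, control_file), "w") as handle:
            handle.write(str(value))
    return path


def attach_pid(cgroup_path, pid):
    """Move one process into the cgroup by writing its PID to 'tasks'."""
    with open(os.path.join(cgroup_path, "tasks"), "w") as handle:
        handle.write(str(pid))


def remove_job_cgroup(cgroup_path):
    """Remove the cgroup once the job is done.

    On a real cgroup filesystem this rmdir is handled by the kernel's
    cgroup_rmdir(), the function named in the trace further down.
    """
    os.rmdir(cgroup_path)
```

On a node this might be called as create_job_cgroup("/cgroup/cpuset", "job_1234", {"cpuset.cpus": "0-3", "cpuset.mems": "0"}) and create_job_cgroup("/cgroup/memory", "job_1234", {"memory.limit_in_bytes": 4 * 1024**3}), followed by attach_pid() for the job's processes and remove_job_cgroup() when the job ends. The paths, job name and values are hypothetical.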
This worked fine with our previous kernel,
2.6.32-431.29.2.el6.x86_64, but since the upgrade most of the nodes
have rebooted unexpectedly, leaving a vmcore in the crash directory. For example:
# ls -lsa /var/crash/127.0.0.1-2015-02-23-08\:35\:23/vmcore
1153928 -rw------- 1 root root 1181615424 feb 23 08:40 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore
100 -rw-r--r-- 1 root root 99806 feb 23 08:35 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore-dmesg.txt
When I look at the txt file I see a strange message about a cgroup
BUG, but as I'm not a kernel expert I'd like to ask this mailing list
for help. The file shows:
<3>INFO: task bedtools:32790 blocked for more than 120 seconds.
<3> Not tainted 2.6.32-504.1.3.el6.x86_64 #1
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>bedtools D 0000000000000008 0 32790 32789 0x00000080
<4> ffff881f1cf4d9a8 0000000000000082 0000000000000000 ffff881f1cf4d96c
<4> 0000000000000000 ffff88103fe71800 00001a68da2304b8 ffff880061b768c0
<4> 0000000000000800 0000000101b6c99f ffff882026c8a638 ffff881f1cf4dfd8
<4>Call Trace:
[...]
This is more or less common, and we see similar messages for other
scientific programs (samtools, etc.), but the important part comes at
the end of the file:
<4>------------[ cut here ]------------
<4>WARNING: at kernel/cgroup.c:4428 __css_put+0x70/0x80() (Not tainted)
<4>Hardware name: ProLiant BL460c Gen8
<4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>Pid: 3512, comm: sge_execd Not tainted 2.6.32-504.1.3.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
<4> [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
<4> [<ffffffff810cff80>] ? __css_put+0x70/0x80
<4> [<ffffffff811813ce>] ? mem_cgroup_force_empty+0x3e/0x50
<4> [<ffffffff811813f4>] ? mem_cgroup_pre_destroy+0x14/0x20
<4> [<ffffffff810cfa90>] ? cgroup_rmdir+0xe0/0x560
<4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8119ccf0>] ? vfs_rmdir+0xc0/0xf0
<4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
<4> [<ffffffff8119ff64>] ? do_rmdir+0x184/0x1f0
<4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
<4> [<ffffffff811a0026>] ? sys_rmdir+0x16/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
<4>---[ end trace 8eae4afa57f7484f ]---
<4>------------[ cut here ]------------
<4>------------[ cut here ]------------
<2>kernel BUG at kernel/cgroup.c:3725!
<4>invalid opcode: 0000 [#1] SMP
<4>last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
<4>CPU 15
<4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3512, comm: sge_execd Tainted: G W --------------- 2.6.32-504.1.3.el6.x86_64 #1 HP ProLiant BL460c Gen8
<4>RIP: 0010:[<ffffffff810cfef6>] [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
<4>RSP: 0018:ffff8820272e3db8 EFLAGS: 00010046
<4>RAX: 0000000000000004 RBX: ffff882028150200 RCX: ffffffff81c0cb00
<4>RDX: ffffc9001cf76000 RSI: ffff88102549a000 RDI: 0000000000000246
<4>RBP: ffff8820272e3e48 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 000000000000000f R11: 0000000000000008 R12: 0000000000000000
<4>R13: ffff882028150308 R14: ffff8820272e3de8 R15: ffff882026e4e040
<4>FS: 00007fc508ca0740(0000) GS:ffff8810788e0000(0000) knlGS:0000000000000000
<4>CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 00007fc5085e1000 CR3: 00000020271f1000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process sge_execd (pid: 3512, threadinfo ffff8820272e2000, task ffff882026e4e040)
<4>Stack:
<4> ffff8820272e3e28 ffffffff81c0cb00 ffff882028150220 ffff882028150318
<4><d> ffff882028150220 ffff88101ed25a00 0000000000000000 ffff882026e4e040
<4><d> ffffffff8109eb00 ffffffff81aaa768 ffffffff81aaa768 00007fc50842f400
<4>Call Trace:
<4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8119ccf0>] vfs_rmdir+0xc0/0xf0
<4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
<4> [<ffffffff8119ff64>] do_rmdir+0x184/0x1f0
<4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
<4> [<ffffffff811a0026>] sys_rmdir+0x16/0x20
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 45 00 31 c0 e9 40 fe ff ff 48 8d b8 90 00 00 00 e8 80 ae 45 00 e9 1e ff ff ff 48 8d b8 90 00 00 00 e8 6f af 45 00 e9 c9 fe ff ff <0f> 0b eb fe 0f 0b 0f 1f 40 00 eb fa 0f 0b eb fe 66 2e 0f 1f 84
<1>RIP [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
<4> RSP <ffff8820272e3db8>
Then the node reboots. I've read about tainted kernels, but I
can't figure out what's happening on my systems. Could anyone help me
understand what is going on? Is this a real cgroups bug? Could
sge_execd be doing something strange when purging cgroups? Should I
report this to the kernel developers (sip) or to Univa?
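For anyone who wants to check whether other nodes hit the same messages, a minimal sketch for pulling the WARNING and BUG lines out of a vmcore-dmesg.txt could look like this. The function names are hypothetical, and the pattern only covers the two line formats quoted above.

```python
import re

# Matches the two kinds of lines quoted above. The optional "<N>" prefix is
# the log level that vmcore-dmesg.txt keeps at the start of each line.
PATTERN = re.compile(
    r"^(?:<\d>)?\s*(?P<kind>WARNING: at|kernel BUG at)\s+(?P<where>\S+?:\d+)"
)


def find_kernel_reports(lines):
    """Return (line_number, kind, source_location) for each WARNING/BUG line."""
    reports = []
    for number, line in enumerate(lines, start=1):
        match = PATTERN.match(line)
        if match:
            is_bug = match.group("kind").startswith("kernel BUG")
            kind = "BUG" if is_bug else "WARNING"
            reports.append((number, kind, match.group("where")))
    return reports


def scan_file(path):
    """Scan one vmcore-dmesg.txt file (hypothetical helper)."""
    with open(path, errors="replace") as handle:
        return find_kernel_reports(handle)
```

Running scan_file() over each /var/crash/*/vmcore-dmesg.txt would show whether every crash reports the same kernel/cgroup.c locations or whether they differ from node to node.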
Many thanks in advance,
Cheers,
Arnau