SCIENTIFIC-LINUX-USERS Archives

February 2015

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Arnau Bria <[log in to unmask]>
Reply-To: Arnau Bria <[log in to unmask]>
Date: Fri, 27 Feb 2015 13:03:46 +0100
Content-Type: text/plain
Parts/Attachments: text/plain (113 lines)
Dear all,

I'm running SL 6.5. The last update installed kernel
2.6.32-504.1.3.el6.x86_64.

Some of our nodes act as compute nodes in a Univa Grid Engine (UGE)
cluster, so they are used for running batch jobs. UGE supports cgroups:
for each job that runs on a node it creates a cpuset and sets some
memory limits through the UGE execution daemon (sge_execd).
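
As far as I understand, what sge_execd does for each job is roughly
equivalent to the following (the /cgroup/... paths and the UGE/job_12345
directory name are just my illustration of a typical EL6 cgroup v1
setup, not UGE's actual code):

# mkdir /cgroup/cpuset/UGE/job_12345                    # per-job cpuset
# echo 0-3 > /cgroup/cpuset/UGE/job_12345/cpuset.cpus   # cores assigned to the job
# echo 0 > /cgroup/cpuset/UGE/job_12345/cpuset.mems     # NUMA node(s), required before adding tasks
# mkdir /cgroup/memory/UGE/job_12345                    # per-job memory limits
# echo 8G > /cgroup/memory/UGE/job_12345/memory.limit_in_bytes
# echo $JOB_PID > /cgroup/cpuset/UGE/job_12345/tasks    # move the job into both groups
# echo $JOB_PID > /cgroup/memory/UGE/job_12345/tasks
[...]
# rmdir /cgroup/memory/UGE/job_12345                    # cleanup when the job finishes
# rmdir /cgroup/cpuset/UGE/job_12345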

This had been working nicely with our previous kernel,
2.6.32-431.29.2.el6.x86_64, but since the upgrade most of the nodes
have rebooted unexpectedly, leaving a vmcore in the crash directory. E.g.:

# ls -lsa /var/crash/127.0.0.1-2015-02-23-08\:35\:23/vmcore*
1153928 -rw------- 1 root root 1181615424 Feb 23 08:40 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore
    100 -rw-r--r-- 1 root root      99806 Feb 23 08:35 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore-dmesg.txt
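
In case it helps, I can also load the vmcore into the crash utility
(this needs the matching kernel-debuginfo package installed; the paths
below are the usual EL6 ones, so treat them as an assumption on my side):

# yum install crash kernel-debuginfo-2.6.32-504.1.3.el6.x86_64
# crash /usr/lib/debug/lib/modules/2.6.32-504.1.3.el6.x86_64/vmlinux \
        /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore
crash> bt     # backtrace of the task that triggered the panic
crash> log    # full kernel ring buffer from the dump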


When I look at the txt file I see some strange messages about a cgroup
BUG, but as I'm not a kernel expert I'd like to ask for some help on
this mailing list. The error shows:

<3>INFO: task bedtools:32790 blocked for more than 120 seconds.
<3>      Not tainted 2.6.32-504.1.3.el6.x86_64 #1
<3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
<6>bedtools      D 0000000000000008     0 32790  32789 0x00000080
<4> ffff881f1cf4d9a8 0000000000000082 0000000000000000 ffff881f1cf4d96c
<4> 0000000000000000 ffff88103fe71800 00001a68da2304b8 ffff880061b768c0
<4> 0000000000000800 0000000101b6c99f ffff882026c8a638 ffff881f1cf4dfd8
<4>Call Trace:
[...]
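
(As a side note, the same kind of blocked-task dump can be produced on
demand via sysrq; this is a standard kernel feature, nothing
UGE-specific:)

# echo w > /proc/sysrq-trigger    # ask the kernel to dump all blocked (D state) tasks
# dmesg | tail -n 100             # the dump ends up in the kernel log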
This kind of hung-task message is more or less common, and we see
similar complaints about other scientific programs (samtools, etc.),
but the important thing comes at the end of the file:



<4>------------[ cut here ]------------
<4>WARNING: at kernel/cgroup.c:4428 __css_put+0x70/0x80() (Not tainted)
<4>Hardware name: ProLiant BL460c Gen8
<4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>Pid: 3512, comm: sge_execd Not tainted 2.6.32-504.1.3.el6.x86_64 #1
<4>Call Trace:
<4> [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
<4> [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
<4> [<ffffffff810cff80>] ? __css_put+0x70/0x80
<4> [<ffffffff811813ce>] ? mem_cgroup_force_empty+0x3e/0x50
<4> [<ffffffff811813f4>] ? mem_cgroup_pre_destroy+0x14/0x20
<4> [<ffffffff810cfa90>] ? cgroup_rmdir+0xe0/0x560
<4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8119ccf0>] ? vfs_rmdir+0xc0/0xf0
<4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
<4> [<ffffffff8119ff64>] ? do_rmdir+0x184/0x1f0
<4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
<4> [<ffffffff811a0026>] ? sys_rmdir+0x16/0x20
<4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
<4>---[ end trace 8eae4afa57f7484f ]---
<4>------------[ cut here ]------------
<2>kernel BUG at kernel/cgroup.c:3725!
<4>invalid opcode: 0000 [#1] SMP 
<4>last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
<4>CPU 15 
<4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
<4>
<4>Pid: 3512, comm: sge_execd Tainted: G        W  ---------------    2.6.32-504.1.3.el6.x86_64 #1 HP ProLiant BL460c Gen8
<4>RIP: 0010:[<ffffffff810cfef6>]  [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
<4>RSP: 0018:ffff8820272e3db8  EFLAGS: 00010046
<4>RAX: 0000000000000004 RBX: ffff882028150200 RCX: ffffffff81c0cb00
<4>RDX: ffffc9001cf76000 RSI: ffff88102549a000 RDI: 0000000000000246
<4>RBP: ffff8820272e3e48 R08: 0000000000000000 R09: 0000000000000000
<4>R10: 000000000000000f R11: 0000000000000008 R12: 0000000000000000
<4>R13: ffff882028150308 R14: ffff8820272e3de8 R15: ffff882026e4e040
<4>FS:  00007fc508ca0740(0000) GS:ffff8810788e0000(0000) knlGS:0000000000000000
<4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
<4>CR2: 00007fc5085e1000 CR3: 00000020271f1000 CR4: 00000000000407e0
<4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
<4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
<4>Process sge_execd (pid: 3512, threadinfo ffff8820272e2000, task ffff882026e4e040)
<4>Stack:
<4> ffff8820272e3e28 ffffffff81c0cb00 ffff882028150220 ffff882028150318
<4><d> ffff882028150220 ffff88101ed25a00 0000000000000000 ffff882026e4e040
<4><d> ffffffff8109eb00 ffffffff81aaa768 ffffffff81aaa768 00007fc50842f400
<4>Call Trace:
<4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
<4> [<ffffffff8119ccf0>] vfs_rmdir+0xc0/0xf0
<4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
<4> [<ffffffff8119ff64>] do_rmdir+0x184/0x1f0
<4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
<4> [<ffffffff811a0026>] sys_rmdir+0x16/0x20
<4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
<4>Code: 45 00 31 c0 e9 40 fe ff ff 48 8d b8 90 00 00 00 e8 80 ae 45 00 e9 1e ff ff ff 48 8d b8 90 00 00 00 e8 6f af 45 00 e9 c9 fe ff ff <0f> 0b eb fe 0f 0b 0f 1f 40 00 eb fa 0f 0b eb fe 66 2e 0f 1f 84 
<1>RIP  [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
<4> RSP <ffff8820272e3db8>



Then the node reboots. I've read about tainted kernels, but I can't
figure out what's happening on my systems. Could anyone help me
understand what's going on? Is this a real cgroups bug? Is sge_execd
perhaps doing something strange when purging its cgroups? Should I
report this to the kernel developers or to Univa?
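
In case someone wants to compare with their own nodes, a rough way to
see which cgroup hierarchies are mounted and whether stale per-job
directories are left behind would be something like this (the
/cgroup/cpuset and /cgroup/memory mount points are the EL6 defaults, so
again an assumption on my side):

# cat /proc/cgroups                   # controllers known to the kernel and their usage counts
# mount -t cgroup                     # which hierarchies are mounted, and where
# find /cgroup/cpuset /cgroup/memory -maxdepth 2 -type d    # leftover per-job directories, if any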


Many thanks in advance,
Cheers,
Arnau
