SCIENTIFIC-LINUX-USERS Archives

March 2015

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Andreas Haupt <[log in to unmask]>
Reply To:
Andreas Haupt <[log in to unmask]>
Date:
Mon, 2 Mar 2015 08:33:47 +0100
Content-Type:
text/plain
Hi Arnau,

Over the weekend we managed to provoke identical behaviour. Jobs
crash during the epilog phase when the job's cgroup gets removed.
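
To make the removal step concrete, here is a minimal sketch of inspecting and removing a job's cgroup on a cgroup v1 hierarchy. It assumes only the standard v1 `tasks` file. The function names are hypothetical, and how sge_execd actually performs this step is not known from this thread. It only illustrates the operation that ends in the rmdir call; it is not a workaround for the crash.

```python
import os


def cgroup_pids(cgroup_dir):
    """Return the PIDs listed in the cgroup v1 'tasks' file of cgroup_dir."""
    with open(os.path.join(cgroup_dir, "tasks")) as f:
        return [int(tok) for tok in f.read().split()]


def remove_if_empty(cgroup_dir):
    """Remove the cgroup directory only if no tasks remain in it.

    Returns True if rmdir was called, False if tasks were still attached.
    On a real cgroup filesystem the kernel removes the control files
    itself, so a plain rmdir on the directory is sufficient.
    """
    if cgroup_pids(cgroup_dir):
        return False
    os.rmdir(cgroup_dir)
    return True
```

The directory passed in would be the job's cgroup under whichever mount point the site uses; that path is site-specific and deliberately left to the caller.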

Have you already opened a bug report with Univa or anywhere else?

Cheers,
Andreas

On Friday, 27.02.2015, at 13:03 +0100, Arnau Bria wrote:
> Dear all,
> 
> I'm running SL 6.5. The last update installed kernel
> 2.6.32-504.1.3.el6.x86_64.
> 
> Some of our nodes are compute nodes in a Univa Grid Engine (UGE)
> cluster, so they are used for running batch jobs. UGE supports cgroups,
> and each job that runs on a node creates a cpuset and sets some
> memory limits through the UGE daemon (sge_execd).
> 
> It worked nicely with our previous kernel,
> 2.6.32-431.29.2.el6.x86_64, but since the upgrade most of the nodes
> have rebooted unexpectedly, leaving a vmcore in the crash directory, e.g.:
> 
> # ls -lsa /var/crash/127.0.0.1-2015-02-23-08\:35\:23/vmcore
> 1153928 -rw------- 1 root root 1181615424 feb 23 08:40 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore
> 	100 -rw-r--r-- 1 root root      99806 feb 23 08:35 /var/crash/127.0.0.1-2015-02-23-08:35:23/vmcore-dmesg.txt
> 
> 
> When I look at the txt file I see a strange message about a cgroup
> BUG, but as I'm not a kernel expert I'd like to ask this mailing list
> for some help. The error shows:
> 
> <3>INFO: task bedtools:32790 blocked for more than 120 seconds.
> <3>      Not tainted 2.6.32-504.1.3.el6.x86_64 #1
> <3>"echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message.
> <6>bedtools      D 0000000000000008     0 32790  32789 0x00000080
> <4> ffff881f1cf4d9a8 0000000000000082 0000000000000000 ffff881f1cf4d96c
> <4> 0000000000000000 ffff88103fe71800 00001a68da2304b8 ffff880061b768c0
> <4> 0000000000000800 0000000101b6c99f ffff882026c8a638 ffff881f1cf4dfd8
> <4>Call Trace:
> [...]
> This is more or less common, and we see similar complaints about other
> scientific programs (samtools, etc.), but the important part comes at
> the end of the file:
> 
> 
> 
> <4>------------[ cut here ]------------
> <4>WARNING: at kernel/cgroup.c:4428 __css_put+0x70/0x80() (Not tainted)
> <4>Hardware name: ProLiant BL460c Gen8
> <4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> <4>Pid: 3512, comm: sge_execd Not tainted 2.6.32-504.1.3.el6.x86_64 #1
> <4>Call Trace:
> <4> [<ffffffff81074df7>] ? warn_slowpath_common+0x87/0xc0
> <4> [<ffffffff81074e4a>] ? warn_slowpath_null+0x1a/0x20
> <4> [<ffffffff810cff80>] ? __css_put+0x70/0x80
> <4> [<ffffffff811813ce>] ? mem_cgroup_force_empty+0x3e/0x50
> <4> [<ffffffff811813f4>] ? mem_cgroup_pre_destroy+0x14/0x20
> <4> [<ffffffff810cfa90>] ? cgroup_rmdir+0xe0/0x560
> <4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> <4> [<ffffffff8119ccf0>] ? vfs_rmdir+0xc0/0xf0
> <4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
> <4> [<ffffffff8119ff64>] ? do_rmdir+0x184/0x1f0
> <4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
> <4> [<ffffffff811a0026>] ? sys_rmdir+0x16/0x20
> <4> [<ffffffff8100b072>] ? system_call_fastpath+0x16/0x1b
> <4>---[ end trace 8eae4afa57f7484f ]---
> <4>------------[ cut here ]------------
> 
> <4>------------[ cut here ]------------
> <2>kernel BUG at kernel/cgroup.c:3725!
> <4>invalid opcode: 0000 [#1] SMP 
> <4>last sysfs file: /sys/devices/virtual/dmi/id/sys_vendor
> <4>CPU 15 
> <4>Modules linked in: nfs lockd fscache auth_rpcgss nfs_acl sunrpc ipt_addrtype xt_conntrack iptable_filter ipt_MASQUERADE iptable_nat nf_nat nf_conntrack_ipv4 nf_conntrack nf_defrag_ipv4 ip_tables bridge dm_thin_pool dm_bio_prison dm_persistent_data dm_bufio libcrc32c 8021q garp stp llc autofs4 cpufreq_ondemand freq_table pcc_cpufreq ipv6 ext3 jbd microcode power_meter acpi_ipmi ipmi_si ipmi_msghandler iTCO_wdt iTCO_vendor_support hpilo hpwdt sg be2iscsi iscsi_boot_sysfs libiscsi scsi_transport_iscsi be2net serio_raw lpc_ich mfd_core ioatdma dca shpchp ext4 jbd2 mbcache sd_mod crc_t10dif hpsa video output dm_mirror dm_region_hash dm_log dm_mod [last unloaded: scsi_wait_scan]
> <4>
> <4>Pid: 3512, comm: sge_execd Tainted: G        W  ---------------    2.6.32-504.1.3.el6.x86_64 #1 HP ProLiant BL460c Gen8
> <4>RIP: 0010:[<ffffffff810cfef6>]  [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
> <4>RSP: 0018:ffff8820272e3db8  EFLAGS: 00010046
> <4>RAX: 0000000000000004 RBX: ffff882028150200 RCX: ffffffff81c0cb00
> <4>RDX: ffffc9001cf76000 RSI: ffff88102549a000 RDI: 0000000000000246
> <4>RBP: ffff8820272e3e48 R08: 0000000000000000 R09: 0000000000000000
> <4>R10: 000000000000000f R11: 0000000000000008 R12: 0000000000000000
> <4>R13: ffff882028150308 R14: ffff8820272e3de8 R15: ffff882026e4e040
> <4>FS:  00007fc508ca0740(0000) GS:ffff8810788e0000(0000) knlGS:0000000000000000
> <4>CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
> <4>CR2: 00007fc5085e1000 CR3: 00000020271f1000 CR4: 00000000000407e0
> <4>DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> <4>DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400
> <4>Process sge_execd (pid: 3512, threadinfo ffff8820272e2000, task ffff882026e4e040)
> <4>Stack:
> <4> ffff8820272e3e28 ffffffff81c0cb00 ffff882028150220 ffff882028150318
> <4><d> ffff882028150220 ffff88101ed25a00 0000000000000000 ffff882026e4e040
> <4><d> ffffffff8109eb00 ffffffff81aaa768 ffffffff81aaa768 00007fc50842f400
> <4>Call Trace:
> <4> [<ffffffff8109eb00>] ? autoremove_wake_function+0x0/0x40
> <4> [<ffffffff8119ccf0>] vfs_rmdir+0xc0/0xf0
> <4> [<ffffffff8119bdea>] ? lookup_hash+0x3a/0x50
> <4> [<ffffffff8119ff64>] do_rmdir+0x184/0x1f0
> <4> [<ffffffff810e5c87>] ? audit_syscall_entry+0x1d7/0x200
> <4> [<ffffffff811a0026>] sys_rmdir+0x16/0x20
> <4> [<ffffffff8100b072>] system_call_fastpath+0x16/0x1b
> <4>Code: 45 00 31 c0 e9 40 fe ff ff 48 8d b8 90 00 00 00 e8 80 ae 45 00 e9 1e ff ff ff 48 8d b8 90 00 00 00 e8 6f af 45 00 e9 c9 fe ff ff <0f> 0b eb fe 0f 0b 0f 1f 40 00 eb fa 0f 0b eb fe 66 2e 0f 1f 84 
> <1>RIP  [<ffffffff810cfef6>] cgroup_rmdir+0x546/0x560
> <4> RSP <ffff8820272e3db8>
> 
> 
> 
> Then the node reboots. I've read about tainted kernels, but I
> can't figure out what's happening on my systems. Could anyone help me
> understand what is going on? Is this a real cgroups bug? Is sge_execd
> perhaps doing something strange when purging cgroups? Should I
> report this to the kernel developers (sip) or to Univa?
> 
> 
> Many thanks in advance,
> Cheers,
> Arnau
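
For comparing crashes across several nodes, it may help to pull the key facts out of each vmcore-dmesg.txt automatically. The following is a minimal sketch that assumes the log lines look like those quoted above. The function name `summarize_dmesg` and the dictionary keys are made up for illustration.

```python
import re

# Matches "kernel BUG at file:line!" and "WARNING: at file:line func+off".
REPORT_RE = re.compile(
    r"(?P<kind>kernel BUG|WARNING:) at (?P<file>[^\s:]+):(?P<line>\d+)"
)
# Matches the "Pid: N, comm: NAME" line that follows each report.
PID_RE = re.compile(r"Pid: (?P<pid>\d+), comm: (?P<comm>\S+)")


def summarize_dmesg(text):
    """Return one dict per WARNING/BUG report found in a dmesg dump."""
    reports = []
    for line in text.splitlines():
        hit = REPORT_RE.search(line)
        if hit:
            reports.append({
                "kind": "BUG" if hit.group("kind") == "kernel BUG" else "WARNING",
                "file": hit.group("file"),
                "line": int(hit.group("line")),
                "pid": None,
                "comm": None,
            })
            continue
        who = PID_RE.search(line)
        # Attach the first Pid/comm line after a report to that report.
        if who and reports and reports[-1]["comm"] is None:
            reports[-1]["pid"] = int(who.group("pid"))
            reports[-1]["comm"] = who.group("comm")
    return reports
```

Run on the excerpt quoted above, this yields a WARNING at kernel/cgroup.c:4428 and a BUG at kernel/cgroup.c:3725, both attributed to sge_execd (pid 3512). The mail quoting prefix does not need to be stripped, because the patterns are searched anywhere in the line.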

-- 
| Andreas Haupt            | E-Mail: [log in to unmask]
|  DESY Zeuthen            | WWW:    http://www-zeuthen.desy.de/~ahaupt
|  Platanenallee 6         | Phone:  +49/33762/7-7359
|  D-15738 Zeuthen         | Fax:    +49/33762/7-7216
