On Wed, 11 Mar 2015 09:46:37 -0500
Andy Wettstein wrote:
> Hi,
Hi Andy,
> I've seen a similar problem with Slurm on various kernels:
> http://bugs.schedmd.com/show_bug.cgi?id=1242
This is not the same issue that we are seeing:
- In our case the system reboots.
- I see it when many jobs finish at the same time, not when jobs
finish one by one.
The cgroups setup had been working until the last kernel upgrade.
> This is likely a kernel bug that has existed for a long time. I found
> a mailing list message from November of 2011 with similar problems:
> https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html
Well, in my case it works perfectly with the "old" kernel
2.6.32-431.29.2.el6.x86_64, so it seems that something has been fixed since 2011.
> I finally decided to just disable cgroup enforcement in slurm and use
> an alternate slurm method for killing jobs that go over the memory
> limit.
I use cgroups not only for limiting memory usage; I also like the
resource usage isolation (cpusets).
> I did not file a bug with redhat at the time.
It seems that RH accepted Andrea's bug, so there does appear to be
something wrong there.
> Andy
Cheers,
Arnau