Date: Wed, 11 Mar 2015 09:46:37 -0500
Hi,
I've seen a similar problem with Slurm on various kernels:
http://bugs.schedmd.com/show_bug.cgi?id=1242
This is likely a kernel bug that has existed for a long time. I found a
mailing list message from November of 2011 with similar problems:
https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html
I finally decided to just disable cgroup enforcement in Slurm and use an
alternate Slurm method for killing jobs that go over the memory limit.
I did not file a bug with Red Hat at the time.
Andy
On Mon, Mar 02, 2015 at 02:49:52PM +0100, Andreas Haupt wrote:
> Hi Arnau,
>
> On Mon, Mar 02, 2015 at 10:59 +0100, Arnau Bria wrote:
> > In our case the only option is to downgrade. The bug affects any kind of
> > node and is not predictable, so the only option (if we want to run a
> > newer kernel) is removing cgroups support.
> > So in our case we can live with an old kernel version.
>
> As we are obviously encountering a race condition here, I wonder if you
> could gather some statistics. It is really just a small fraction of jobs
> that are affected here. In our case it looks like the chance of a crash
> increases if more than one job finishes at the same point in time.
>
> Do you observe something similar?
>
> Cheers,
> Andreas
> --
> | Andreas Haupt | E-Mail: [log in to unmask]
> | DESY Zeuthen | WWW: http://www-zeuthen.desy.de/~ahaupt
> | Platanenallee 6 | Phone: +49/33762/7-7359
> | D-15738 Zeuthen | Fax: +49/33762/7-7216
--
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104