SCIENTIFIC-LINUX-USERS Archives

March 2015

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Andy Wettstein <[log in to unmask]>
Reply To:
Andy Wettstein <[log in to unmask]>
Date:
Wed, 11 Mar 2015 09:46:37 -0500
Hi,

I've seen a similar problem with Slurm on various kernels:
http://bugs.schedmd.com/show_bug.cgi?id=1242

This is likely a kernel bug that has existed for a long time. I found a
mailing list message from November of 2011 with similar problems:
https://lists.linux-foundation.org/pipermail/containers/2011-November/028382.html

I finally decided to just disable cgroup enforcement in Slurm and use an
alternate Slurm method for killing jobs that go over the memory limit.

I did not file a bug with Red Hat at the time.

Andy


On Mon, Mar 02, 2015 at 02:49:52PM +0100, Andreas Haupt wrote:
> Hi Arnau,
> 
> On Monday, 02.03.2015, at 10:59 +0100, Arnau Bria wrote:
> > In our case the only option is to downgrade. The bug affects any kind of
> > node and is not predictable, so the only option (if we want to run a
> > newer kernel) is removing cgroups support.
> > So in our case we can live with an old kernel version.
> 
> As we are obviously hitting a race condition here, I wonder if you could
> gather some statistics. Only a small fraction of jobs is affected. In our
> case it looks like the chance of a crash increases if more than one job
> finishes at the same point in time.
> 
> Do you observe something similar?
> 
> Cheers,
> Andreas
> -- 
> | Andreas Haupt            | E-Mail: [log in to unmask]
> |  DESY Zeuthen            | WWW:    http://www-zeuthen.desy.de/~ahaupt
> |  Platanenallee 6         | Phone:  +49/33762/7-7359
> |  D-15738 Zeuthen         | Fax:    +49/33762/7-7216

-- 
andy wettstein
hpc system administrator
research computing center
university of chicago
773.702.1104
