SCIENTIFIC-LINUX-USERS Archives

December 2005

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject: Re: NFS... problems? or the perfect distributed file system?
From: "Bly, MJ (Martin)" <[log in to unmask]>
Reply-To: "Bly, MJ (Martin)"
Date: Fri, 16 Dec 2005 21:08:35 +0000
Content-Type: text/plain
> -----Original Message-----
> From: [log in to unmask] 
> [mailto:[log in to unmask]] On 
> Behalf Of John Haggerty
> Sent: Friday, December 16, 2005 7:32 PM
> To: [log in to unmask]
> Cc: Scientific Linux Users
> Subject: Re: NFS... problems? or the perfect distributed file system?
> 
> Martin et al.,
> 
> Thanks for that description of your experience.  It does 
> sound very similar to what I am seeing.  We're planning to 
> upgrade SL 3.0.2 to 3.0.5, which has kernel version 
> 2.4.21-32.0.1.EL, but it's taking a little while to get that 
> deployed.
> 
> Since my post, I have been able to log on to several of the 
> machines as root while they are otherwise hung (which is 
> presumably due to hard mounts coupled with NFS volumes in the PATH).

We saw the same.  A hung client could still see every part of the hung file system except anything under the point it had hung on - as soon as any process on the affected machine touches that area, it hangs too.  Other machines can see all of it, including the hung area.
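
For anyone not familiar with the options involved, the behaviour comes down to the standard NFS hard/soft/intr mount flags.  A sketch of the sort of /etc/fstab entries in question - the server name and paths here are made up for illustration:

  # hard (the default): processes block until the server answers again
  nfsserver:/export/software  /software  nfs  rw,hard,intr  0 0
  # soft: operations eventually return an I/O error instead of hanging
  nfsserver:/export/scratch   /scratch   nfs  rw,soft,timeo=100,retrans=3  0 0

'intr' at least lets you kill a process stuck on a hard mount; 'soft' avoids the hang altogether but risks silent data corruption on writes, which is why software and home areas are normally left hard-mounted.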

> What I see in a hung machine are several of the NFS mounts 
> that are ok; the daemons running, the network ok, etc.  
> Basically everything I know how to look at is ok, except for 
> one hard mounted NFS filesystem, which is a disk where much 
> software resides, things like shared libraries for the CERN 
> root software.  That filesystem I can mount again by hand and 
> access without problem, but when I do anything that touches 
> the original mount, I'm in permanent hard-mount-hang.

Sounds familiar.
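
One way to show the server itself is healthy is to mount the same export somewhere else while the original mount point is still wedged - roughly (server and paths are illustrative):

  mkdir -p /mnt/nfstest
  mount -t nfs nfsserver:/export/software /mnt/nfstest
  ls /mnt/nfstest    # fine - the server is answering
  ls /software       # original mount point - still hangs

That matches what you describe: it's the client's state for the original mount that is stuck, not the server.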

> A couple of the machines are hung differently; they are scrolling
> 
> RPC: sendmsg returned error 12
> 
> eternally.  There is a frightening description of similar 
> behavior here:
> 
> https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=123226

Not seen that here...
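
For what it's worth, error 12 is ENOMEM, which suggests the RPC layer could not allocate memory (socket buffers) for the send.  A quick way to decode errno values like that from a shell:

  perl -e '$! = 12; print "$!\n"'     # prints: Cannot allocate memory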

M.

Martin Bly
RAL Tier1 Systems Team

--
  T: +44 (0) 1235 446981  F: +44 (0) 1235 446626 

> Bly, MJ (Martin) wrote:
> > John et al,
> > 
> > The symptoms you describe sound a bit like the NFS client 
> hang problems with some SL (RHEL) kernels in the range before 
> 2.4.21-32.x - possibly as early as 2.4.21-18 or so.  The 
> client processes doing the NFS access hang solid due to a 
> 'lost' interaction between server and client - the only 
> solution is a client reboot.  All variants of SL/RHEL show 
> this for the range of kernels above, and we also saw it for 
> RH 7.3 clients for kernels in a certain range.  
> > 
> > In fact, the more I read your post, the more it looks like this 
> problem...  We banged our heads against this for months on 
> and off as loads changed and the problem came and went.  It 
> is definitely load related.
> > 
> > There was a contrib kernel we put out with a patch that 
> backed out the patch that causes the problem.  I think it had 
> a version number of 2.4.21-27.0.something.ELSDR.  
> > 
> > We believe from anecdotal evidence that RH tried the patch 
> that causes the problem in at least two kernel ranges.  We 
> think they may have given up on the RH 7.3 tests but tried 
> again with RHEL 3 - it's always possible the same 'patch' was 
> used on the RH 8 kernel series. 
> > 
> > Anyway, they fixed it by taking out the patch at 2.4.21-32.EL.
> > 
> > And there's a gotcha in the stock autofs for 3.0.5 if you 
> use the & substitution syntax in your maps:
> > 
> > * < mount options removed > &.stage.rl.ac.uk:/stage/&
> > 
> > If you mount a non-existent file system such as /stage/fred, the 
> machine panics and dies.  The autofs for 3.0.4 works (I 
> think), as do the ones for 3.0.3 and 3.0.6.
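
To spell out that gotcha, a sketch of the sort of map layout being described - the mount options and timeout here are made up, only the & substitution matters:

  # /etc/auto.master
  /stage  /etc/auto.stage  --timeout=60

  # /etc/auto.stage - wildcard key; & expands to the key being looked up
  *  -rw,hard,intr  &.stage.rl.ac.uk:/stage/&

  # so "ls /stage/fred" tries fred.stage.rl.ac.uk:/stage/fred; with the
  # 3.0.5 autofs, a key with no matching export reportedly panics the box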
> > 
> > Martin
> > RAL Tier1 Systems Team.
> > 
> > 
> > 
> > 
> > 
> >>-----Original Message-----
> >>From: [log in to unmask]
> >>[mailto:[log in to unmask]] On 
> Behalf Of 
> >>Miles O'Neal
> >>Sent: Wednesday, December 14, 2005 7:42 PM
> >>To: Scientific Linux Users
> >>Subject: Re: NFS... problems? or the perfect distributed 
> file system?
> >>
> >>John Haggerty said...
> >>|
> >>|The discussion of distributed filesystems inspired me to 
> start a new 
> >>|thread on NFS.  Like probably everyone on this list, we use NFS to 
> >>|share files (home directories, whatever) among machines.  
> It pretty 
> >>|much works ok... except for the occasional problem which
> >>appears to be
> >>|related to NFS.  Does anyone else have low level NFS
> >>problems, or am I the only one?
> >>
> >>We see the same problem.  Sometimes a bunch of systems will see it, 
> >>other times it's just one or two.  Some of the compute farm systems 
> >>that have been up for close to a year have screens full of the 
> >>messages when you plug a console in.
> >>
> >>We saw it with RH8, we see it with SL304.
> >>
> >>-Miles
> >>
> 
> --
> John Haggerty
> email: [log in to unmask]
> voice/fax: 631 344 2286/4592
> http://www.phenix.bnl.gov/~haggerty
> 
