SCIENTIFIC-LINUX-USERS Archives

December 2005

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
John Haggerty <[log in to unmask]>
Reply To:
John Haggerty <[log in to unmask]>
Date:
Fri, 16 Dec 2005 14:32:06 -0500
Martin et al.,

Thanks for that description of your experience.  It does sound very 
similar to what I am seeing.  We're planning to upgrade SL 3.0.2 to 
3.0.5, which has kernel version 2.4.21-32.0.1.EL, but it's taking a 
little while to get that deployed.
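
For sorting machines into "probably affected" and "probably not", a minimal sketch in Python follows. It assumes the range given in Martin's quoted message below: kernels from roughly 2.4.21-18 up to, but not including, 2.4.21-32. The lower bound is his "or so" estimate, not a confirmed figure, and the function and constant names are made up for this illustration.

```python
import re

# Bounds taken from the quoted message; the lower one is only approximate.
AFFECTED_BASE = "2.4.21"
AFFECTED_FROM_BUILD = 18   # "possibly as early as 2.4.21-18 or so"
FIXED_AT_BUILD = 32        # "fixed it by taking out the patch at 2.4.21-32.EL"


def parse_release(release):
    """Split a release string such as '2.4.21-32.0.1.EL' into ('2.4.21', 32)."""
    match = re.match(r"^(\d+\.\d+\.\d+)-(\d+)", release)
    if match is None:
        raise ValueError("unrecognized kernel release: %r" % release)
    return match.group(1), int(match.group(2))


def possibly_affected(release):
    """Return True if the release falls inside the assumed problem range."""
    base, build = parse_release(release)
    return base == AFFECTED_BASE and AFFECTED_FROM_BUILD <= build < FIXED_AT_BUILD
```

On a running machine the release string could be taken from `platform.release()` (the same value `uname -r` prints). A True result only means the kernel is in the suspect range, not that the machine will hang.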

Since my post, I have been able to log on to several of the machines as 
root while they are otherwise hung (which is presumably due to hard 
mounts coupled with NFS volumes in the PATH).

What I see in a hung machine are several of the NFS mounts that are ok; 
the daemons running, the network ok, etc.  Basically everything I know 
how to look at is ok, except for one hard mounted NFS filesystem, which 
is a disk where much software resides, things like shared libraries for 
the CERN root software.  That filesystem I can mount again by hand and 
access without problem, but when I do anything that touches the original 
mount, I'm in permanent hard-mount-hang.
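
A rough way to test a mount without getting stuck in it yourself is to touch it from a child process and give up after a timeout. The sketch below assumes a `stat` binary on the PATH; the function name and the 5 second default are arbitrary choices for this illustration. On a hung hard mount the child will most likely stay blocked and cannot be killed, so the code never waits for it a second time.

```python
import subprocess


def mount_responds(path, timeout_s=5.0):
    """Return True if `stat path` finishes within timeout_s seconds.

    A True result only means the filesystem answered; stat may still
    have failed (for example, if the path does not exist).
    """
    proc = subprocess.Popen(
        ["stat", path],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        proc.wait(timeout=timeout_s)
        return True
    except subprocess.TimeoutExpired:
        # Best effort only: a process blocked on a hard NFS mount usually
        # ignores SIGKILL until the server answers, so do not wait again.
        proc.kill()
        return False
```

Each failed probe leaves one stuck `stat` process behind, so this is something to run occasionally, not in a tight loop.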

A couple of the machines are hung differently; they are scrolling

RPC: sendmsg returned error 12

eternally.  There is a frightening description of similar behavior here:

https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=123226

Bly, MJ (Martin) wrote:
> John et al, 
> 
> The symptoms you describe sound a bit like the NFS client hang problems with some SL (RHEL) kernels in the range before 2.4.21-32.x - possibly as early as 2.4.21-18 or so.  The client processes doing the NFS access hang solid due to a 'lost' interaction between server and client - the only solution is a client reboot.  All variants of SL/RHEL show this for the range of kernels above, and we also saw it for RH 7.3 clients for kernels in a certain range.  
> 
> In fact the more I read your post, the more it is this problem...  We banged our heads against this for months on and off as loads changed and the problem came and went.  It is definitely load related.
> 
> There was a contrib kernel we put out with a patch that backed out the patch that causes the problem.  I think it had a version number of 2.4.21-27.0.something.ELSDR.  
> 
> We believe from anecdotal evidence that RH tried the patch that causes the problem in at least two kernel ranges.  We think they may have given up on the RH 7.3 tests but tried again with RHEL 3 - it's always possible the same 'patch' was used on the RH 8 kernel series. 
> 
> Anyway, they fixed it by taking out the patch at 2.4.21-32.EL.
> 
> And there's a gotcha in the stock autofs for 3.0.5 if you use the & substitution syntax in your maps:
> 
> * < mount options removed > &.stage.rl.ac.uk:/stage/&
> 
> If you mount a nonexistent file system /stage/fred, the machine panics and dies.  The autofs for 3.0.4 works (I think) as do the ones for 3.0.3, 3.0.6.
> 
> Martin
> RAL Tier1 Systems Team.
> 
> 
> 
> 
> 
>>-----Original Message-----
>>From: [log in to unmask] 
>>[mailto:[log in to unmask]] On 
>>Behalf Of Miles O'Neal
>>Sent: Wednesday, December 14, 2005 7:42 PM
>>To: Scientific Linux Users
>>Subject: Re: NFS... problems? or the perfect distributed file system?
>>
>>John Haggerty said...
>>|
>>|The discussion of distributed filesystems inspired me to start a new 
>>|thread on NFS.  Like probably everyone on this list, we use NFS to 
>>|share files (home directories, whatever) among machines.  It pretty 
>>|much works ok... except for the occasional problem which 
>>appears to be 
>>|related to NFS.  Does anyone else have low level NFS 
>>problems, or am I the only one?
>>
>>We see the same problem.  Sometimes a bunch of systems will 
>>see it, other times it's just one or two.  Some of the 
>>compute farm systems that have been up for close to a year 
>>have screens full of the messages when you plug a console in.
>>
>>We saw it with RH8, we see it with SL304.
>>
>>-Miles
>>
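
For anyone unfamiliar with the map syntax in Martin's autofs example: a `*` key matches any name requested under the mount point, and each `&` in the location is replaced by that name. The Python sketch below only illustrates that substitution; it is not how autofs itself is implemented, it ignores escaping, and the function name is made up.

```python
def expand_wildcard_entry(key, location_template):
    """Replace every '&' in a wildcard map location with the looked-up key."""
    return location_template.replace("&", key)


# Looking up "fred" against the map line quoted above (mount options omitted).
location = expand_wildcard_entry("fred", "&.stage.rl.ac.uk:/stage/&")
print(location)
```

So a lookup of `fred` asks for `/stage/fred` from host `fred.stage.rl.ac.uk`, whether or not such a host or filesystem exists.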

-- 
John Haggerty
email: [log in to unmask]
voice/fax: 631 344 2286/4592
http://www.phenix.bnl.gov/~haggerty
