LISTSERV - SCIENTIFIC-LINUX-USERS Archives

December 2005

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	Re: NFS... problems? or the perfect distributed file system?
From:	Jon Peatfield <[log in to unmask]>
Reply To:	Jon Peatfield <[log in to unmask]>
Date:	Mon, 19 Dec 2005 16:07:43 +0000

On Fri, 16 Dec 2005, John Haggerty wrote:

> Martin et al.,
>
> Thanks for that description of your experience.  It does sound very similar 
> to what I am seeing.  We're planning to upgrade SL 3.0.2 to 3.0.5, which has 
> kernel version 2.4.21-32.0.1.EL, but it's taking a little while to get that 
> deployed.
>
> Since my post, I have been able to log on to several of the machines as root 
> while they are otherwise hung (which is presumably due to hard mounts coupled 
> with NFS volumes in the PATH).
>
> What I see in a hung machine are several of the NFS mounts that are ok; the 
> daemons running, the network ok, etc.  Basically everything I know how to 
> look at is ok, except for one hard mounted NFS filesystem, which is a disk 
> where much software resides, things like shared libraries for the CERN root 
> software.  That filesystem I can mount again by hand and access without 
> problem, but when I do anything that touches the original mount, I'm in 
> permanent hard-mount-hang.

It may not help you much, but you *can* forcibly umount the bad directory 
using the force option (-f on umount), which has been available since Linux 
2.2.x.  The lazy option (-l) was added in 2.4.x.  In particular, you will 
probably need -fl to do the forcible umount lazily (needed if the server 
isn't responding for whatever reason).
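
As a rough illustration of the forced, lazy unmount described above, here 
is a minimal Python 3 sketch.  The function names and the mount point 
/mnt/software are hypothetical (nothing in this thread names them); only 
the -f and -l flags of umount come from the message.  It defaults to a dry 
run, since a real unmount needs root.

```python
import subprocess


def force_lazy_umount_cmd(mount_point):
    """Build the argv for a forced (-f), lazy (-l) unmount of mount_point."""
    return ["umount", "-f", "-l", mount_point]


def force_lazy_umount(mount_point, dry_run=True):
    """Unmount mount_point; with dry_run=True only return the command."""
    cmd = force_lazy_umount_cmd(mount_point)
    if dry_run:
        return cmd
    # Needs root. check=True raises CalledProcessError on a non-zero exit.
    subprocess.run(cmd, check=True)
    return cmd


if __name__ == "__main__":
    # /mnt/software is a made-up mount point used only for illustration.
    print(" ".join(force_lazy_umount("/mnt/software")))
```

Passing the flags separately (-f -l) is equivalent to the combined -fl 
form mentioned above.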

This will cause any process with open/mapped files on the fs to be killed 
the next time it accesses the bad fs, but that may still be somewhat better 
than having to reboot.

We have a script, run periodically, which checks for hung mounts (in our 
case they are usually caused by servers really being down or by network 
problems) and forcibly umounts them.  We can then restart am-utils to mount 
from a different replica (for applications/libs at least).  Of course it 
doesn't help much for things which are mounted rw (home/data stores etc.), 
but at least apps which aren't touching the bad fs can carry on.
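
The script itself is not shown in this message, so the following is only 
one possible way to detect a hung mount, sketched in Python 3.  It assumes 
that running "stat -f" on the mount point blocks when the NFS server is 
unreachable; the function names, the 5-second timeout and the probe 
command are all assumptions.  The probe runs in a child process and is 
never waited on after the timeout, because a process stuck on a hard mount 
may not die even when killed.

```python
import subprocess
import time


def command_finishes(cmd, timeout):
    """Return True if cmd exits (with any status) within timeout seconds."""
    child = subprocess.Popen(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if child.poll() is not None:
            return True
        time.sleep(0.05)
    # Best effort only: do not wait(), the child may be unkillable
    # while it is blocked on a hung hard mount.
    child.kill()
    return False


def mount_responds(path, timeout=5.0):
    """Probe path with 'stat -f'; False means the probe timed out."""
    return command_finishes(["stat", "-f", path], timeout)


if __name__ == "__main__":
    # "/" is just a local path that should always answer quickly.
    print(mount_responds("/"))
```

A True result only means the probe returned, not that the mount is 
healthy; a False result marks a candidate for the forced umount described 
above.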

Of course with am-utils we can query whether a directory is *believed* to 
be alive (without always having to touch the mount-point).  I don't know 
whether autofs has a similar interface or not.

> A couple of the machines are hung differently; they are scrolling
>
> RPC: sendmsg returned error 12

  -- Jon
