Date: | Mon, 19 Dec 2005 16:07:43 +0000 |
On Fri, 16 Dec 2005, John Haggerty wrote:
> Martin et al.,
>
> Thanks for that description of your experience. It does sound very similar
> to what I am seeing. We're planning to upgrade SL 3.0.2 to 3.0.5, which has
> kernel version 2.4.21-32.0.1.EL, but it's taking a little while to get that
> deployed.
>
> Since my post, I have been able to log on to several of the machines as root
> while they are otherwise hung (which is presumably due to hard mounts coupled
> with NFS volumes in the PATH).
>
> What I see in a hung machine are several of the NFS mounts that are ok; the
> daemons running, the network ok, etc. Basically everything I know how to
> look at is ok, except for one hard mounted NFS filesystem, which is a disk
> where much software resides, things like shared libraries for the CERN root
> software. That filesystem I can mount again by hand and access without
> problem, but when I do anything that touches the original mount, I'm in
> permanent hard-mount-hang.

It may not help you much, but you *can* forcibly umount the bad directory
using the force option (-f on umount), which has been available since Linux
2.2.x. The lazy option (-l) was added in 2.4.x. In particular, you will
probably need -fl to do the forced umount lazily (needed if the server
isn't responding for whatever reason).

This will cause anything with open/mapped files on the fs to be killed the
next time it accesses the bad-fs, but that may still be somewhat better
than having to reboot.
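
For illustration, here is a minimal Python sketch of wrapping the -fl
umount described above. The function names and the /mnt/sw path are
hypothetical, and the call has to run as root; it simply shells out to
the standard umount command with -f and -l.

```python
import subprocess


def build_umount_command(mount_point, force=True, lazy=True):
    """Return the argv list for unmounting mount_point."""
    cmd = ["umount"]
    if force:
        cmd.append("-f")  # force: do not wait on an unreachable server
    if lazy:
        cmd.append("-l")  # lazy: detach now, clean up when no longer busy
    cmd.append(mount_point)
    return cmd


def force_lazy_umount(mount_point, run=subprocess.run):
    """Run 'umount -f -l mount_point'; return True if it exited with 0.

    'run' is injectable only so the function can be exercised without
    actually unmounting anything.
    """
    result = run(build_umount_command(mount_point))
    return result.returncode == 0
```

Calling force_lazy_umount("/mnt/sw") would then be the scripted
equivalent of typing umount -fl /mnt/sw by hand.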

We have a script, run periodically, which checks for hung mounts (in our
case they are usually caused by servers really being down or by network
problems) and forcibly umounts them. We can then restart am-utils to mount
from a different replica (for applications/libs at least). Of course it
doesn't help much for things which are mounted rw (home/data stores etc.),
but at least apps which aren't touching the bad-fs can carry on.
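
A rough sketch of what such a periodic check could look like follows.
This is not our actual script, just one way to do it under stated
assumptions: the NFS mounts are listed in /proc/mounts, "stat -f" (a
statfs call via GNU stat) is a good enough probe, and a probe that has
not returned within the timeout means the mount is hung. All function
names and the timeout are made up for the example. The probe runs in a
child process and is never waited on after the timeout, because a process
stuck on a hard mount may not be killable.

```python
import subprocess
import time


def nfs_mount_points(mounts_text):
    """Return mount points of NFS filesystems from /proc/mounts-style text."""
    points = []
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, fs type, options, dump, pass
        if len(fields) >= 3 and fields[2] in ("nfs", "nfs4"):
            points.append(fields[1])
    return points


def probe_is_hung(command, timeout=10.0, poll_interval=0.1):
    """Run command in a child; return True if it has not exited in time."""
    proc = subprocess.Popen(
        command, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        if proc.poll() is not None:
            return False  # probe finished (successfully or not): not hung
        time.sleep(poll_interval)
    proc.kill()  # best effort only; deliberately no wait() afterwards
    return True


def find_hung_mounts(mount_points, timeout=10.0, make_probe=None):
    """Return the mount points whose probe did not finish within timeout."""
    if make_probe is None:
        # Assumption: statfs on the mount point has to talk to the server.
        make_probe = lambda mp: ["stat", "-f", mp]
    return [mp for mp in mount_points if probe_is_hung(make_probe(mp), timeout)]


if __name__ == "__main__":
    with open("/proc/mounts") as f:
        candidates = nfs_mount_points(f.read())
    for mp in find_hung_mounts(candidates):
        print("hung:", mp)
```

Run from cron, the list it prints is what would then be handed to the
forced umount, followed by the am-utils restart.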

Of course, with am-utils we can query whether a directory is *believed* to
be alive (without always having to touch the mount point). I don't know
whether autofs has a similar interface.

> A couple of the machines are hung differently; they are scrolling
>
> RPC: sendmsg returned error 12
-- Jon