What is the best way for a moderately large (I count about 180 machines)
cluster of SL3.0.5 machines to NFS mount a fileserver which provides
home directories, essential executables, common configuration files,
etc. (i.e., not large amounts of data)?
The NFS server in question has been reasonably reliable (it is now a
commercial NAS which internally is running a Debian variant,
http://www.open-e.com/, but we have had homemade SL and Gentoo NFS
servers in that position, and the same question comes up), but still,
there are failures at what I crudely estimate as a MTBF of about 50-100
days in which the NFS server fails so badly that it has to be rebooted
or power cycled. That may be an acceptable failure rate, but when it
happens, we almost always end up rebooting most of the 180 machines in
the cluster. We have tools that help with that, but it's certainly not a
one-button operation.
So the question is, do we try to make the NFS server an order of
magnitude more reliable (how do we do that?) or do we try to make the
NFS clients recover more gracefully from a failure of their server (how
do we do that?)?
The clients mount the server's filesystems via the automounter, using
maps supplied by a (Sun) NIS server; the mount directives look like:
-nfsvers=3,hard,wsize=32768,rsize=32768
phnxsb0.phenix.bnl.gov:/share/software
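For reference, a full line in the NIS-served automounter map would look
something like the following (the mount key "software" here is a guess;
the options and server path are the ones from the directive above):

```
software  -nfsvers=3,hard,wsize=32768,rsize=32768  phnxsb0.phenix.bnl.gov:/share/software
```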
so the mounts are all hard and nointr (the default). The NFS FAQ scares
one away from soft mounts; maybe intr would be better, but I suspect
we'd end up rebooting anyway rather than hunting down processes that can
be killed. What would really be best is this: everything hangs while the
NFS server is down; when the server is rebooted, the stale mounts
disappear, the hung processes die, we restart them, and we're up and
running again. Is there any hope of accomplishing that?
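One partial sketch of client-side recovery, offered as an untested
assumption rather than a known-good procedure: after the server comes
back, lazily detach the stale mount point and then look for processes
stuck in uninterruptible sleep ("D" state), which are typically the ones
that were blocked on the dead mount. The mount point below is assumed
from the map entry above; whether your SL3 kernel's umount supports -l
(lazy unmount, util-linux 2.11 and later) is something to verify first.

```shell
#!/bin/sh
# Hypothetical mount point, taken from the automounter example above.
MNT=/share/software

# Lazy unmount: detach the mount point from the namespace immediately;
# the kernel finishes the cleanup once the last blocked process lets go.
# (Commented out here so the script is safe to run as a dry run.)
# umount -l "$MNT"

# List PID and command of processes whose state starts with "D"
# (uninterruptible sleep) -- candidates that were hung on the mount.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/ { print $1, $3 }'
```

Processes hung on a hard,nointr mount cannot be killed while the server
is down, but once the server answers again (or the mount is lazily
detached and they error out), they can be restarted without rebooting
the whole client.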
--
John Haggerty
email: [log in to unmask]
voice/fax: 631 344 2286/4592
http://www.phenix.bnl.gov/~haggerty