What is the best way for a moderately large (I count about 180 machines)
cluster of SL3.0.5 machines to NFS mount a fileserver which provides
home directories, essential executables, common configuration files,
etc. (i.e., not large amounts of data)?
The NFS server in question has been reasonably reliable (it is now a
commercial NAS which internally is running a Debian variant,
http://www.open-e.com/, but we have had homemade SL and Gentoo NFS
servers in that position, and the same question comes up), but still,
there are failures at what I crudely estimate as a MTBF of about 50-100
days in which the NFS server fails so badly that it has to be rebooted
or power cycled. That may be an acceptable failure rate, but when it
happens, we almost always end up rebooting most of the 180 machines in
the cluster. We have tools that help with that, but it's certainly not a
one-button operation.
So the question is, do we try to make the NFS server an order of
magnitude more reliable (how do we do that?) or do we try to make the
NFS clients recover more gracefully from a failure of their server (how
do we do that?)?
The clients mount the server's filesystems via the automounter, using
maps supplied by a (Sun) NIS server; the mount directives look like:
-nfsvers=3,hard,wsize=32768,rsize=32768
phnxsb0.phenix.bnl.gov:/share/software
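For reference, a full line in the NIS-served automounter map would look
something like the following (the mount key "software" here is a guess;
the options and server path are the ones from the directive above):

```
software  -nfsvers=3,hard,wsize=32768,rsize=32768  phnxsb0.phenix.bnl.gov:/share/software
```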
so the mounts are all hard and nointr (the default). The NFS FAQ scares
one away from soft mounts; maybe intr would be better, but I suspect
we'd end up rebooting anyway rather than hunting down processes that can
be killed. What would really be best is this: everything hangs while the
NFS server is down; when the server is rebooted, the stale mounts
disappear, the hung processes die, we restart them, and we're up and
running again. Is there any hope of accomplishing that?
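One partial sketch of client-side recovery, offered as an untested
assumption rather than a known-good procedure: after the server comes
back, lazily detach the stale mount point and then look for processes
stuck in uninterruptible sleep ("D" state), which are typically the ones
that were blocked on the dead mount. The mount point below is assumed
from the map entry above; whether your SL3 kernel's umount supports -l
(lazy unmount, util-linux 2.11 and later) is something to verify first.

```shell
#!/bin/sh
# Hypothetical mount point, taken from the automounter example above.
MNT=/share/software

# Lazy unmount: detach the mount point from the namespace immediately;
# the kernel finishes the cleanup once the last blocked process lets go.
# (Commented out here so the script is safe to run as a dry run.)
# umount -l "$MNT"

# List PID and command of processes whose state starts with "D"
# (uninterruptible sleep) -- candidates that were hung on the mount.
ps -eo pid=,stat=,comm= | awk '$2 ~ /^D/ { print $1, $3 }'
```

Processes hung on a hard,nointr mount cannot be killed while the server
is down, but once the server answers again (or the mount is lazily
detached and they error out), they can be restarted without rebooting
the whole client.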
--
John Haggerty
email: [log in to unmask]
voice/fax: 631 344 2286/4592
http://www.phenix.bnl.gov/~haggerty