I have devoted more hours than I can count to figuring out NFS problems.
I'm not sure whether I have actually solved them or just outlived
them, but at the moment, I'm almost afraid to say it, things are
actually working pretty well.
I have a similar situation: a couple hundred machines mounting files
from a couple of fileservers, though ours are either homemade or from
a much lower-end vendor than NetApp (to be fair, that vendor has given
me excellent support when I call about problems that are clearly
server-related). Our configurations are all distributed via NIS too,
and we use the automounter, although this rather terrifying FAQ from
the upstream vendor about automount led us to make sure we had
autofs-4.1.3-130.i386 on all of our machines:
http://kbase.redhat.com/faq/FAQ_79_5925.shtm
About a year ago, I had terrible problems:
http://listserv.fnal.gov/scripts/wa.exe?A2=ind0512&L=scientific-linux-users&T=0&P=18323
(and I see you did too), but we upgraded everything (ok, almost
everything) to SL 3.0.5 with kernel 2.4.21-32.0.1.ELsmp.
The mount options we went to last summer are:
rw,nfsvers=3,tcp,hard,nointr,wsize=8192,rsize=8192,timeo=600,retrans=2
and the smaller wsize and rsize (they used to be 32768) seemed to
help, with very little loss in performance (as measured by Bonnie++).
Those are basically what is recommended in that NetApp application
note, which is an excellent source of information. Changing the
transport to tcp really did seem to make the mounts far more robust,
as you might expect; sometimes it feels like there is a momentary
hang, but it generally recovers in a few seconds.
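One thing that bit me while tuning these: timeo is expressed in tenths
of a second (per nfs(5)), which is easy to misread. A quick sketch of
what the values in our options actually mean, as I understand them
(retrans is how many retries happen before the kernel logs "server not
responding"):

```shell
# timeo is in deciseconds (tenths of a second) -- see nfs(5).
timeo=600      # the value in our mount options
retrans=2      # retries before "server not responding" is logged

echo "initial timeout: $((timeo / 10)) s"
echo "retries before 'server not responding': $retrans"
```

So timeo=600 is a full 60-second timeout, whereas the old default of
timeo=7 was only 0.7 seconds; over tcp the long timeout makes sense,
since the transport handles its own retransmission.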
The real problem is that the diagnostic tools available for debugging
NFS seem very limited, at least the ones I could find; it's often
difficult even to tell whether a problem is on the client side or the
server side.
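One check that is always available, though, is comparing the options
you asked for against what the kernel actually negotiated, via
/proc/mounts. A minimal sketch using a hypothetical mount line (the
filer name and export are made up; a real client would grep
/proc/mounts itself):

```shell
# Hypothetical line in the format of /proc/mounts; on a real client:
#   grep ' nfs ' /proc/mounts
sample='filer1:/vol/home /home nfs rw,v3,tcp,hard,nointr,rsize=8192,wsize=8192,timeo=600,retrans=2 0 0'

# Field 4 is the option list in effect; split it and pick out the
# timeout knobs to confirm the automount map options actually took:
echo "$sample" | awk '{print $4}' | tr ',' '\n' | grep -E '^(timeo|retrans)='
```

If the values printed differ from what the map requested, at least you
know the client never got the options you thought it had.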
Good luck.
Miles O'Neal wrote:
> Ever since I've gotten here (RH 7.1 days)
> we've had NFS issues-- failures to mount,
> failures to unmount, etc.
>
> We use NIS to distribute group, passwd,
> netgroup and automount files, and automount
> almost everything. Tier one storage is on
> NetApp filers, tier 2 is a variety of rackmount
> PCs using RAID 5. These run a variety of
> Linux OSes, including RH7.1, RH9, SL304 and
> SL4.0.
>
> Clients are running 304 with and without the
> SDR kernel, 305 and 307. We've tried the stock
> 304 nfs-utils and the previous rev. All client
> desktops and compute servers see occasional NFS
> problems. The infrastructure boxes don't seem
> to have these problems, but they are lightly
> loaded.
>
> For a while we saw steady improvements. Then,
> based on a paper all over the web on using
> Linux with NetApp, I modified the following
> NFS mount options:
>
> OPTION OLD NEW
> timeo 7 600
> retrans 3 2
>
> Things got much, much worse. We had many,
> many more failures to mount, and more whines
> about unmount problems and locks.
>
> I then changed these to
>
> timeo=10,retrans=5
>
> and at the same time bumped up the automounter
> timeout to unmount from 60 seconds to 300 seconds.
> Things got better, but we still have mount failures.
> Some of these have rather severe impacts on users.
> (Failures in portions of distributed jobs can be
> expensive.)
>
> We have changed from having all filers multihomed
> with links to each subnet to all filers on their
> own subnet, with one legacy link from one NetApp
> to one subnet until some ancient processes can be
> revamped and restarted.
>
> Any insights, recommendations, and/or experiences
> would be appreciated. (We *may* be able to move
> to SL4 on client systems, but it's not yet clear
> whether we would lose application vendor support.
> But if that proved helpful for others, I would
> like to know that as well.)
>
> Thanks,
> Miles
--
John Haggerty
email: [log in to unmask]
voice/fax: 631 344 2286/4592
http://www.phenix.bnl.gov/~haggerty