LISTSERV - SCIENTIFIC-LINUX-USERS Archives

November 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	Re: perpetual NFS problems
From:	Jos van Wezel <[log in to unmask]>
Reply To:	[log in to unmask]
Date:	Thu, 9 Nov 2006 09:53:59 +0100

Miles,

at FZK we run a cluster of 1000 machines with SL 3.05
clients and 20 RH 4.2 servers with:

transport: tcp
timeo: 600
retrans: 2
nfsd: 250
autofs timeout: 1800

and are pretty happy with it. On average there are
4 to 5 mounts on a client.
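
For reference, timeo is given in tenths of a second, so timeo=600
means 60 seconds before the first retransmission. The rough sketch
below adds up the waits before a major timeout. It assumes the
doubling backoff with a 60 second cap that older nfs(5) man pages
describe, and it assumes that the initial wait plus "retrans"
retransmissions are counted. The function name is made up for
illustration, and actual behaviour depends on transport and kernel
version.

```python
def seconds_until_major_timeout(timeo, retrans, cap=60.0):
    """Sum the waits before a major timeout, assuming each wait doubles.

    timeo is in tenths of a second, as in the mount option. The first
    wait is timeo/10 seconds and every retransmission doubles it, up
    to `cap` seconds. Assumption: one initial wait plus `retrans`
    retransmissions are counted before the major timeout.
    """
    wait = min(timeo / 10.0, cap)
    total = 0.0
    for _ in range(retrans + 1):
        total += wait
        wait = min(wait * 2, cap)
    return total


# The option pairs mentioned in this thread.
for timeo, retrans in [(7, 3), (10, 5), (600, 2)]:
    total = seconds_until_major_timeout(timeo, retrans)
    print(f"timeo={timeo} retrans={retrans}: about {total:.1f} s")
```

Under these assumptions a small timeo reacts to a lost request within
seconds, while timeo=600 waits a full minute before the first retry.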

Are you losing packets on the server side? Is the re-assembly counter
increasing? (netstat -s).
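
One way to watch this is to take two "netstat -s" snapshots some
minutes apart and compare the reassembly lines. The sketch below does
that on captured text. The line wording is what net-tools netstat
prints on the Linux systems I know and may differ elsewhere; the
helper names and the sample numbers are invented for illustration.

```python
import re

# Wording of the IP reassembly lines in `netstat -s` (assumed; check
# the output on your own system, it can vary between versions).
REASM_PATTERNS = {
    "required": re.compile(r"(\d+) reassemblies required"),
    "ok": re.compile(r"(\d+) packets reassembled ok"),
    "failed": re.compile(r"(\d+) packet reassembles failed"),
    "dropped_after_timeout": re.compile(r"(\d+) fragments dropped after timeout"),
}


def reassembly_counters(netstat_output):
    """Extract the reassembly counters found in `netstat -s` text."""
    counters = {}
    for name, pattern in REASM_PATTERNS.items():
        match = pattern.search(netstat_output)
        if match:
            counters[name] = int(match.group(1))
    return counters


def counter_growth(before, after):
    """Return after-minus-before for counters present in both snapshots."""
    return {name: after[name] - before[name] for name in after if name in before}


# Invented sample snapshots, not real measurements.
SAMPLE_BEFORE = """Ip:
    1200 reassemblies required
    580 packets reassembled ok
    20 packet reassembles failed
"""
SAMPLE_AFTER = """Ip:
    1500 reassemblies required
    700 packets reassembled ok
    50 packet reassembles failed
"""

growth = counter_growth(reassembly_counters(SAMPLE_BEFORE),
                        reassembly_counters(SAMPLE_AFTER))
print(growth)
```

A steadily growing "failed" count on the server would point at lost
fragments. On a live machine the text could come from
subprocess.run(["netstat", "-s"], capture_output=True, text=True).stdout.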

J

Miles O'Neal wrote:
> Ever since I've gotten here (RH 7.1 days)
> we've had NFS issues-- failures to mount,
> failures to unmount, etc.
> 
> We use NIS to distribute group, passwd,
> netgroup and automount files, and automount
> almost everything.  Tier one storage is on
> NetApp filers, tier 2 is a variety of rackmount
> PCs using RAID 5.  These run a variety of
> Linux OSes, including RH7.1, RH9, SL304 and
> SL40.
> 
> Clients are running 304 with and without the
> SDR kernel, 305 and 307.  We've tried the stock
> 304 nfs-utils and the previous rev.  All client
> desktops and compute servers see occasional NFS
> problems.  The infrastructure boxes don't seem
> to have these problems, but they are lightly
> loaded.
> 
> For a while we saw steady improvements.  Then,
> based on a paper all over the web on using
> Linux with NetApp, I modified the following
> NFS mount options:
> 
>    OPTION	OLD	NEW
>    timeo	7	600
>    retrans	3	2
> 
> Things got much, much worse.  We had many,
> many more failures to mount, and more whines
> about unmount problems and locks.
> 
> I then changed these to
> 
>    timeo=10,retrans=5
> 
> and at the same time bumped up the automounter
> timeout to unmount from 60 seconds to 300 seconds.
> Things got better, but we still have mount failures.
> Some of these have rather severe impacts on users.
> (Failures in portions of distributed jobs can be
> expensive.)
> 
> We have changed from having all filers multihomed
> with links to each subnet to all filers on their
> own subnet, with one legacy link from one NetApp
> to one subnet until some ancient processes can be
> revamped and restarted.
> 
> Any insights, recommendations, and/or experiences
> would be appreciated.  (We *may* be able to move
> to SL4 on client systems, but it's not yet clear
> whether we would lose application vendor support.
> But if that proved helpful for others, I would
> like to know that as well.)
> 
> Thanks,
> Miles
