SCIENTIFIC-LINUX-USERS Archives

November 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

From: Jos van Wezel
Date: Thu, 9 Nov 2006 09:53:59 +0100

Miles,

at FZK we run a cluster of 1000 machines with SL 3.05
clients and 20 RH 4.2 servers with:

transport: tcp
timeo: 600
retrans: 2
nfsd: 250
autofs timeout: 1800

and are pretty happy with it. On average there are
4 to 5 mounts on a client.
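
For reference, here is a rough sketch of what timeo and retrans values mean in wall-clock terms. It assumes the behaviour described in the current Linux nfs(5) man page: timeo is in tenths of a second, TCP backs off linearly up to 600 seconds, and UDP doubles the timeout up to 60 seconds (ignoring the adaptive estimator UDP uses for frequent request types). It also assumes the initial attempt plus each of the retrans retries waits out one timeout. Older 2.4 kernels may behave differently, and retry_schedule is a hypothetical helper name, not part of any NFS tool.

```python
def retry_schedule(timeo, retrans, transport="tcp"):
    """Return the per-attempt timeouts in seconds for one retry cycle.

    timeo is in deciseconds, as in the NFS mount option. The backoff
    rules are assumptions taken from the current nfs(5) man page.
    """
    if transport not in ("tcp", "udp"):
        raise ValueError("transport must be 'tcp' or 'udp'")
    # Assumed caps, in deciseconds: 600 s for TCP, 60 s for UDP.
    cap = 6000 if transport == "tcp" else 600
    waits = []
    wait = timeo
    for _ in range(retrans + 1):  # initial attempt plus retrans retries
        waits.append(min(wait, cap) / 10)
        # TCP: linear backoff; UDP: timeout doubles after each retry.
        wait = wait + timeo if transport == "tcp" else wait * 2
    return waits


# Settings mentioned in this thread; the original post does not say
# which transport was in use, so both are shown for comparison.
for timeo, retrans in [(600, 2), (7, 3), (10, 5)]:
    for transport in ("tcp", "udp"):
        waits = retry_schedule(timeo, retrans, transport)
        print(timeo, retrans, transport, waits, "total:", sum(waits))
```

Under these assumptions the same timeo value gives very different totals depending on the transport, so it is worth checking which one the clients actually negotiated before comparing settings.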

Are you losing packets on the server side? Is the reassembly counter
increasing (check with netstat -s)?
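
If you would rather watch the counters than eyeball netstat output, a minimal sketch follows. It assumes a Linux host where /proc/net/snmp carries an "Ip:" header line followed by an "Ip:" value line, including the ReasmReqds, ReasmOKs and ReasmFails fields. The function names are hypothetical.

```python
def parse_ip_counters(snmp_text):
    """Parse the 'Ip:' header/value line pair from /proc/net/snmp text."""
    ip_lines = [line.split()[1:] for line in snmp_text.splitlines()
                if line.startswith("Ip:")]
    if len(ip_lines) < 2:
        raise ValueError("no 'Ip:' header/value pair found")
    names, values = ip_lines[0], ip_lines[1]
    return dict(zip(names, (int(v) for v in values)))


def reassembly_delta(before, after):
    """Return how much each reassembly counter grew between two samples."""
    keys = ("ReasmReqds", "ReasmOKs", "ReasmFails")
    return {key: after[key] - before[key] for key in keys}


def read_counters(path="/proc/net/snmp"):
    """Read and parse the counters from a Linux host (path is an assumption)."""
    with open(path) as handle:
        return parse_ip_counters(handle.read())
```

Call read_counters() twice, a minute or so apart under load, and pass both results to reassembly_delta(). A growing ReasmFails on the server suggests fragments are being dropped, which mostly matters for NFS over UDP, where large reads and writes are fragmented at the IP layer.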

J

Miles O'Neal wrote:
> Ever since I've gotten here (RH 7.1 days)
> we've had NFS issues-- failures to mount,
> failures to unmount, etc.
> 
> We use NIS to distribute group, passwd,
> netgroup and automount files, and automount
> almost everything.  Tier one storage is on
> NetApp filers, tier 2 is a variety of rackmount
> PCs using RAID 5.  These run a variety of
> Linux OSes, including RH7.1, RH9, SL304 and
> SL40.
> 
> Clients are running 304 with and without the
> SDR kernel, 305 and 307.  We've tried the stock
> 304 nfs-utils and the previous rev.  All client
> desktops and compute servers see occasional NFS
> problems.  The infrastructure boxes don't seem
> to have these problems, but they are lightly
> loaded.
> 
> For a while we saw steady improvements.  Then,
> based on a paper all over the web on using
> Linux with NetApp, I modified the following
> NFS mount options:
> 
>    OPTION	OLD	NEW
>    timeo	7	600
>    retrans	3	2
> 
> Things got much, much worse.  We had many,
> many more failures to mount, and more whines
> about unmount problems and locks.
> 
> I then changed these to
> 
>    timeo=10,retrans=5
> 
> and at the same time bumped up the automounter
> timeout to unmount from 60 seconds to 300 seconds.
> Things got better, but we still have mount failures.
> Some of these have rather severe impacts on users.
> (Failures in portions of distributed jobs can be
> expensive.)
> 
> We have changed from having all filers multihomed
> with links to each subnet to all filers on their
> own subnet, with one legacy link from one NetApp
> to one subnet until some ancient processes can be
> revamped and restarted.
> 
> Any insights, recommendations, and/or experiences
> would be appreciated.  (We *may* be able to move
> to SL4 on client systems, but it's not yet clear
> whether we would lose application vendor support.
> But if that proved helpful for others, I would
> like to know that as well.)
> 
> Thanks,
> Miles
