Miles,
at FZK we run a cluster of 1000 machines with SL 3.05
clients and 20 RH 4.2 servers with:
transport: tcp
timeo: 600
retrans: 2
nfsd: 250
autofs timeout: 1800
and are pretty happy with it. On average there are
4 to 5 mounts on a client.
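
For comparing the numbers in this thread: timeo is given in tenths of a
second (see nfs(5)), so 600 means 60 seconds while 7 and 10 mean 0.7 and
1 second. A small sketch of the conversion follows; the labels and the
helper name timeo_seconds are only illustrative, and how retrans
interacts with the timeout depends on transport and kernel version, so
it is listed here without being modelled.

```python
# timeo is expressed in tenths of a second (deciseconds), per nfs(5).
# The labels below are illustrative names for the settings quoted in this thread.
SETTINGS = {
    "original (timeo=7)": {"timeo": 7, "retrans": 3},
    "after NetApp paper": {"timeo": 600, "retrans": 2},
    "current (timeo=10)": {"timeo": 10, "retrans": 5},
}


def timeo_seconds(timeo):
    """Convert an NFS timeo mount option value (deciseconds) to seconds."""
    return timeo / 10


for label, opts in SETTINGS.items():
    secs = timeo_seconds(opts["timeo"])
    print(f"{label:<20} timeo={opts['timeo']:>3} = {secs:5.1f} s, retrans={opts['retrans']}")
```
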
Are you losing packets on the server side? Is the IP reassembly counter
increasing (see netstat -s)?
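
One way to watch this over time is to take two snapshots of the Ip
counters in /proc/net/snmp (the same numbers netstat -s reports) and
compare them. The following is only a sketch: the function names are
made up for illustration, and the sample snapshots are invented values
with a shortened column list, not measurements from any real server.

```python
def parse_snmp(text, proto="Ip"):
    """Return {counter_name: int} for one protocol block of /proc/net/snmp.

    Each protocol appears as two lines with the same prefix: a header line
    with the counter names, then a line with the values.
    """
    prefix = proto + ":"
    rows = [line.split()[1:] for line in text.splitlines() if line.startswith(prefix)]
    if len(rows) < 2:
        raise ValueError(f"no header/value pair found for {proto!r}")
    names, values = rows[0], rows[1]
    return dict(zip(names, (int(v) for v in values)))


def reassembly_delta(before, after):
    """Difference of the IP reassembly counters between two snapshots."""
    keys = ("ReasmReqds", "ReasmOKs", "ReasmFails")
    return {k: after[k] - before[k] for k in keys}


# Invented sample snapshots (real files have more columns).
SAMPLE_BEFORE = (
    "Ip: Forwarding DefaultTTL ReasmReqds ReasmOKs ReasmFails\n"
    "Ip: 2 64 1000 500 3\n"
)
SAMPLE_AFTER = (
    "Ip: Forwarding DefaultTTL ReasmReqds ReasmOKs ReasmFails\n"
    "Ip: 2 64 1400 690 15\n"
)

delta = reassembly_delta(parse_snmp(SAMPLE_BEFORE), parse_snmp(SAMPLE_AFTER))
print(delta)

# On a live Linux server, read the real file instead of the samples:
#   with open("/proc/net/snmp") as f:
#       counters = parse_snmp(f.read())
```

A ReasmFails value that keeps growing between snapshots would suggest
fragments are being dropped before they can be reassembled.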
J
Miles O'Neal wrote:
> Ever since I've gotten here (RH 7.1 days)
> we've had NFS issues-- failures to mount,
> failures to unmount, etc.
>
> We use NIS to distribute group, passwd,
> netgroup and automount files, and automount
> almost everything. Tier one storage is on
> NetApp filers, tier 2 is a variety of rackmount
> PCs using RAID 5. These run a variety of
> Linux OSes, including RH7.1, RH9, SL304 and
> SL40.
>
> Clients are running 304 with and without the
> SDR kernel, 305 and 307. We've tried the stock
> 304 nfs-utils and the previous rev. All client
> desktops and compute servers see occasional NFS
> problems. The infrastructure boxes don't seem
> to have these problems, but they are lightly
> loaded.
>
> For a while we saw steady improvements. Then,
> based on a paper all over the web on using
> Linux with NetApp, I modified the following
> NFS mount options:
>
> OPTION   OLD  NEW
> timeo      7  600
> retrans    3    2
>
> Things got much, much worse. We had many,
> many more failures to mount, and more whines
> about unmount problems and locks.
>
> I then changed these to
>
> timeo=10,retrans=5
>
> and at the same time bumped up the automounter
> timeout to unmount from 60 seconds to 300 seconds.
> Things got better, but we still have mount failures.
> Some of these have rather severe impacts on users.
> (Failures in portions of distributed jobs can be
> expensive.)
>
> We have changed from having all filers multihomed
> with links to each subnet to all filers on their
> own subnet, with one legacy link from one NetApp
> to one subnet until some ancient processes can be
> revamped and restarted.
>
> Any insights, recommendations, and/or experiences
> would be appreciated. (We *may* be able to move
> to SL4 on client systems, but it's not yet clear
> whether we would lose application vendor support.
> But if that proved helpful for others, I would
> like to know that as well.)
>
> Thanks,
> Miles