Jos van Wezel said...

|at FZK we run a cluster of 1000 machines with SL 3.05
|clients and 20 RH 4.2 servers with:
|
|transport: tcp
|timeo: 600
|retrans: 2
|nfsd: 250
|autofs timeout: 1800
|
|and are pretty happy with it. On average there are
|4 to 5 mounts on a client.
|
|Are you losing packets on the server side? Is the re-assembly counter
|increasing? (netstat -s).

Not that I can tell.  The NetApp shows a variety of
errors, but the counts are all low, and much lower
than the number of problems seen.
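
For the Linux clients, one way to answer the re-assembly question is
to compare two snapshots of the IP counters that netstat -s reports.
The sketch below assumes the counters are read from /proc/net/snmp
(fields ReasmReqds, ReasmOKs, ReasmFails); the function names are
made up for illustration, and none of this applies to the NetApp
itself.

```python
# Sketch only: helper names are hypothetical, not from any existing tool.
REASM_FIELDS = ("ReasmReqds", "ReasmOKs", "ReasmFails")


def parse_ip_reasm(snmp_text):
    """Extract IP reassembly counters from /proc/net/snmp-style text.

    The file holds pairs of lines per protocol: a header line with the
    field names, then a line with the values, both prefixed by 'Ip:'.
    """
    ip_lines = [line.split() for line in snmp_text.splitlines()
                if line.startswith("Ip:")]
    if len(ip_lines) < 2:
        raise ValueError("no 'Ip:' header/value pair found")
    header, values = ip_lines[0][1:], ip_lines[1][1:]
    counters = dict(zip(header, values))
    return {name: int(counters[name])
            for name in REASM_FIELDS if name in counters}


def reasm_delta(before, after):
    """Return how much each counter grew between two snapshots."""
    return {name: after[name] - before[name]
            for name in before if name in after}


def read_snmp(path="/proc/net/snmp"):
    """Read the kernel's SNMP counter file (Linux only)."""
    with open(path) as f:
        return f.read()
```

Taking parse_ip_reasm(read_snmp()) twice, a minute or so apart, and
passing both results to reasm_delta() shows whether ReasmFails is
climbing while the NFS problems occur.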

I'm probably going to raise the automounter timeout
again.  It won't fix the problem, but going from
1 to 5 minutes seems to have reduced the frequency
of occurrence, so the users are happier (if not
happy).

We have seen some RPC whining on a few of the clients
but again, very few, with no apparent correlation
to the NFS failures.

I'm starting to miss Solaris. 8^/