Ever since I got here (back in the RH 7.1 days)
we've had NFS issues: failures to mount,
failures to unmount, etc.
We use NIS to distribute group, passwd,
netgroup and automount files, and automount
almost everything. Tier one storage is on
NetApp filers; tier two is a variety of rackmount
PCs using RAID 5. These run a variety of
Linux OSes, including RH7.1, RH9, SL304 and
SL40.
Clients are running 304 with and without the
SDR kernel, 305 and 307. We've tried the stock
304 nfs-utils and the previous rev. All client
desktops and compute servers see occasional NFS
problems. The infrastructure boxes don't seem
to have these problems, but they are lightly
loaded.
For a while we saw steady improvements. Then,
based on a widely circulated paper on using
Linux with NetApp, I changed the following
NFS mount options:
OPTION    OLD   NEW
timeo     7     600
retrans   3     2
Things got much, much worse. We had many,
many more failures to mount, and more whines
about unmount problems and locks.
I then changed these to
timeo=10,retrans=5
and at the same time raised the automounter's
unmount timeout from 60 seconds to 300 seconds.
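
For anyone comparing the three settings above: timeo is given in
tenths of a second, so the differences are larger than the raw numbers
suggest. The Python sketch below estimates how long a client waits
before declaring a major timeout. It assumes NFS over UDP with the
backoff described in nfs(5), where each wait doubles after a
retransmission and is capped at 60 seconds, and it assumes one original
send followed by `retrans` retransmissions. Those are assumptions, not
something measured on these clients; the kernels in use here may behave
differently, and NFS over TCP backs off differently. The function name
and labels are made up for illustration.

```python
# Rough model of NFS-over-UDP retry timing (assumed, per nfs(5)):
# waits double after each retransmission, capped at 60 seconds.
MAX_TIMEOUT_DS = 600  # assumed cap on a single wait, in tenths of a second


def time_to_major_timeout_ds(timeo, retrans):
    """Total wait in tenths of a second before a major timeout."""
    total = 0
    wait = timeo
    # Assumption: one original send plus `retrans` retransmissions.
    for _ in range(retrans + 1):
        total += min(wait, MAX_TIMEOUT_DS)
        wait *= 2
    return total


settings = [("old", 7, 3), ("paper", 600, 2), ("current", 10, 5)]
for label, timeo, retrans in settings:
    ds = time_to_major_timeout_ds(timeo, retrans)
    print(f"{label}: timeo={timeo},retrans={retrans} -> {ds / 10:.1f} s")
```

Under this model the old settings give up after roughly ten seconds,
the paper's settings after about three minutes, and the current ones
after about a minute.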
Things got better, but we still have mount failures.
Some of these have rather severe impacts on users.
(Failures in portions of distributed jobs can be
expensive.)
We have changed from having all filers multihomed
with links to each subnet to all filers on their
own subnet, with one legacy link from one NetApp
to one subnet until some ancient processes can be
revamped and restarted.
Any insights, recommendations, and/or experiences
would be appreciated. (We *may* be able to move
to SL4 on client systems, but it's not yet clear
whether we would lose application vendor support.
But if that proved helpful for others, I would
like to know that as well.)
Thanks,
Miles