I have devoted more hours than I can count to figuring out NFS problems.
I'm not sure whether I have actually solved them or just outlived
them, but at the moment, I'm almost afraid to say it, things are
actually working pretty well.
I have a similar situation: a couple hundred machines mounting files
from a couple of fileservers, though ours are either homemade or from
a much lower-end vendor than NetApp (to be fair, that vendor has given
me excellent support when I call about problems that are clearly
server-related). Our configurations are all distributed via NIS too,
and we use the automounter, although this rather terrifying FAQ from
the upstream vendor about automount led us to make sure we had
autofs-4.1.3-130.i386 on all of our machines:
http://kbase.redhat.com/faq/FAQ_79_5925.shtm
About a year ago, I had terrible problems:
http://listserv.fnal.gov/scripts/wa.exe?A2=ind0512&L=scientific-linux-users&T=0&P=18323
(and I see you did too), but we upgraded everything (ok, almost
everything) to SL 3.0.5 with kernel 2.4.21-32.0.1.ELsmp.
The mount options we went to last summer are:
rw,nfsvers=3,tcp,hard,nointr,wsize=8192,rsize=8192,timeo=600,retrans=2
and the smaller wsize and rsize (they used to be 32768) seemed to
help, with very little loss in performance (as measured by Bonnie++).
Those are basically what is recommended in that NetApp application
note, which is an excellent source of information. Changing the
transport to tcp really did seem to make the mounts far more robust,
as you might expect; sometimes it feels like there is a momentary
hang, but it generally recovers in a few seconds.
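One thing that bit me while tuning these: timeo is expressed in tenths
of a second (per nfs(5)), which is easy to misread. A quick sketch of
what the values in our options actually mean, as I understand them
(retrans is how many retries happen before the kernel logs "server not
responding"):

```shell
# timeo is in deciseconds (tenths of a second) -- see nfs(5).
timeo=600      # the value in our mount options
retrans=2      # retries before "server not responding" is logged

echo "initial timeout: $((timeo / 10)) s"
echo "retries before 'server not responding': $retrans"
```

So timeo=600 is a full 60-second timeout, whereas the old default of
timeo=7 was only 0.7 seconds; over tcp the long timeout makes sense,
since the transport handles its own retransmission.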
The real problem is that the diagnostic tools available for debugging
NFS seem very limited, at least the ones I could find; it's often
difficult even to tell whether a problem is on the client side or the
server side.
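One check that is always available, though, is comparing the options
you asked for against what the kernel actually negotiated, via
/proc/mounts. A minimal sketch using a hypothetical mount line (the
filer name and export are made up; a real client would grep
/proc/mounts itself):

```shell
# Hypothetical line in the format of /proc/mounts; on a real client:
#   grep ' nfs ' /proc/mounts
sample='filer1:/vol/home /home nfs rw,v3,tcp,hard,nointr,rsize=8192,wsize=8192,timeo=600,retrans=2 0 0'

# Field 4 is the option list in effect; split it and pick out the
# timeout knobs to confirm the automount map options actually took:
echo "$sample" | awk '{print $4}' | tr ',' '\n' | grep -E '^(timeo|retrans)='
```

If the values printed differ from what the map requested, at least you
know the client never got the options you thought it had.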
Good luck.
Miles O'Neal wrote:
> Ever since I've gotten here (RH 7.1 days)
> we've had NFS issues-- failures to mount,
> failures to unmount, etc.
>
> We use NIS to distribute group, passwd,
> netgroup and automount files, and automount
> almost everything. Tier one storage is on
> NetApp filers, tier 2 is a variety of rackmount
> PCs using RAID 5. These run a variety of
> Linux OSes, including RH7.1, RH9, SL304 and
> SL4.0.
>
> Clients are running 304 with and without the
> SDR kernel, 305 and 307. We've tried the stock
> 304 nfs-utils and the previous rev. All client
> desktops and compute servers see occasional NFS
> problems. The infrastructure boxes don't seem
> to have these problems, but they are lightly
> loaded.
>
> For a while we saw steady improvements. Then,
> based on a paper all over the web on using
> Linux with NetApp, I modified the following
> NFS mount options:
>
> OPTION OLD NEW
> timeo 7 600
> retrans 3 2
>
> Things got much, much worse. We had many,
> many more failures to mount, and more whines
> about unmount problems and locks.
>
> I then changed these to
>
> timeo=10,retrans=5
>
> and at the same time bumped up the automounter
> timeout to unmount from 60 seconds to 300 seconds.
> Things got better, but we still have mount failures.
> Some of these have rather severe impacts on users.
> (Failures in portions of distributed jobs can be
> expensive.)
>
> We have changed from having all filers multihomed
> with links to each subnet to all filers on their
> own subnet, with one legacy link from one NetApp
> to one subnet until some ancient processes can be
> revamped and restarted.
>
> Any insights, recommendations, and/or experiences
> would be appreciated. (We *may* be able to move
> to SL4 on client systems, but it's not yet clear
> whether we would lose application vendor support.
> But if that proved helpful for others, I would
> like to know that as well.)
>
> Thanks,
> Miles
--
John Haggerty
email: [log in to unmask]
voice/fax: 631 344 2286/4592
http://www.phenix.bnl.gov/~haggerty