LISTSERV - SCIENTIFIC-LINUX-USERS Archives

November 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	perpetual NFS problems
From:	Miles O'Neal
Reply To:	Miles O'Neal
Date:	Wed, 8 Nov 2006 14:03:27 -0600
Content-Type:	text/plain
Parts/Attachments:	text/plain (60 lines)

Ever since I got here (RH 7.1 days)
we've had NFS issues: failures to mount,
failures to unmount, etc.

We use NIS to distribute group, passwd,
netgroup and automount files, and automount
almost everything.  Tier 1 storage is on
NetApp filers; tier 2 is a variety of rackmount
PCs using RAID 5.  These run a variety of
Linux OSes, including RH7.1, RH9, SL304 and
SL40.

Clients are running 304 with and without the
SDR kernel, 305 and 307.  We've tried the stock
304 nfs-utils and the previous rev.  All client
desktops and compute servers see occasional NFS
problems.  The infrastructure boxes don't seem
to have these problems, but they are lightly
loaded.

For a while we saw steady improvements.  Then,
based on a paper all over the web on using
Linux with NetApp, I modified the following
NFS mount options:

   OPTION	OLD	NEW
   timeo	7	600
   retrans	3	2

Things got much, much worse.  We had many,
many more failures to mount, and more whines
about unmount problems and locks.
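
For reference, a rough sketch of what the OLD and NEW
values above mean in wall-clock terms.  It assumes the
behavior the nfs(5) man page describes for NFS over UDP:
timeo is in tenths of a second, and the timeout doubles
after each retransmission up to a 60-second cap.  It also
assumes a major timeout follows after `retrans` such
intervals; the exact count depends on the kernel version,
and TCP mounts behave differently.  The function name is
made up for illustration.

```python
# Sketch only: models the UDP retransmission backoff described in nfs(5).
# Assumptions: timeo is in deciseconds, each retry doubles the wait,
# and waits are capped at 60 seconds (600 deciseconds).

MAX_TIMEOUT_DS = 600  # 60-second cap, in tenths of a second


def backoff_schedule_ds(timeo, retrans):
    """Return the wait before each of `retrans` retransmissions, in deciseconds."""
    waits = []
    wait = min(timeo, MAX_TIMEOUT_DS)
    for _ in range(retrans):
        waits.append(wait)
        # Double the wait for the next attempt, but never exceed the cap.
        wait = min(wait * 2, MAX_TIMEOUT_DS)
    return waits


if __name__ == "__main__":
    for label, timeo, retrans in [("old", 7, 3), ("new", 600, 2)]:
        waits = backoff_schedule_ds(timeo, retrans)
        total_s = sum(waits) / 10
        print(f"{label}: timeo={timeo},retrans={retrans} "
              f"waits(ds)={waits} total={total_s:.1f}s")
```

Under these assumptions the old settings reach a major
timeout after roughly 5 seconds, while the new ones wait
about two minutes, so a client blocked on an unresponsive
server would hang far longer before reporting anything.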

I then changed these to

   timeo=10,retrans=5

and at the same time bumped up the automounter
timeout to unmount from 60 seconds to 300 seconds.
Things got better, but we still have mount failures.
Some of these have rather severe impacts on users.
(Failures in portions of distributed jobs can be
expensive.)
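
Since the options come from NIS automount maps, it can be
worth confirming what a given client actually mounted
with.  A minimal sketch that reads /proc/mounts and pulls
out timeo and retrans for NFS mounts.  It assumes the usual
six-field line format; whether every option is listed
there depends on the kernel, and the function name is
hypothetical.

```python
# Sketch only: list timeo/retrans for NFS mounts from /proc/mounts-style text.
# Assumes the usual six whitespace-separated fields per line:
# device, mount point, fs type, options, dump, pass.
import os


def nfs_mount_options(mounts_text):
    """Map each NFS mount point to its timeo/retrans values (None if absent)."""
    result = {}
    for line in mounts_text.splitlines():
        fields = line.split()
        # Skip malformed lines and anything that is not an NFS client mount.
        if len(fields) < 4 or fields[2] not in ("nfs", "nfs4"):
            continue
        opts = {}
        for item in fields[3].split(","):
            key, _, value = item.partition("=")
            opts[key] = value
        result[fields[1]] = {
            "timeo": opts.get("timeo"),
            "retrans": opts.get("retrans"),
        }
    return result


if __name__ == "__main__":
    if os.path.exists("/proc/mounts"):
        with open("/proc/mounts") as f:
            found = nfs_mount_options(f.read())
        for mount_point, opts in sorted(found.items()):
            print(mount_point, opts)
```

Run on a few desktops and compute servers, this would show
whether a map change has reached every mount or whether
older mounts are still using the previous values.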

We have changed from having all filers multihomed
with links to each subnet to all filers on their
own subnet, with one legacy link from one NetApp
to one subnet until some ancient processes can be
revamped and restarted.

Any insights, recommendations, and/or experiences
would be appreciated.  (We *may* be able to move
to SL4 on client systems, but it's not yet clear
whether we would lose application vendor support.
But if that proved helpful for others, I would
like to know that as well.)

Thanks,
Miles
