SCIENTIFIC-LINUX-USERS Archives

October 2008

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Miles O'Neal <[log in to unmask]>
Reply-To: Miles O'Neal <[log in to unmask]>
Date: Fri, 31 Oct 2008 18:47:40 -0500
Content-Type: text/plain
We have a Scientific Linux 5.2 system we're using
as a storage server/filer running nfsd.  Hundreds
of client nodes can hit it at once; the clients
are configured with autofs rather than permanent
mounts (a legacy of the early days).
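
For concreteness, a client map entry looks roughly
like this (the names here are made up, not our real
exports; "filer" stands in for the 5.2 server above):

   # /etc/auto.master on a client
   /data   /etc/auto.data

   # /etc/auto.data
   scratch  -rw,tcp,nfsvers=3  filer:/export/scratch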

We use NFSv3 over TCP.  Originally we configured
the server with 100 nfsd daemons.  Very quickly
jobs started failing on the clients, and the
server log filled with messages like:

   kernel: lockd: too many open TCP sockets, consider increasing the number of nfsd threads

So we bumped it to 300, rebooting since we had a
new kernel to run anyway.  That worked great for
a few days, then the failures started again.
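
In case it matters, this is how we're changing the
count (assuming the stock SL 5.2 init scripts; let
me know if there's a better knob):

   # /etc/sysconfig/nfs, read by the nfs init script
   RPCNFSDCOUNT=300

   # or on a running server, without a full restart:
   rpc.nfsd 300                # set the thread count
   cat /proc/fs/nfsd/threads   # confirm it took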

I bumped it up to 500 daemons and tried to
restart nfsd.  nfsd refused to start, saying
the port was busy.  I couldn't find anything
that I'd expect to be using that port (the
checks I ran are sketched below).  I finally
rebooted.  No nfs.  In the message log we now
had:

   kernel: nfsd: Could not allocate memory read-ahead cache.
   nfsd[6413]: nfssvc: Cannot allocate memory

[We have 8GB of RAM on the system, and at boot
time with 300 nfsd we don't even come close
to using 8GB.]
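
(The checks I ran for the port were along these
lines, 2049 being the standard nfs port; nothing
obvious turned up:)

   netstat -tlnp | grep 2049   # who is listening on the nfs port?
   fuser -v -n tcp 2049        # which processes hold it open?
   rpcinfo -p localhost        # what the portmapper has registered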

I backed it down to 300 and had to reboot again,
as nfs still would not start.  It came up fine,
but we still see those pesky client failures.

It gets more interesting.  Or bizarre.

% cat /proc/net/rpc/nfsd
rc 13537 33496396 192754161
fh 28 0 0 0 0
io 3943998555 1199297042
th 300 0 1188.353 239.850 65.863 16.361 1.857 0.000 0.000 0.000 0.000 0.000
ra 600 1328847 18752 16893 13305 9929 6954 5154 4301 3170 2710 0
net 226265416 0 226264783 70942
rpc 226260856 0 0 0 0
proc2 18 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
proc3 22 4477 96527885 5693324 38850837 37663631 12004 3694379 10771245 7160052 1719510 42932 0 3152360 971863 3965505 33110 159 4197685 14857 4550 0 8837637
proc4 2 0 0
proc4ops 40 2284365 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

As I understand things, the "th" line here says that
we have never come close to using all the nfs daemons
at one time!
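
My reading of the th format, from the nfsd docs I
could find (so correct me if I'm off): the first
field is the thread count, the second is how many
times all threads were in use at once, and the ten
after that are seconds spent with 0-10%, 10-20%,
... 90-100% of the threads busy.  Pulling out the
two interesting numbers:

% awk '/^th/ {printf "threads=%s  all-busy=%s times\n", $2, $3}' /proc/net/rpc/nfsd
threads=300  all-busy=0 times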

So we have two (possible) problems.

1) Are the stats wrong, or is the problem not really
   in the number of threads?  This is a fast, dual
   quad-core SuperMicro server, so I'm not worried
   about its ability to handle the load; we have much
   slower systems handling 100 threads without a
   hiccup (though the nature of the projects means
   this newer system will get a lot more traffic).

   The NIC doesn't seem to be swamped.

   Is there a kernel param I need to tweak for
   more open sockets or something?  (The comparison
   I've been making is sketched below, after 2.)

2) If I do need more daemons, how do I determine
   how much memory I need?  What is the limit on
   the number of daemons?
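
What I've been comparing for 1) is the number of
NFS TCP sockets against the thread count (2049
again being the standard nfs port; adjust if yours
differs):

   netstat -tan | grep -c ':2049 '   # TCP sockets touching the nfs port
   cat /proc/fs/nfsd/threads         # threads nfsd thinks it is running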

Thanks,
Miles
