SCIENTIFIC-LINUX-DEVEL Archives

November 2005

SCIENTIFIC-LINUX-DEVEL@LISTSERV.FNAL.GOV

Subject:
From:     Steven Timm <[log in to unmask]>
Reply-To: Steven Timm <[log in to unmask]>
Date:     Mon, 7 Nov 2005 10:20:24 -0600

Dear Scientific Linux Community--
I have been seeing some problems with NFS over TCP with the
current set of kernels and wonder if anyone else has seen the following
symptoms.  Some have suggested either increasing the number of nfsd's to 300
or more, or abandoning TCP altogether and reverting to UDP; we are reluctant
to revert to UDP because of network problems.

Full configuration is below.  Any help is appreciated.

Steven Timm


Node "fnpcsrv1" is new NFS server,
root@fnpcsrv1 lsi_home]# uname -a
Linux fnpcsrv1.fnal.gov 2.4.21-32.0.1.EL.XFSsmp #1 SMP Wed Jun 8 18:35:19 
CEST 2005 i686 i686 i386 GNU/Linux
This is a 4-way Dell PowerEdge 6850 with 16 GB RAM, running in 32-bit mode
with hyperthreading on.
All NFS-exported file systems are XFS file systems and have quotas.
Currently running 64 NFS daemons.
Disk hardware includes a Dell MegaRAID controller and an LSI e2400 RAID
controller.
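
For reference, the nfsd thread count is set the usual Red Hat way; a minimal
sketch, assuming the stock init script reads /etc/sysconfig/nfs:

   # /etc/sysconfig/nfs on the server
   RPCNFSDCOUNT=64
   # restart to pick up the new thread count
   service nfs restart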

Kernel 2.4.21-32.0.1.EL.XFSsmp is the kernel recompiled from the same
.src.rpm as Red Hat Enterprise Linux 3 update 5, with XFS patches applied.

Two NFS clients are running 2.4.21-37.ELsmp, namely fnpcg and fngp-osg.
These are grid nodes and heavy consumers of NFS.

Some 200 worker-node NFS clients are running the 2.4.21-27.0.2.ELsmp kernel
(from Red Hat Enterprise Linux 3 update 4), with an average of 3 mounts per
worker node.
All linux-to-linux NFS access is version 3 over TCP.


Problem #1: "Failed to mount xxxxxxx"
Seen in /var/log/messages of normal worker nodes and grid nodes.
All home and staging areas are mounted on worker nodes via the
automounter.  We see on average 10-12 failures a day over a sample
of 150 worker nodes.  Each failure can cause a job to fail.
There is no corresponding error message on the server when this happens.
The failures happen more often when the server is under high load.
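
As a rough way to quantify these, illustration only (the exact automounter
message text may vary slightly between nodes):

   # count the day's automounter failures on one worker node
   grep -c "Failed to mount" /var/log/messages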

Problem #2:
Sep 13 04:40:24 fngp-osg kernel: RPC:      tcp_data_ready socket info not found!
Sep 13 04:40:26 fngp-osg kernel: nfs_safe_remove: 335e33085de705df938049bc260619
/lock busy, d_count=2
These two messages are seen only on the client nodes that are running the
2.4.21-32.0.1 kernel.  Sometimes they occur together, but more usually the
nfs_safe_remove message occurs alone.

If we investigate the file "lock" that the process is trying to remove, we
find the following on the client:

An ls of the directory the file is in does not show the file.
"ls lock" does show the file.
"rm lock" claims the file does not exist.
Here is a portion of the strace of the process in question:

link("/home/cdf/.globus/.gass_cache/global/md5/76/f3/51/365016f6aebcebeb671288a2ff/data", "/home/cdf/.globus/.gass_cache/global/md5/76/f3/51/365016f6aebcebeb671288a2ff/lock") = -1 EEXIST (File exists)
stat64("/home/cdf/.globus/.gass_cache/global/md5/76/f3/51/365016f6aebcebeb671288a2ff/lock", {st_mode=S_IFREG|0755, st_size=458, ...}) = 0
time(NULL)                              = 1130864802
unlink("/home/cdf/.globus/.gass_cache/global/md5/76/f3/51/365016f6aebcebeb671288a2ff/lock") = -1 ENOENT (No such file or directory)


On the server the file does not show up at all.

A umount and remount of the file system on the client clears the problem,
but that is not an operation we can always do.  The undeletable file causes
a large number of client processes to get caught in a tight loop of
stat'ing, unlinking, and linking the same file over and over again.  The
only recourse is to kill the processes in question.
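
A hedged illustration of how we confirm the loop on an affected client; the
PID here is hypothetical:

   # attach to one of the spinning processes and watch the syscall loop
   strace -p 12345 -e trace=link,stat64,unlink

The output shows the process cycling through link(), stat64(), and unlink()
on the same lock path, matching the strace excerpt above.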

2a) We also see some stale NFS file handles on these systems, likewise
causing hung processes.

3) Messages like the following appear in /var/log/messages on the server node:
Sep 20 05:03:31 fnpcsrv1 kernel: rpc-srv/tcp: nfsd: sent only -107 bytes 
of 132- shutting down socket
Sep 20 05:03:32 fnpcsrv1 kernel: rpc-srv/tcp: nfsd: sent only -107 bytes 
of 140- shutting down socket

These are most frequently associated with the two grid gatekeeper nodes,
but we now know that they can be associated with any node.  The NFS mailing
list archives suggested increasing the number of nfsd threads; we have
increased it to 64, but these errors continue.  The NFS mailing list says
the root cause is that nfsd tries to do a sendto and gets an ENOTCONN error
from the client.  We can see that error when we take a tcpdump of the
network traffic between client and server.
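
For completeness, the kind of capture we take; a sketch only, the interface
name and client host are placeholders:

   # on the server: capture NFS traffic to/from one client
   tcpdump -i eth0 -s 0 -w /tmp/nfs.pcap host fngp-osg.fnal.gov and port 2049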

------------------------------------------------------------------
Things we have done so far:

a) Increased TCP buffering on the server and the two busiest clients 
with the following tunes:


# increase Linux TCP buffer limits
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
net.core.wmem_default = 524288
net.core.rmem_default = 524288
net.ipv4.tcp_window_scaling = 1

# increase Linux autotuning TCP buffer limits
net.ipv4.tcp_rmem = 32768 87380 8388608
net.ipv4.tcp_wmem = 16384 65536 8388608
net.ipv4.tcp_mem = 8388608 8388608 8388608
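
These lines live in /etc/sysctl.conf; a minimal sketch of applying them
without a reboot, assuming the stock sysctl tool:

   # load the new values and spot-check one of them
   sysctl -p
   sysctl net.core.rmem_max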

b) Upgraded the two busiest clients to kernel 2.4.21-37.
c) Changed the NFS client mount options on 2 worker nodes.

To date:


1) The "Failed to mount" errors have continued when the nfs server is 
at high load.

2) We still see tcp_data_ready errors, nfs_safe_remove errors, hung
processes trying to delete the undeletable files, and hung processes due to
stale file handles.  The rate seems to be slightly lower than before.


3) rpc-srv/tcp messages continue at the previous rate.
4) The two worker nodes with changed mount options (bg removed, noac added)
    show the same behavior as before, still with "Failed to mount" errors
    and tcp_data_ready errors.
---------------------------------------------------------------

Contents of /etc/exports



[root@fnpcsrv1 nfs]# cat /etc/exports
/export/lsi_home     131.225.167.0/255.255.254.0(rw,insecure,insecure_locks,no_subtree_check,sync)
/export/lsi_stage     131.225.167.0/255.255.254.0(rw,insecure,insecure_locks,no_subtree_check,sync)
/export/products     131.225.167.0/255.255.254.0(rw,insecure,insecure_locks,no_subtree_check,sync)
/export/stage     131.225.167.0/255.255.254.0(rw,insecure,insecure_locks,no_subtree_check,sync)

Sample entry from /var/lib/nfs/xtab

/export/lsi_stage       fnpc131.fnal.gov(rw,sync,wdelay,hide,nocrossmnt,insecure,root_squash,no_all_squash,no_subtree_check,insecure_locks,acl,mapping=identity,anonuid=-2,anongid=-2)
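
A quick way to check the effective export options on the server,
illustration only:

   # show active exports and their options
   exportfs -v | grep lsi_stage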

Mount options on client:

/home           /etc/auto.home      -rw,bg,hard,tcp,timeo=15,retrans=8,intr,rsize=8192,wsize=8192 0 0

(These options were working fine against an IRIX server).
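
For reference, one of these automounted areas corresponds roughly to a
manual mount like the following; the export path and mount point are just
examples:

   mount -t nfs -o rw,bg,hard,tcp,timeo=15,retrans=8,intr,rsize=8192,wsize=8192 \
         fnpcsrv1:/export/lsi_home /mnt/test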

-------------------------------------------------------------------
All default nfs and nfsd proc entries.


[root@fnpcsrv1 nfs]# cat nfs3_acl_max_entries
1024
[root@fnpcsrv1 nfs]# cat nlm_grace_period
0
[root@fnpcsrv1 nfs]# cat nlm_tcpport
0
[root@fnpcsrv1 nfs]# cat nlm_timeout
10
[root@fnpcsrv1 nfs]# cat nlm_udpport
0
[root@fnpcsrv1 nfs]# pwd
/proc/sys/fs/nfs
[root@fnpcsrv1 nfs]#
[root@fnpcsrv1 nfsd]# cat nfsd3_acl_max_entries
1024
[root@fnpcsrv1 nfsd]# pwd
/proc/sys/fs/nfsd
[root@fnpcsrv1 nfsd]#

-----------------------------
Next steps:

1) Load the 2.4.21-37.EL.XFS kernel on the server during the 11/9 downtime.

2) Increase the command tag queuing depth on our RAID array, hopefully
improving the underlying disk performance and, by extension, the NFS
performance.

3) Change the client mount options: remove bg, add noac, rsize=32768,
wsize=32768 (a sample of the resulting entry is after this list).


4) Add nfs/nfsd options to modules.conf:
  a) on all NFS clients, add
 	options nfs nfs3_acl_max_entries=256
  b) on NFS servers, also add
 	options nfsd nfsd3_acl_max_entries=256
5) Reboot the server into the new kernel.

6) Add the same sysctl.conf options as above to the NFS clients.
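
A sketch of what the changed automounter entry from step 3 would look like,
assuming the remaining options stay as they are today:

   /home           /etc/auto.home      -rw,hard,tcp,timeo=15,retrans=8,intr,noac,rsize=32768,wsize=32768 0 0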





-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525  [log in to unmask]  http://home.fnal.gov/~timm/
Fermilab Computing Div/Core Support Services Dept./Scientific Computing Section
Assistant Group Leader, Farms and Clustered Systems Group
Lead of Computing Farms Team
