Hello,
we have a problem with Scientific Linux CERN 4: NFS clients at our
site sometimes get stuck, with pdflush eating 100% CPU time and the
NFS mounts not responding. Every process that tries to access the
stuck NFS mountpoint hangs.
One pdflush kernel thread is keeping a CPU at nearly 100%::
# ps auxwww | grep pdflush
root 12499 80.0 0.0 0 0 ? R Apr20 4338:56 [pdflush]
root 19992 0.0 0.0 0 0 ? S Apr20 0:00 [pdflush]
# ps -w -O ppid,cpu,wchan:20 12499 19992
PID PPID CPU WCHAN S TTY TIME COMMAND
12499 19 - - R ? 3-00:20:59 [pdflush]
19992 24 - pdflush S ? 00:00:00 [pdflush]
We've observed NFS being stuck this way only on clients with
*write* access, when some intensive I/O is done on the NFS tree;
clients with *read-only* access perform just fine, even in the
face of a server failure -- this points towards ``pdflush`` being
the culprit of the NFS hang.
The situation looks exactly like the one described in this LKML post:
http://www.ussg.indiana.edu/hypermail/linux/kernel/0404.3/0744.html
However, we're running kernel 2.6.9, while the post refers to
2.6.6-rc3::
# uname -a
Linux wn01.lcg.cscs.ch 2.6.9-67.0.7.EL.cernsmp #1 SMP Wed Mar 19 09:38:54 CET 2008 i686 athlon i386 GNU/Linux
Other facts that may (or may not) be relevant:
* NFS client mount options:
rw,hard,intr,nodev,nosuid,rsize=32768,wsize=32768
* NFS server export options:
rw,no_root_squash,no_subtree_check
The export options for NFS read-only clients are a bit different:
ro,async,no_root_squash,no_subtree_check
* the hung clients can mount and access other NFS partitions, even
from the same NFS server::
# mkdir /tmp/mnt
# mount nfs-server:/local2/lcg-nfs/ks /tmp/mnt
# ls /tmp/mnt
LCG
# df -h /tmp/mnt
Filesystem Size Used Avail Use% Mounted on
nfs-server:/local2/lcg-nfs/ks
2.0T 1.5T 444G 77% /tmp/mnt
# umount /tmp/mnt
# rmdir /tmp/mnt
* trying to mount the hung NFS mountpoint on another mountpoint
hangs the ``mount`` command::
# ps -O cpu,wchan:20 19734
PID CPU WCHAN S TTY TIME COMMAND
19734 - - D pts/32 00:00:00 mount nfs-server:/local2/experiment-software /tmp/m1
The hung ``mount`` process does not respond to *any* signal.
* we use autofs4 to manage the NFS mountpoints, but other
mountpoints (even on the same NFS server) still respond normally.
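For anyone who wants to check their own nodes for the same symptoms, here is a minimal sketch of a filter that picks out processes in uninterruptible sleep (state ``D``, like the hung ``mount`` above) together with any pdflush thread that is actually running. The helper name ``find_stuck`` is made up for illustration, and the sketch assumes a procps ``ps`` that accepts ``-eo pid=,stat=,comm=``; it only lists candidates and does not diagnose the cause.

```shell
#!/bin/sh
# find_stuck: hypothetical helper, not part of any package.
# Reads "PID STAT COMMAND" lines on stdin and keeps:
#   - processes in uninterruptible sleep (STAT starts with D)
#   - pdflush kernel threads that are currently running (STAT starts with R)
find_stuck() {
    awk '$2 ~ /^D/ || ($2 ~ /^R/ && $3 == "pdflush") { print $1, $2, $3 }'
}

# Usage on a live node (assumes procps ps):
#   ps -eo pid=,stat=,comm= | find_stuck
```

A pdflush thread that shows up here on every run, rather than occasionally, would match the behaviour of PID 12499 above.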
So far, the only thing we have been able to do is reboot the hung nodes.
Any clue?
Regards,
Riccardo