SCIENTIFIC-LINUX-USERS Archives

April 2008

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From: Riccardo Murri <[log in to unmask]>
Reply To: Riccardo Murri <[log in to unmask]>
Date: Thu, 24 Apr 2008 14:59:26 +0200
Content-Type: text/plain
Parts/Attachments: text/plain (80 lines)
Hello,

a problem with Scientific Linux CERN 4: NFS clients at our site
sometimes get stuck, with pdflush eating 100% CPU time and the NFS
mounts no longer responding: every process that tries to access the
stuck NFS mountpoint hangs.

One pdflush kernel thread is keeping a CPU at nearly 100%::

  # ps auxwww | grep pdflush
  root     12499 80.0  0.0     0    0 ?        R    Apr20 4338:56 [pdflush]
  root     19992  0.0  0.0     0    0 ?        S    Apr20   0:00  [pdflush]

  # ps -w -O ppid,cpu,wchan:20 12499 19992
    PID  PPID CPU WCHAN                S TTY          TIME COMMAND
  12499    19   - -                    R ?        3-00:20:59 [pdflush]
  19992    24   - pdflush              S ?        00:00:00 [pdflush]
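
For reference, the wait channel of a suspect kernel thread can also be
read directly from ``/proc``; a sketch (it defaults to the current
shell's PID only so it runs anywhere -- substitute the spinning
pdflush's PID, 12499 above):

```shell
# Sketch: read a task's wait channel and state straight from /proc.
# PID defaults to the current shell so the snippet runs as-is;
# in practice substitute the spinning pdflush's PID (12499 above).
PID=${PID:-$$}
echo "wchan: $(cat /proc/$PID/wchan)"
echo "state: $(awk '{print $3}' /proc/$PID/stat)"
```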

We've observed NFS being stuck this way only on clients with
*write* access, when some intensive I/O is done on the NFS tree;
clients with *read-only* access perform just fine, even in the
face of a server failure -- this points towards ``pdflush`` being
the culprit of the NFS hang.
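
The kind of load that triggers it can be approximated with a plain
sequential write; a sketch (the target path here is a placeholder, not
our real mountpoint -- in a real test it would sit on the NFS tree):

```shell
# Sketch of the write load that triggers the hang on our rw clients.
# TARGET is a placeholder; in a real test it would live on the NFS mount.
TARGET=${TARGET:-/tmp/nfs-write-test}
dd if=/dev/zero of="$TARGET" bs=1M count=64 2>/dev/null
sync   # force write-back, which is where pdflush gets involved
ls -l "$TARGET"
```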

The situation looks exactly like the one described in this LKML post: 
  http://www.ussg.indiana.edu/hypermail/linux/kernel/0404.3/0744.html
However, we're running kernel 2.6.9, while the post refers to
2.6.6-rc3::

  # uname -a
  Linux wn01.lcg.cscs.ch 2.6.9-67.0.7.EL.cernsmp #1 SMP Wed Mar 19 09:38:54 CET 2008 i686 athlon i386 GNU/Linux

Other facts that may (or may not) be relevant:

  * NFS client mount options:

      rw,hard,intr,nodev,nosuid,rsize=32768,wsize=32768

  * NFS server export options:

      rw,no_root_squash,no_subtree_check

    The export options for the NFS read-only clients are a bit different:

      ro,async,no_root_squash,no_subtree_check

  * the hung clients can mount and access other NFS partitions, even
    from the same NFS server::

      # mkdir /tmp/mnt
      # mount nfs-server:/local2/lcg-nfs/ks /tmp/mnt
      # ls /tmp/mnt
      LCG  
      # df -h /tmp/mnt
      Filesystem         Size  Used Avail Use% Mounted on
      nfs-server:/local2/lcg-nfs/ks
                         2.0T  1.5T  444G  77% /tmp/mnt
      # umount /tmp/mnt
      # rmdir /tmp/mnt

  * trying to mount the hung NFS mountpoint on another mountpoint
    hangs the ``mount`` command::

      # ps -O cpu,wchan:20 19734
        PID CPU WCHAN                S TTY          TIME COMMAND
      19734   - -                    D pts/32   00:00:00 mount
        nfs-server:/local2/experiment-software /tmp/m1
 
    The hung ``mount`` process does not respond to *any* signal.

  * we use autofs4 to manage the NFS mountpoints, but other
    mountpoints (even on the same NFS server) still respond normally.
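
Put together, the client and server configuration look roughly like
this (the fstab mountpoint and the client hostnames are placeholders,
not our real ones):

```
# client /etc/fstab entry (mountpoint is a placeholder):
nfs-server:/local2/experiment-software  /nfs/software  nfs  rw,hard,intr,nodev,nosuid,rsize=32768,wsize=32768  0 0

# server /etc/exports (client hostnames are placeholders):
/local2/experiment-software  rw-client(rw,no_root_squash,no_subtree_check)
/local2/experiment-software  ro-client(ro,async,no_root_squash,no_subtree_check)
```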


So far we have been unable to do anything except reboot the hung nodes.
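For completeness, the processes queued behind the stuck mount can be
enumerated like so (a sketch; on a hung client anything in
uninterruptible sleep on the mountpoint is a casualty):

```shell
# Sketch: list processes in uninterruptible sleep (state D) together
# with the kernel function they are waiting in.
ps -eo pid,stat,wchan:25,comm | awk 'NR==1 || $2 ~ /^D/'
```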
Any clue?

Regards,
Riccardo
