SCIENTIFIC-LINUX-USERS Archives

February 2005

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
"Bly, MJ (Martin)" <[log in to unmask]>
Reply To:
Bly, MJ (Martin)
Date:
Thu, 10 Feb 2005 08:41:16 +0000
Steve et al,

Having now read through the description of the problem (which wasn't
available to me when I created the problem description quoted below!), I
don't think it's the same problem as reported by Devin.

The problem we see is absolutely fatal and there is no way out -
processes do not complete and there is no I/O blocking as described...

That said, I've seen LVM implicated in some NFS-related fatal lockups of
a different variety.

Martin.


> -----Original Message-----
> From: [log in to unmask] [mailto:[log in to unmask]]
> On Behalf Of Steve Traylen
> Sent: 09 February 2005 19:59
> To: Devin Bougie
> Cc: [log in to unmask]
> Subject: Re: NFS server I/O blocking
> 
> 
> On Wed, Feb 09, 2005 at 07:02:46PM -0000 or thereabouts, Devin Bougie wrote:
> > Hi All,
> > 
> > We have been struggling with what appears to be a bug in RH kernels.
> > This results in local disk I/O blocking all NFS I/O (and, as we show,
> > subsequent local I/O).  This appears easy to reproduce:
> > ----
> > 1. Test access from the nfs server to the exported disk:
> > [root@server]# time touch /mnt/disk/testlocal
> > 2. Test access from an nfs client to the nfs mounted disk:
> > [root@client]# time touch /nfs/server/disk/clienttest
> > 3. Start local I/O on the nfs server:
> > [root@server]# dd if=/dev/zero of=/mnt/disk/zero bs=1K count=10M
> > 4. While the dd is running, test I/O on the nfs server to the exported
> >    disk:
> > [root@server]# time touch /mnt/disk/testlocal2
> > 5. While the dd is running, access the exported disk from the nfs client:
> > [root@client]# time touch /nfs/server/disk/clienttest2
> > 6. One last time, while the dd is running, test access to the exported
> >    disk from the nfs server:
> > [root@server]# time touch /mnt/disk/testlocal3
> > ----
> > 
> > After these steps, the last two 'touch' commands take anywhere from 30
> > seconds to 3 minutes to complete.
> > 
> > We have reproduced this using ext2, ext3, and jfs; with and without 
> > lvm, scsi, and RAID; and with various RH kernels on RH9, RHEL3, and 
> > FC3.  However, RH7.3 does not have this same problem.
> > 
> > We opened a bugzilla bug
> > (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=139937), but have
> > not yet gotten any resolution.  For the time being (and on disk servers
> > where it is possible) we are restricting local logins to avoid running
> > into this.
> > 
> > Has anyone on this list experienced similar problems?  If so, what have
> > you done about it?
> 
> Hi Devin,
> 
>   The following info is from Martin Bly.
> 
> <quote>
> All,
> 
> This problem affects SL3 and RHEL nfs clients accessing nfs-exported
> filesystems hosted by most (all) nfs servers (linux, Solaris).
> 
> What happens is the client expects a reply from the server for some access
> function: the server sends the response but the client doesn't receive it -
> the kernel nfs layer loses it.  The client is hung and only a reboot will
> free it.  Other clients, and the same client accessing other areas of the
> same file system, are still able to do so until they hit the locked
> file/directory.
> 
> SLAC spotted this with RHEL - and we and they suffer, as we both did in some
> RH7.3 nfs clients.  SLAC escalated to RedHat who eventually provided a hot
> fix - this is a binary kernel distribution for which they don't release the
> source.  I can't get access to the fix (but haven't asked - we *might* be
> able to claim the fix via our single RHEL installation but it's doubtful).
> It is expected the fix will appear in RHEL3 Update 5 (not 4).  I don't think
> RHEL4 will suffer - different kernel - BUT:
> 
> This is actually a problem with a patch to the 2.7 kernel back-ported to 2.6
> and then 2.4.  RedHat passed their fix to the NFS developers and it appears
> they are going to back out the patch rather than implement the fix.  I don't
> know which patch it is (I'd back it out myself...)
> 
> So it is a client side problem - I'd not expect a fix for RH7.3 soon if ever.
> 
> Martin.
> </quote>
> 
> 
> 
> 
> > 
> > Thanks in advance for any thoughts.
> > 
> > Devin
> > 
> > --------------------
> > Devin Bougie
> > Laboratory for Elementary-Particle Physics
> > Computer Group
> > [log in to unmask]
> 
> -- 
> Steve Traylen
> [log in to unmask]
> http://www.gridpp.ac.uk/
> 
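
For anyone wanting to repeat the server-side part of Devin's test (steps 3
and 4 in the quoted message) without typing it by hand, something along these
lines might do. It is only a sketch: the function names time_touch and
touch_under_load are made up for illustration, the elapsed time is reported
in whole seconds only, and the client-side touches would still need to be
run separately on an nfs client.

```shell
#!/bin/sh
# Sketch of the server-side reproduction: time a touch while dd is writing.
# Function names are illustrative assumptions, not from the original report.

# Print how many whole seconds it takes to touch the given file.
time_touch() {
    start=$(date +%s)
    touch "$1"
    end=$(date +%s)
    echo $(( end - start ))
}

# Start a background dd of $2 blocks of 1024 bytes into directory $1,
# time a touch in the same directory while it runs, then wait for the
# dd to finish and remove its output file.
touch_under_load() {
    dir=$1
    count=$2
    dd if=/dev/zero of="$dir/zero" bs=1024 count="$count" 2>/dev/null &
    dd_pid=$!
    time_touch "$dir/testlocal2"
    wait "$dd_pid"
    rm -f "$dir/zero"
}
```

With the paths from the original report this would be called as
"touch_under_load /mnt/disk 10M" (GNU dd accepts the 10M count suffix); a
result of 30 or more would match the delays Devin describes.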
