SCIENTIFIC-LINUX-USERS Archives

February 2005

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject: Re: NFS server I/O blocking
From: Devin Bougie <[log in to unmask]>
Reply To: Devin Bougie <[log in to unmask]>
Date: Thu, 10 Feb 2005 10:11:53 -0500
Content-Type: text/plain

Thanks Steve and Martin,

On Feb 10, 2005, at 3:41 AM, Bly, MJ (Martin) wrote:
> Having now read through the description of the problem (which wasn't
> available to me when I created the problem description quoted below!),
> I don't think it's the same problem as reported by Devin.

Yes, I agree these are separate issues.  The problem we're seeing
affects all NFS clients (Solaris, Tru64, Linux, ...) accessing
NFS-exported filesystems hosted by a RH (RH9, RHEL3, FC3, ...) NFS
server.  We see it both with and without LVM.

Devin

>
> The problem we see is absolutely fatal and there is no way out -
> processes do not complete and there is no I/O blocking as described...
>
> That said, I've seen LVM implicated in some NFS-related fatal lockups
> of a different variety.
>
> Martin.
>
>
>> -----Original Message-----
>> From: [log in to unmask]
>> [mailto:[log in to unmask]] On
>> Behalf Of Steve Traylen
>> Sent: 09 February 2005 19:59
>> To: Devin Bougie
>> Cc: [log in to unmask]
>> Subject: Re: NFS server I/O blocking
>>
>>
>> On Wed, Feb 09, 2005 at 07:02:46PM -0000 or thereabouts,
>> Devin Bougie wrote:
>>> Hi All,
>>>
>>> We have been struggling with what appears to be a bug in RH kernels.
>>> This results in local disk I/O blocking all NFS I/O (and, as we show,
>>> subsequent local I/O).  This appears easy to reproduce:
>>> ----
>>> 1. test access from the nfs server to the exported disk:
>>> [root@server]# time touch /mnt/disk/testlocal
>>> 2. test access from an nfs client to the nfs mounted disk:
>>> [root@client]# time touch /nfs/server/disk/clienttest
>>> 3. start local I/O on the nfs server:
>>> [root@server]# dd if=/dev/zero of=/mnt/disk/zero bs=1K count=10M
>>> 4. while the dd is running, test I/O on the nfs server to the
>>> exported disk:
>>> [root@server]# time touch /mnt/disk/testlocal2
>>> 5. while the dd is running, access the exported disk from the nfs
>>> client:
>>> [root@client]# time touch /nfs/server/disk/clienttest2
>>> 6. one last time, while the dd is running, test access to the
>>> exported disk from the nfs server:
>>> [root@server]# time touch /mnt/disk/testlocal3
>>> ----
>>>
>>> After these steps, the last two 'touch' commands take anywhere from
>>> 30 seconds to 3 minutes to complete.
>>>
>>> We have reproduced this using ext2, ext3, and jfs; with and without
>>> lvm, scsi, and RAID; and with various RH kernels on RH9, RHEL3, and
>>> FC3.  However, RH7.3 does not have this same problem.
>>>
>>> We opened a bugzilla bug
>>> (https://bugzilla.redhat.com/bugzilla/show_bug.cgi?id=139937), but
>>> have not yet gotten any resolution.  For the time being (and on disk
>>> servers where it is possible) we are restricting local logins to
>>> avoid running into this.
>>>
>>> Has anyone on this list experienced similar problems?  If so, what
>>> have you done about it?
>>
>> Hi Devin,
>>
>>   The following info is from Martin Bly.
>>
>> <quote>
>> All,
>>
>>
>> This problem affects SL3 and RHEL nfs clients accessing
>> nfs-exported filesystems hosted by most (all) nfs servers
>> (linux, Solaris).
>>
>>
>> What happens is the client expects a reply from the server for some
>> access function: the server sends the response but the client doesn't
>> receive it - the kernel nfs layer loses it.  The client is hung and
>> only a reboot will free it.  Other clients, and the same client
>> accessing other areas of the same file system, are still able to do
>> so until they hit the locked file/directory.
>>
>>
>> SLAC spotted this with RHEL - and we and they suffer, as we both did
>> in some RH7.3 nfs clients.  SLAC escalated to RedHat who eventually
>> provided a hot fix - this is a binary kernel distribution for which
>> they don't release the source.  I can't get access to the fix (but
>> haven't asked - we *might* be able to claim the fix via our single
>> RHEL installation but it's doubtful).  It is expected the fix will
>> appear in RHEL3 Update 5 (not 4).  I don't think RHEL4 will suffer -
>> different kernel - BUT:
>>
>>
>> This is actually a problem with a patch to the 2.7 kernel back-ported
>> to 2.6 and then 2.4.  Redhat passed their fix to the NFS developers
>> and it appears they are going to back out the patch rather than
>> implement the fix.  I don't know which patch it is (I'd back it out
>> myself...)
>>
>>
>> So it is a client side problem - I'd not expect a fix for RH7.3 soon
>> if ever.
>>
>> Martin.
>> </quote>
>>
>>
>>
>>
>>>
>>> Thanks in advance for any thoughts.
>>>
>>> Devin
>>>
>>> --------------------
>>> Devin Bougie
>>> Laboratory for Elementary-Particle Physics
>>> Computer Group
>>> [log in to unmask]
>>
>> -- 
>> Steve Traylen
>> [log in to unmask]
>> http://www.gridpp.ac.uk/
>>
>>
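
The reproduction steps quoted above can be run as one script on the NFS
server.  What follows is only a minimal sketch and not something from
the original report: the function names time_touch and run_test, the
DIR and COUNT_KB arguments, the one-second pause, and the use of
`date +%s` (one-second resolution, widely available but not strictly
POSIX) are all assumptions made for illustration.  It starts the dd in
the background and times three touches in the same directory while the
dd is running.

```shell
#!/bin/sh
# Sketch only: function names and arguments are illustrative assumptions.

# time_touch FILE
# Create (or update) FILE and print how many whole seconds that took.
time_touch() {
    tt_start=$(date +%s)
    touch "$1" || return 1
    tt_end=$(date +%s)
    echo $((tt_end - tt_start))
}

# run_test DIR [COUNT_KB]
# Start a background dd writing COUNT_KB blocks of 1024 bytes to DIR/zero,
# then time three touches in DIR.  The default of 10485760 blocks matches
# the "bs=1K count=10M" (10 GiB) used in the report.
run_test() {
    rt_dir=$1
    rt_count=${2:-10485760}

    dd if=/dev/zero of="$rt_dir/zero" bs=1024 count="$rt_count" 2>/dev/null &
    rt_pid=$!

    # Assumption: a short pause so dd has begun writing before we measure.
    sleep 1

    rt_i=1
    while [ "$rt_i" -le 3 ]; do
        echo "touch $rt_i: $(time_touch "$rt_dir/testlocal$rt_i") s"
        rt_i=$((rt_i + 1))
    done

    # Stop dd if it is still running, then remove its output file.
    kill "$rt_pid" 2>/dev/null || true
    wait "$rt_pid" 2>/dev/null || true
    rm -f "$rt_dir/zero"
}

# Run only when a directory is given on the command line.
if [ -n "${1:-}" ]; then
    run_test "$@"
fi
```

On the server this might be invoked as `sh nfs_block_test.sh /mnt/disk`
(the script name is hypothetical).  On a client, time_touch can be
pointed at a path under the NFS mount, such as
/nfs/server/disk/clienttest2, while the server-side dd is still
running.  Touch times in the range of 30 seconds to 3 minutes would
correspond to the behaviour described in the report.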

--------------------
Devin Bougie
Laboratory for Elementary-Particle Physics
Computer Group
[log in to unmask]
(607) 254-8353
