SCIENTIFIC-LINUX-USERS Archives

July 2009

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Jon Peatfield <[log in to unmask]>
Reply To:
Jon Peatfield <[log in to unmask]>
Date:
Wed, 22 Jul 2009 22:55:54 +0100
Last week we started to get fairly frequent segfaults from a number of 
daemons (mostly rpc.mountd and portmap but also a few other things). 
This was on our main home directory file server which is running sl44 
(with security updates of course).

At first we thought it was the recent kernel update to 2.6.9-89.0.3.EL 
(which we had updated to the day before) but rebooting into 
2.6.9-78.0.22.EL didn't seem to help.  We found that by loading up the 
machine we could trigger the segfaults within about 60 mins.

So naturally since mountd/portmap hadn't changed for *ages* we assumed 
that it was a memory problem and spent quite a long time testing it 
(though we hadn't seen a single ECC event in the BIOS logs or similar), 
but didn't find any failures.

We also checked that all the binaries and shared libs had the same md5 
checksums that we have on another box which didn't seem to have the 
problem (we don't do prelink so that is easy).

In desperation (this had been going on for a week) I checked what region 
of memory the segfaults were in and they were all much like:

   Jul 16 14:08:40 cingulum kernel: rpc.mountd[3316]: segfault at
      0000002b0557c8b8 rip 0000002a95981f0f rsp 0000007fbffff770 error 4

where rip corresponded (according to /proc/.../maps) to one of the regions 
of /lib64/tls/libc-2.3.4.so, which sort of points towards glibc, so I 
checked for similar sounding segfaults (in TUV bugzilla), and found a 
couple of plausible things - which were all marked as fixed - in the EL48 
version of glibc.
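
The lookup described above can be sketched in Python as follows. This is an illustration, not the procedure actually used: the function names are invented, and SAMPLE_MAPS is a made-up excerpt in /proc/<pid>/maps format (the real addresses, device and inode on the server are not shown in this message). The error-code decoding assumes the usual x86 page-fault bits.

```python
import re

# The kernel message from the post, including its line wrap.
LOG_LINE = (
    "Jul 16 14:08:40 cingulum kernel: rpc.mountd[3316]: segfault at\n"
    "      0000002b0557c8b8 rip 0000002a95981f0f rsp 0000007fbffff770 error 4"
)

# HYPOTHETICAL maps excerpt: only the layout matches a real maps file.
SAMPLE_MAPS = """\
2a95940000-2a95a70000 r-xp 00000000 08:01 12345 /lib64/tls/libc-2.3.4.so
2a95a70000-2a95b70000 ---p 00130000 08:01 12345 /lib64/tls/libc-2.3.4.so
"""

SEGFAULT_RE = re.compile(
    r"(?P<comm>\S+)\[(?P<pid>\d+)\]: segfault at\s+(?P<addr>[0-9a-f]+)"
    r"\s+rip\s+(?P<rip>[0-9a-f]+)\s+rsp\s+(?P<rsp>[0-9a-f]+)"
    r"\s+error\s+(?P<err>\d+)"
)


def parse_segfault(line):
    """Extract process name, pid and addresses from a 2.6-style segfault line."""
    m = SEGFAULT_RE.search(line)
    if m is None:
        return None
    return {
        "comm": m.group("comm"),
        "pid": int(m.group("pid")),
        "addr": int(m.group("addr"), 16),
        "rip": int(m.group("rip"), 16),
        "rsp": int(m.group("rsp"), 16),
        "error": int(m.group("err")),
    }


def decode_error(code):
    """Decode the low three bits of an x86 page-fault error code."""
    return {
        "page_present": bool(code & 1),  # 0 means the page was not mapped
        "write": bool(code & 2),         # 0 means the access was a read
        "user_mode": bool(code & 4),     # 1 means the fault came from user space
    }


def find_mapping(maps_text, address):
    """Return (start, end, perms, path) of the region containing address."""
    for line in maps_text.splitlines():
        fields = line.split(None, 5)
        if len(fields) < 5:
            continue
        start_s, end_s = fields[0].split("-")
        start, end = int(start_s, 16), int(end_s, 16)
        if start <= address < end:
            path = fields[5].strip() if len(fields) > 5 else ""
            return start, end, fields[1], path
    return None
```

Calling find_mapping(SAMPLE_MAPS, parse_segfault(LOG_LINE)["rip"]) returns the executable libc region in the sample data. On that reading, "error 4" is a user-mode read of an unmapped page. Note that the maps file has to be read while the process is still running, or from another instance with the same layout.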

So we updated glibc to the version from the sl48 tree and since then 
(nearly 5 days though that does include the weekend) we haven't had a 
single segfault - even when loading up the machine.

The most likely bug mentioned as fixed is some non-SMP-safe code, but I'm 
not really happy because:

   I don't think that mountd (or portmap etc) are likely to be
    multi-threaded apps so I'm not really sure why SMP safeness is relevant

   I have no good explanation as to why we only started seeing this
    recently given that we have been using the same glibc for a very long
    time

   Another server which has the same hardware (but doesn't serve much NFS),
    and the same software (sl44 but we haven't updated glibc etc) hasn't
    shown the effect even under fairly heavy load

Of course the patterns of use might make one box more likely to show the 
problem, but it doesn't feel like a very satisfactory conclusion.

I'm also reluctant to claim this is 'solved' because I don't want to tempt 
the bug/problem back into action...

Anyway, in case anyone else is pulling their hair out over something 
similar: updating glibc _may_ help.

  -- Jon
