SCIENTIFIC-LINUX-USERS Archives

December 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:     Jon Peatfield <[log in to unmask]>
Reply To: Jon Peatfield <[log in to unmask]>
Date:     Wed, 27 Dec 2006 18:43:23 +0000
On Wed, 27 Dec 2006, Doug Johnson wrote:

>> Miles O'Neal wrote:
>>> Doug,
>>>
>>> |I am at a loss as to how to debug the following problem. I have a
>>> |machine (Dual 800 MHz P3) running SL303 with all the latest available
>>> |patches; kernel version 2.4.21-47.0.1.EL. The machine serves as a
>>> |gateway between 2 networks, runs NIS and is an NFS server. I have also
>>> |installed fallback-reboot. Aside from that it is a pretty vanilla
>>> |machine. (I probably have more than 20 other machines that are identical
>>> |to this machine except that they are not gateways/routers.)
>>> |Unfortunately the machine is a bit unstable. It will run for between 1
>>> |and 4 days after which all access to the machine stops except that it
>>> |continues to pass packets between the 2 networks, i.e., routing works
>>> |and it answers pings, but NFS mounts, ssh, etc all fail.

Can I assume that an rpcinfo -p at the box doesn't show much either?
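
For comparison, the same check can be run from a neighbouring machine while
the box is wedged; "gateway" below is just a stand-in for the real hostname:

    rpcinfo -p gateway        # ask the portmapper for registered RPC services
    rpcinfo -t gateway nfs    # TCP ping the NFS service itself
    rpcinfo -u gateway mountd # UDP ping mountd

On a working server all three should answer quickly.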

>>> | I can not even
>>> |connect to the fallback-reboot client on the machine which should be
>>> |sitting in memory. The machine runs in init level 3 and generally does
>>> |not have a monitor connected to it. But even if there is a monitor, once
>>> |it goes into this "failed" state the monitor is blank and does not
>>> |respond to key strokes. I also don't see anything unusual in the system
>>> |log files; the machine just stops logging errors at the time that it
>>> |fails. The failures generally (but not always) occur around 4AM, the
>>> |time that cron jobs run.

If for some reason syslog/klog is failing to talk to the disks but is still 
running, you may get some logs if you alter /etc/syslog.conf to tell it to 
log to a remote server as well.  Of course if the machine is logging 
lots you probably don't want to do this for *all* syslog messages.

This may help detect the problem if the fault is caused by (say) the disk 
(or disk interface) failing for some reason.
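
A minimal sketch of the remote-logging change, assuming the collector is a
machine called "loghost" (name made up here) whose syslogd is started with
-r so it accepts network messages:

    # on the gateway, in /etc/syslog.conf -- duplicate the interesting
    # facilities to the remote collector as well as the local files
    kern.*;daemon.*;syslog.*                @loghost

    # on loghost (Red Hat style systems), in /etc/sysconfig/syslog
    SYSLOGD_OPTIONS="-r -m 0"

then restart syslog on both ends (service syslog restart).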

>>> Things that I have already tried:
>>> |
>>> |	1) Replaced Ethernet cards
>>> |	2) Replaced system drive and reinstalled OS (disk was showing
>>> |           wear via smartd utilities).
>>> |	3) Rebooted using non SMP kernel
>>> |	4) Replaced power supply
>>> |	5) All disks in system pass fsck tests
>>> |
>>> |I don't know enough about how the router/gateway works to understand how
>>> |that can continue to function when everything else seems to have stopped
>>> |working. Any thoughts would be greatly appreciated.

If you 'halt' a Linux machine you can see the same behaviour -- if you 
don't have shutdown scripts which turn off the networking of course!

As far as the kernel is concerned a halted system is nothing special; it is 
just that (almost) all userland processes are dead.  I.e. the kernel will 
continue to process requests etc. and if networking is left up it will 
still deal with them as expected.

Now of course if you shutdown/halt a machine then normally it would turn 
off the networking as part of the normal shutdown scripts but if the 
userland processes die for some 'strange' reason then it can easily end up 
being left like this...

>>> We have seen similar problems on compute and desktop
>>> servers for several years, running various versions
>>> of RH and now SL304.  It's gotten better but we still
>>> see it.  The things we have found to cause this include:
>>>
>>> - memory problems (run diags)
>>> - mobo/cpu problems (can be hard to trace)
>>> - process/OS weirdness
>>> - just running out of memory
>>>
>>> I would hope the latter wouldn't be a problem
>>> with a router, but it may be worth remotely
>>> connecting (rsh, ssh, whatever) and running
>>> "top -d 1" on an xterm or something.  If it's
>>> anything like the same problems we have, the
>>> display will either just freeze, or the net
>>> will drop the connection and you'll see the
>>> last things top knew about.  This has helped
>>> us track down running out of memory and rogue
>>> processes.
>>>
>>> -Miles
>>
>> If it is somewhat regularly dying at 4:00 a.m., I would suspect
>> something in the cron jobs.  I'd do a
>>
>> ls /etc/cron.*
>>
>> and see what's happening.
>> I'd then start removing things I either didn't really need, or that
>> might be the problem.
>>
>> Looking at one of my SL3 machines I have
>> /etc/cron.daily:
>> 00-logwatch  logrotate        prelink  slocate.cron  yum.cron
>> 0anacron     makewhatis.cron  rpm      tmpwatch
>>
>> /etc/cron.weekly:
>> 0anacron  makewhatis.cron
>>
>> anacron - I wouldn't worry about.
>> makewhatis - goes through your man pages setting up the -k function.
>> I'd take this out just cuz it really sucks up the CPU for the period of
>> time, and do your man page searches on a different machine.
>> prelink - could be a problem. Possibly it's catching something bad.  But
>> I've never seen this happen before.
>> rpm  - goes through your rpm's.  Possible, but I've never seen it do
>> anything bad before.
>> tmpwatch - Also, I've never seen it do anything bad
>> yum.cron - possibly could be sucking up all your memory.  Disable it
>> temporarily to see if it helps.
>> slocate.cron - if it's turned on, and you have a lot of disk, or possibly
>> some network disk, this could do something.  Also, if you never use
>> "locate", then you don't need this.
>>
>> This brings us to my two most likely suspects.
>> *If* you have logging turned on for your router, your log files can get
>> pretty big.  The last two, logwatch, and logrotate, both are accessing
>> those log files, both could potentially be either sucking up too much
>> memory, or even filling up the disk temporarily.
>> Now, you can safely turn off logwatch.  You just won't get your nightly
>> e-mail of what was going on.
>> But, if you turn off logwatch ... make sure you have some way of not
>> letting your log files grow extra big.
>>
>> Or, you might have something else in your cron jobs that I don't know
>> about.  I'd look.
>
> Greetings,
>
> I'd like to thank everyone for their suggestions. At this point I have:
>
> 	1) Trimmed /etc/cron.daily to:
>            logrotate  prelink  rpm  tmpwatch  yum.cron
>
> 	2) Changed some lines in /etc/crontab to:
>
> # 10 4 * * * root run-parts /etc/cron.daily
> 05 4 * * * root /etc/cron.daily/logrotate
> 08 4 * * * root /etc/cron.daily/prelink
> 10 4 * * * root /etc/cron.daily/rpm
> 12 4 * * * root /etc/cron.daily/tmpwatch
> 14 4 * * * root /etc/cron.daily/yum.cron
>
> 	3) I have remote windows opened to top and df with updates every
>           5 minutes.
>
> 	4) I also wrote a monitoring script that checks the status of
>           NFS and running processes. It logs the information to
> 	   /tmp and emails the information once per day at 8:00AM
>
> Just some added information, the machine has 1 GB RAM and ~4 GB of swap,
> so I doubt that it is running out of memory; the memory passes the BIOS
> memory check. I would expect faulty memory to result in a complete
> machine crash. (Could be wrong on that one though.) I still don't see
> how everything stops working except routing. I would have assumed that
> routing requires a running kernel?

Yup a running kernel is needed but not necessarily anything else.  For 
example if a lump of memory is bad you can get any process which happens 
to try to use it getting killed (normally this *should* log something).

If the same bad memory then gets used for the next process to start then 
it will also fail etc etc.  At the point where logrotate (etc) HUPs 
syslogd it may try to allocate more memory and also die (causing no more 
logs to happen).
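
For reference, that HUP comes from the stock logrotate config; on an SL3 box
/etc/logrotate.d/syslog contains something roughly like this (check your own
copy for the exact file list):

    /var/log/messages /var/log/secure /var/log/maillog /var/log/cron {
        sharedscripts
        postrotate
            /bin/kill -HUP `cat /var/run/syslogd.pid 2> /dev/null` 2> /dev/null || true
        endscript
    }

so a syslogd that dies on that signal quietly takes your logging with it.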

Any cron jobs which restart things or just use lots of memory are likely 
to trigger death if the memory is bad.

The BIOS memory checks are almost entirely worthless.  Many common memory 
faults will simply not be detected by them.

To properly check the memory you can run memtest86 (or memtest86+) for at 
least a couple of days.  Run it for longer if you have lots of memory or a 
slow CPU...
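
If you don't have boot media handy, writing the memtest86+ floppy image is
just a dd; the file name below is whatever the release you download calls
it, not something to copy literally:

    # hypothetical image name -- use the one shipped in the tarball you grab
    dd if=memtest86+-1.65.floppy.bin of=/dev/fd0 bs=1440k

(there is also an ISO image if the machine has no floppy drive).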

This is tedious but the only way to be fairly confident in the memory. 
I've had machines which will show up a memory fault about once a week 
running with memtest but fail at a higher rate in normal use -- the 
particular patterns of access which fail may not be tested very frequently 
even by memtest.  ECC memory is better at this for obvious reasons, but 
still not immune from odd failure modes.

> Unfortunately, this is a wait and see method of debugging. I don't want
> to resort to replacing the CPU/MB yet, but perhaps this is next on the
> list. Part of a complete solution is to stop using the machine as a
> router and use a port on a real router/switch to isolate the subnet.

If you have spare machines of the same type try swapping one of those with 
the one in service while you test the hardware.

If you don't have spare hardware then try taking out half the memory.  If 
that doesn't help then try swapping it with the other half...  You can do 
a binary-chop to find faulty dimms (or sockets)...

> 	Thank you for your suggestions,
> 	doug

Finally, faults can sometimes show up as the most bizarre effects.  E.g. not 
relevant to you, but we had a KVM with 4 user-ports; 2 worked and the 
others didn't.  The maker claimed it all worked fine for them when we sent 
it back.  Eventually they sent an engineer out to watch it fail at our 
site (which it did); he checked the firmware and found that it was a bug in 
the code which interpreted the DDC info from the monitors -- and it only 
affected *that kind of port*...  In their tests they used a different 
monitor which didn't show up the bug.

  -- Jon
