SCIENTIFIC-LINUX-USERS Archives

December 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Doug Johnson <[log in to unmask]>
Reply To:
Doug Johnson <[log in to unmask]>
Date:
Wed, 27 Dec 2006 10:51:33 -0700
Content-Type:
text/plain
Parts/Attachments:
text/plain (150 lines)
> 
> Miles O'Neal wrote:
> > Doug,
> > 
> > |I am at a loss as to how to debug the following problem. I have a
> > |machine (Dual 800 MHz P3) running SL303 with all the latest available
> > |patches; kernel version 2.4.21-47.0.1.EL. The machine serves as a
> > |gateway between 2 networks, runs NIS and is an NFS server. I have also
> > |installed fallback-reboot. Aside from that it is a pretty vanilla
> > |machine. (I probably have more than 20 other machines that are identical
> > |to this machine except that they are not gateways/routers.)
> > |Unfortunately the machine is a bit unstable. It will run for between 1
> > |and 4 days after which all access to the machine stops except that it
> > |continues to pass packets between the 2 networks, i.e., routing works
> > |and it answers pings, but NFS mounts, ssh, etc all fail. I can not even
> > |connect to the fallback-reboot client on the machine which should be
> > |sitting in memory. The machine runs in init level 3 and generally does
> > |not have a monitor connected to it. But even if there is a monitor, once
> > |it goes into this "failed" state the monitor is blank and does not
> > |respond to key strokes. I also don't see anything unusual in the system
> > |log files; the machine just stops logging errors at the time that it
> > |fails. The failures generally (but not always) occur around 4AM, the
> > |time that cron jobs run. Things that I have already tried:
> > |
> > |	1) Replaced Ethernet cards
> > |	2) Replaced system drive and reinstalled OS (disk was showing
> > |           wear via smartd utilities).
> > |	3) Rebooted using non SMP kernel
> > |	4) Replaced power supply
> > |	5) All disks in system pass fsck tests
> > |
> > |I don't know enough about how the router/gateway works to understand how
> > |that can continue to function when everything else seems to have stopped
> > |working. Any thoughts would be greatly appreciated.
> > 
> > We have seen similar problems on compute and desktop
> > servers for several years, running various versions
> > of RH and now SL304.  It's gotten better but we still
> > see it.  The things we have found to cause this include:
> > 
> > - memory problems (run diags)
> > - mobo/cpu problems (can be hard to trace)
> > - process/OS weirdness
> > - just running out of memory
> > 
> > I would hope the latter wouldn't be a problem
> > with a router, but it may be worth remotely
> > connecting (rsh, ssh, whatever) and running
> > "top -d 1" on an xterm or something.  If it's
> > anything like the same problems we have, the
> > display will either just freeze, or the net
> > will drop the connection and you'll see the
> > last things top knew about.  This has helped
> > us track down running out of memory and rogue
> > processes.
> > 
> > -Miles
> 
> If it is somewhat regularly dying at 4:00 a.m., I would suspect 
> something in the cron jobs.  I'd do a
> 
> ls /etc/cron.*
> 
> and see what's happening.
> I'd then start removing things I either didn't really need, or that 
> might be the problem.
> 
> Looking at one of my SL3 machines I have
> /etc/cron.daily:
> 00-logwatch  logrotate        prelink  slocate.cron  yum.cron
> 0anacron     makewhatis.cron  rpm      tmpwatch
> 
> /etc/cron.weekly:
> 0anacron  makewhatis.cron
> 
> anacron - I wouldn't worry about.
> makewhatis - goes through your man pages setting up the -k function. 
> I'd take this out just cuz it really sucks up the CPU for the period of 
> time, and do your man page searches on a different machine.
> prelink - could be a problem. Possibly it's catching something bad.  But 
> I've never seen this happen before.
> rpm  - goes through your rpm's.  Possible, but I've never seen it do 
> anything bad before.
> tmpwatch - Also, I've never seen it do anything bad
> yum.cron - could possibly be sucking up all your memory.  Disable it 
> temporarily to see if it helps.
> slocate.cron - if it's turned on, and you have a lot of disk, or possibly 
> some network disk, this could do something.  Also, if you never use 
> "locate", then you don't need this.
> 
> This brings us to my two most likely suspects.
> *If* you have logging turned on for your router, your log files can get 
> pretty big.  The last two, logwatch, and logrotate, both are accessing 
> those log files, both could potentially be either sucking up too much 
> memory, or even filling up the disk temporarily.
> Now, you can safely turn off logwatch.  You just won't get your nightly 
> e-mail of what was going on.
> But, if you turn off logrotate ... make sure you have some way of not 
> letting your log files grow extra big.
> 
> Or, you might have something else in your cron jobs that I don't know 
> about.  I'd look.

Greetings,

I'd like to thank everyone for their suggestions. At this point I have:

	1) Trimmed /etc/cron.daily to:
            logrotate  prelink  rpm  tmpwatch  yum.cron

	2) Changed some lines in /etc/crontab to:

# 10 4 * * * root run-parts /etc/cron.daily
05 4 * * * root /etc/cron.daily/logrotate
08 4 * * * root /etc/cron.daily/prelink
10 4 * * * root /etc/cron.daily/rpm
12 4 * * * root /etc/cron.daily/tmpwatch
14 4 * * * root /etc/cron.daily/yum.cron

	3) I have remote windows open running top and df, with updates every
           5 minutes.

	4) I also wrote a monitoring script that checks the status of
           NFS and running processes. It logs the information to
	   /tmp and emails the information once per day at 8:00AM
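
As a rough illustration of steps 2 and 4, the following is a minimal sketch and not the script actually running on the gateway. The log path and the function names (`run_logged`, `snapshot`) are assumptions made up for this example. The idea is to write a timestamped line before and after each cron job and to flush it to disk with `sync`, so that the last entry written before a hang is still there after the reboot.

```shell
#!/bin/sh
# Hypothetical sketch: the log path and function names are assumptions,
# not the script actually running on the gateway.
LOG="${LOG:-/tmp/gateway-monitor.log}"

# Print the current time in a sortable format.
stamp() { date '+%Y-%m-%d %H:%M:%S'; }

# run_logged NAME COMMAND [ARGS...]
# Log a "start" line, run the command, then log an "end" line with its
# exit status. sync after each line so it reaches the disk.
run_logged() {
    name="$1"; shift
    echo "$(stamp) start $name" >> "$LOG"
    sync
    "$@"
    rc=$?
    echo "$(stamp) end $name rc=$rc" >> "$LOG"
    sync
    return "$rc"
}

# snapshot
# Append load average, free memory and swap, a process count, whether
# any nfsd threads exist, and /tmp disk usage, all read from /proc and df.
snapshot() {
    {
        echo "$(stamp) snapshot"
        echo "loadavg: $(cat /proc/loadavg)"
        grep -E '^(MemFree|SwapFree):' /proc/meminfo
        echo "processes: $(ls -d /proc/[0-9]* | wc -l)"
        if cat /proc/[0-9]*/stat 2>/dev/null | grep -q '(nfsd)'; then
            echo "nfsd: running"
        else
            echo "nfsd: not running"
        fi
        df -P /tmp | tail -n 1
    } >> "$LOG"
    sync
}
```

Each per-job line in /etc/crontab could then go through a small wrapper that sources these functions and calls, for example, `run_logged logrotate /etc/cron.daily/logrotate`. After a hang, a "start" line with no matching "end" line would point at the job that was running. `snapshot` could be called from cron every few minutes to get the same kind of record for memory and NFS.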

Some added information: the machine has 1 GB of RAM and ~4 GB of swap,
so I doubt that it is running out of memory; the memory passes the BIOS
memory check. I would expect faulty memory to result in a complete
machine crash. (Could be wrong on that one though.) I still don't see
how everything stops working except routing. I would have assumed that
routing requires a running kernel?

Unfortunately, this is a wait-and-see method of debugging. I don't want
to resort to replacing the CPU/MB yet, but perhaps this is next on the
list. Part of a complete solution is to stop using the machine as a
router and use a port on a real router/switch to isolate the subnet.

	Thank you for your suggestions,
	doug
		
---------------------------------------------------------------------------- 
   Doug Johnson                    email: [log in to unmask]        
   B390, Duane Physics             (303)-492-4506 Office                     
   Boulder, CO 80309               (303)-492-5119 FAX                        
                                   http://www.aaccchildren.org               
   music is the greatest of the arts for me because it cuts through 
   everything, needs no aids. it is. it simply is.
----------------------------------------------------------------------------
