> Miles O'Neal wrote:
> > Doug,
> >
> > |I am at a loss as to how to debug the following problem. I have a
> > |machine (Dual 800 MHz P3) running SL303 with all the latest available
> > |patches; kernel version 2.4.21-47.0.1.EL. The machine serves as a
> > |gateway between 2 networks, runs NIS and is an NFS server. I have also
> > |installed fallback-reboot. Aside from that it is a pretty vanilla
> > |machine. (I probably have more than 20 other machines that are identical
> > |to this machine except that they are not gateways/routers.)
> > |Unfortunately the machine is a bit unstable. It will run for between 1
> > |and 4 days after which all access to the machine stops except that it
> > |continues to pass packets between the 2 networks, i.e., routing works
> > |and it answers pings, but NFS mounts, ssh, etc all fail. I can not even
> > |connect to the fallback-reboot client on the machine which should be
> > |sitting in memory. The machine runs in init level 3 and generally does
> > |not have a monitor connected to it. But even if there is a monitor, once
> > |it goes into this "failed" state the monitor is blank and does not
> > |respond to key strokes. I also don't see anything unusual in the system
> > |log files; the machine just stops logging errors at the time that it
> > |fails. The failures generally (but not always) occur around 4AM, the
> > |time that cron jobs run. Things that I have already tried:
> > |
> > | 1) Replaced Ethernet cards
> > | 2) Replaced system drive and reinstalled OS (disk was showing
> > | wear via smartd utilities).
> > | 3) Rebooted using non SMP kernel
> > | 4) Replaced power supply
> > | 5) All disks in system pass fsck tests
> > |
> > |I don't know enough about how the router/gateway works to understand how
> > |that can continue to function when everything else seems to have stopped
> > |working. Any thoughts would be greatly appreciated.
> >
> > We have seen similar problems on compute and desktop
> > servers for several years, running various versions
> > of RH and now SL304. It's gotten better but we still
> > see it. The things we have found to cause this include:
> >
> > - memory problems (run diags)
> > - mobo/cpu problems (can be hard to trace)
> > - process/OS weirdness
> > - just running out of memory
> >
> > I would hope the last one wouldn't be a problem
> > with a router, but it may be worth remotely
> > connecting (rsh, ssh, whatever) and running
> > "top -d 1" on an xterm or something. If it's
> > anything like the same problems we have, the
> > display will either just freeze, or the net
> > will drop the connection and you'll see the
> > last things top knew about. This has helped
> > us track down running out of memory and rogue
> > processes.
> >
> > -Miles
>
> If it is somewhat regularly dying at 4:00 a.m., I would suspect
> something in the cron jobs. I'd do a
>
> ls /etc/cron.*
>
> and see what's happening.
> I'd then start removing things I either didn't really need, or that
> might be the problem.
>
> Looking at one of my SL3 machines I have
> /etc/cron.daily:
> 00-logwatch logrotate prelink slocate.cron yum.cron
> 0anacron makewhatis.cron rpm tmpwatch
>
> /etc/cron.weekly:
> 0anacron makewhatis.cron
>
> anacron - I wouldn't worry about.
> makewhatis - goes through your man pages setting up the -k function.
> I'd take this out just cuz it really sucks up the CPU for the period of
> time, and do your man page searches on a different machine.
> prelink - could be a problem. Possibly it's catching something bad. But
> I've never seen this happen before.
> rpm - goes through your rpm's. Possible, but I've never seen it do
> anything bad before.
> tmpwatch - Also, I've never seen it do anything bad
> yum.cron - could possibly be sucking up all your memory. Disable it
> temporarily to see if it helps.
> slocate.cron - if it's turned on, and you have a lot of disk, or possibly
> some network disk, this could do something. Also, if you never use
> "locate", then you don't need this.
>
> This brings us to my two most likely suspects.
> *If* you have logging turned on for your router, your log files can get
> pretty big. The last two, logwatch and logrotate, both read those log
> files, and either one could be sucking up too much memory or even
> filling up the disk temporarily.
> Now, you can safely turn off logwatch. You just won't get your nightly
> e-mail of what was going on.
> But if you turn off logrotate, make sure you have some other way of
> keeping your log files from growing too big.
>
> Or, you might have something else in your cron jobs that I don't know
> about. I'd look.

Greetings,

I'd like to thank everyone for their suggestions. At this point I have:

 1) Trimmed /etc/cron.daily to:
      logrotate prelink rpm tmpwatch yum.cron
 2) Changed some lines in /etc/crontab to:
      # 10 4 * * * root run-parts /etc/cron.daily
      05 4 * * * root /etc/cron.daily/logrotate
      08 4 * * * root /etc/cron.daily/prelink
      10 4 * * * root /etc/cron.daily/rpm
      12 4 * * * root /etc/cron.daily/tmpwatch
      14 4 * * * root /etc/cron.daily/yum.cron
 3) Opened remote windows running top and df, each updating every
    5 minutes.
 4) Written a monitoring script that checks the status of NFS and
    running processes. It logs the information to /tmp and emails
    it once per day at 8:00 AM.
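
A minimal sketch of how items 2 and 4 could leave a trail on disk, offered as an illustration only (it is not the actual monitoring script). The function names run_logged and snapshot, the log paths, and the process names in the usage note are all hypothetical. The idea is to write a line before and after each cron job, plus periodic memory/disk/process snapshots, so that the last entry shows what the machine was doing when it stopped.

```shell
#!/bin/sh
# Illustrative sketch only; function names and log paths are hypothetical.

stamp() {
    date '+%Y-%m-%d %H:%M:%S'
}

# run_logged LOGFILE COMMAND [ARGS...]
# Write START/END lines around a command so the log shows which job
# was in progress if the machine stops responding partway through.
run_logged() {
    rl_log=$1
    shift
    echo "$(stamp) START $*" >> "$rl_log"
    sync                      # push the START line to disk before the job runs
    "$@"
    rl_rc=$?
    echo "$(stamp) END rc=$rl_rc $*" >> "$rl_log"
    return "$rl_rc"
}

# snapshot LOGFILE [PROCESS_NAME...]
# Append one timestamped record of memory, swap, disk usage, and
# whether each named process is present.
snapshot() {
    snap_log=$1
    shift
    {
        echo "=== $(stamp) ==="
        grep -E '^(MemTotal|MemFree|SwapTotal|SwapFree):' /proc/meminfo
        df -k
        for name in "$@"; do
            if ps -e -o comm= 2>/dev/null | grep -qxF "$name"; then
                echo "process $name: running"
            else
                echo "process $name: MISSING"
            fi
        done
    } >> "$snap_log" 2>&1
    sync                      # make sure the record survives a hang
}
```

From a small wrapper script, the calls might look like `run_logged /var/log/cronwatch.log /etc/cron.daily/logrotate` and `snapshot /var/log/gwmon.log nfsd rpc.mountd sshd`. Both paths and the process list are assumptions; adjust them to whatever the machine actually runs.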

Some additional information: the machine has 1 GB of RAM and ~4 GB of
swap, so I doubt that it is running out of memory. The memory passes the
BIOS memory check, and I would expect faulty memory to result in a
complete machine crash (though I could be wrong on that one). I still
don't see how everything stops working except routing; I would have
assumed that routing requires a running kernel.
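
On Linux, packet forwarding and ICMP echo replies are handled inside the kernel, so they can keep working even when user-space daemons (sshd, rpc.mountd, syslogd) are getting nothing done. One way to check whether the failed state fits that pattern is to probe the gateway from a second machine and record which layer stops answering first. The sketch below is an illustration under assumptions: classify and probe are hypothetical names, the ping options -c and -w are those of Linux iputils, and the ssh check assumes key-based login to the gateway is already set up.

```shell
#!/bin/sh
# Illustrative sketch only; function names are hypothetical.

# classify PING_RC LOGIN_RC   (0 means that check succeeded)
classify() {
    if [ "$1" -ne 0 ]; then
        echo "unreachable"
    elif [ "$2" -ne 0 ]; then
        echo "kernel-answers-userspace-does-not"
    else
        echo "healthy"
    fi
}

# probe HOST
# Print one timestamped line describing how far up the stack HOST responds.
probe() {
    host=$1
    # Answered by the kernel's ICMP code, no user-space process involved.
    ping -c 1 -w 5 "$host" > /dev/null 2>&1
    ping_rc=$?
    # Needs a working sshd and a shell. BatchMode prevents password prompts.
    # ConnectTimeout bounds the TCP connect; a server that accepts the
    # connection but never completes the handshake may still hold the
    # command for longer.
    ssh -o BatchMode=yes -o ConnectTimeout=5 "$host" true > /dev/null 2>&1
    login_rc=$?
    echo "$(date '+%Y-%m-%d %H:%M:%S') $host $(classify "$ping_rc" "$login_rc")"
}
```

Run in a loop from another host, for example `while :; do probe gateway.example >> probe.log; sleep 300; done` (the hostname is a placeholder), this would show roughly when the gateway moves from "healthy" to the state where only the kernel answers.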

Unfortunately, this is a wait-and-see method of debugging. I don't want
to resort to replacing the CPU/motherboard yet, but perhaps that is next
on the list. Part of a complete solution is to stop using the machine as
a router and instead use a port on a real router/switch to isolate the
subnet.

Thank you for your suggestions,
doug
----------------------------------------------------------------------------
Doug Johnson
B390, Duane Physics (303)-492-4506 Office
Boulder, CO 80309 (303)-492-5119 FAX
http://www.aaccchildren.org
music is the greatest of the arts for me because it cuts through
everything, needs no aids. it is. it simply is.
----------------------------------------------------------------------------