LISTSERV - SCIENTIFIC-LINUX-USERS Archives

December 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:	Re: Problems with a Linux router/gateway SL303
From:	Troy Dawson <[log in to unmask]>
Reply To:	Troy Dawson <[log in to unmask]>
Date:	Wed, 27 Dec 2006 08:28:10 -0600
Content-Type:	text/plain
Parts/Attachments:	text/plain (110 lines)

Miles O'Neal wrote:
> Doug,
> 
> |I am at a loss as to how to debug the following problem. I have a
> |machine (Dual 800 MHz P3) running SL303 with all the latest available
> |patches; kernel version 2.4.21-47.0.1.EL. The machine serves as a
> |gateway between 2 networks, runs NIS and is an NFS server. I have also
> |installed fallback-reboot. Aside from that it is a pretty vanilla
> |machine. (I probably have more than 20 other machines that are identical
> |to this machine except that they are not gateways/routers.)
> |Unfortunately the machine is a bit unstable. It will run for between 1
> |and 4 days after which all access to the machine stops except that it
> |continues to pass packets between the 2 networks, i.e., routing works
> |and it answers pings, but NFS mounts, ssh, etc all fail. I can not even
> |connect to the fallback-reboot client on the machine which should be
> |sitting in memory. The machine runs in init level 3 and generally does
> |not have a monitor connected to it. But even if there is a monitor, once
> |it goes into this "failed" state the monitor is blank and does not
> |respond to key strokes. I also don't see anything unusual in the system
> |log files; the machine just stops logging errors at the time that it
> |fails. The failures generally (but not always) occur around 4AM, the
> |time that cron jobs run. Things that I have already tried:
> |
> |	1) Replaced Ethernet cards
> |	2) Replaced system drive and reinstalled OS (disk was showing
> |           wear via smartd utilities).
> |	3) Rebooted using non SMP kernel
> |	4) Replaced power supply
> |	5) All disks in system pass fsck tests
> |
> |I don't know enough about how the router/gateway works to understand how
> |that can continue to function when everything else seems to have stopped
> |working. Any thoughts would be greatly appreciated.
> 
> We have seen similar problems on compute and desktop
> servers for several years, running various versions
> of RH and now SL304.  It's gotten better but we still
> see it.  The things we have found to cause this include:
> 
> - memory problems (run diags)
> - mobo/cpu problems (can be hard to trace)
> - process/OS weirdness
> - just running out of memory
> 
> I would hope the last one wouldn't be a problem
> with a router, but it may be worth remotely
> connecting (rsh, ssh, whatever) and running
> "top -d 1" on an xterm or something.  If it's
> anything like the same problems we have, the
> display will either just freeze, or the net
> will drop the connection and you'll see the
> last things top knew about.  This has helped
> us track down running out of memory and rogue
> processes.
> 
> -Miles

If it is somewhat regularly dying at 4:00 a.m., I would suspect 
something in the cron jobs.  I'd do a

ls /etc/cron.*

and see what's happening.
I'd then start removing things I either didn't really need, or that 
might be the problem.
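
One way to take a job out of the nightly run without deleting it is to
clear its execute bit. This is only a sketch: it assumes the cron.*
directories are run through run-parts and that your run-parts skips
files that are not executable (check the script on your own system).
The function names disable_job and enable_job are made up here for
illustration.

```shell
# Hypothetical helpers (not part of any package): toggle the execute
# bit on a cron script so it can be switched off and back on.
# Assumption: run-parts only runs files that are executable.

disable_job() {
    # $1 = path to the script, e.g. /etc/cron.daily/makewhatis.cron
    chmod a-x "$1" && echo "disabled: $1"
}

enable_job() {
    # Restores execute permission for everyone (the usual 755 style).
    chmod a+x "$1" && echo "enabled: $1"
}
```

For example, "disable_job /etc/cron.daily/yum.cron" as root, wait a few
nights, then "enable_job" it again if the crashes continue.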

Looking at one of my SL3 machines I have
/etc/cron.daily:
00-logwatch  logrotate        prelink  slocate.cron  yum.cron
0anacron     makewhatis.cron  rpm      tmpwatch

/etc/cron.weekly:
0anacron  makewhatis.cron

anacron - I wouldn't worry about.
makewhatis - goes through your man pages building the database for "man -k".
I'd take this out just because it really eats up the CPU while it runs, and
do your man page searches on a different machine.
prelink - could be a problem. Possibly it's catching something bad.  But 
I've never seen this happen before.
rpm  - goes through your rpm's.  Possible, but I've never seen it do 
anything bad before.
tmpwatch - Also, I've never seen it do anything bad
yum.cron - could possibly be using up all your memory.  Disable it
temporarily to see if it helps.
slocate.cron - if it's turned on, and you have a lot of disk, or possibly
some network disk, this could do something.  Also, if you never use 
"locate", then you don't need this.

This brings us to my two most likely suspects.
*If* you have logging turned on for your router, your log files can get 
pretty big.  The remaining two, logwatch and logrotate, both read those
log files, and either one could be using too much memory or even
filling up the disk temporarily.
Now, you can safely turn off logwatch.  You just won't get your nightly 
e-mail of what was going on.
But, if you turn off logrotate ... make sure you have some other way of
keeping your log files from growing too big.
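
To see whether the logs really are big enough to matter, something
along these lines could be used. It is a sketch only: big_logs is a
made-up helper name, the 100 MB default is an arbitrary threshold, and
the "k" size suffix assumes GNU find.

```shell
# Hypothetical helper: list regular files under a directory that are
# larger than a given size in kilobytes.
big_logs() {
    dir=${1:-/var/log}        # directory to scan
    min_kb=${2:-102400}       # threshold in KB (default 100 MB, arbitrary)
    # GNU find: "+Nk" matches files bigger than N kilobyte blocks
    find "$dir" -type f -size +"${min_kb}"k -exec ls -l {} \;
}
```

Running "big_logs /var/log 51200" would list anything over 50 MB, and
"df -h /var/log" shows how much room is left on that filesystem.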

Or, you might have something else in your cron jobs that I don't know 
about.  I'd look.
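
Jobs can also live outside the cron.daily and cron.weekly directories.
A rough sketch for looking in the other usual places follows. The
paths /etc/crontab, /etc/cron.d and /var/spool/cron are the common
Red Hat style locations and are an assumption about your setup;
list_cron_sources is a made-up name, and its optional argument is just
a path prefix for trying it against a copy of the files.

```shell
# Hypothetical helper: show the system crontab, the cron.d drop-in
# directory and the per-user crontab spool, whichever of them exist.
list_cron_sources() {
    root=${1:-}    # optional prefix; leave empty for the live system
    for p in "$root/etc/crontab" "$root/etc/cron.d" "$root/var/spool/cron"; do
        if [ -e "$p" ]; then
            echo "== $p"
            if [ -d "$p" ]; then
                ls -l "$p"      # directory: list the files in it
            else
                cat "$p"        # plain file: print its contents
            fi
        fi
    done
}
```

Run "list_cron_sources" as root so the per-user spool is readable.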

Troy

-- 
__________________________________________________
Troy Dawson  [log in to unmask]  (630)840-6468
Fermilab  ComputingDivision/CSS  CSI Group
__________________________________________________