SCIENTIFIC-LINUX-USERS Archives

December 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Doug Johnson <[log in to unmask]>
Date:
Tue, 26 Dec 2006 09:18:17 -0700
Greetings,

I am at a loss as to how to debug the following problem. I have a
machine (dual 800 MHz P3) running SL303 with all the latest available
patches; the kernel version is 2.4.21-47.0.1.EL. The machine serves as a
gateway between two networks, runs NIS, and is an NFS server. I have also
installed fallback-reboot. Aside from that it is a pretty vanilla
machine. (I probably have more than 20 other machines that are identical
to this one except that they are not gateways/routers.)
Unfortunately the machine is unstable. It runs for between 1 and 4
days, after which all access to the machine stops, except that it
continues to pass packets between the two networks: routing works and
it answers pings, but NFS mounts, ssh, etc. all fail. I cannot even
connect to the fallback-reboot client on the machine, which should be
sitting in memory. The machine runs at init level 3 and generally does
not have a monitor connected to it. But even when a monitor is
attached, once the machine goes into this "failed" state the screen is
blank and the machine does not respond to keystrokes. I also don't see
anything unusual in the system log files; the machine just stops
logging at the time that it fails. The failures generally (but not
always) occur around 4 AM, the time that the cron jobs run. Things that
I have already tried:

	1) Replaced Ethernet cards
	2) Replaced system drive and reinstalled OS (disk was showing
           wear via smartd utilities).
	3) Rebooted using non-SMP kernel
	4) Replaced power supply
	5) Verified that all disks in system pass fsck tests

I don't know enough about how the router/gateway works to understand how
routing can continue to function when everything else seems to have
stopped working. Any thoughts would be greatly appreciated.
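
For reference, the failed state described above (pings answered, but
no service responding) could be detected from a second machine with
something like the following. This is only a minimal sketch, assuming
Python 3 and the Linux iputils ping on the monitoring host; the host
name, the function names, and the state labels are hypothetical and
are not part of the setup described here. It treats a missing SSH
identification banner as "services not responding", because a TCP
connect alone can succeed even when sshd itself is no longer running
its accept loop.

```python
import socket
import subprocess

# Hypothetical host name for illustration; not taken from the report.
GATEWAY = "gateway.example.org"


def ping_ok(host, timeout_s=2):
    """Return True if a single ICMP echo request is answered."""
    result = subprocess.run(
        ["ping", "-c", "1", "-W", str(timeout_s), host],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode == 0


def ssh_banner_ok(host, port=22, timeout_s=5):
    """Return True if the server sends an SSH identification string."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            banner = sock.recv(64)
    except OSError:
        # Covers refused connections, timeouts, and name lookup failures.
        return False
    return banner.startswith(b"SSH-")


def classify(ping_answered, ssh_answered):
    """Map the two probe results onto the states described in the report."""
    if ping_answered and ssh_answered:
        return "healthy"
    if ping_answered:
        # Matches the reported symptom: pings work, services do not.
        return "kernel-alive-services-dead"
    return "unreachable"


def check_once(host=GATEWAY):
    """Run both probes once and return the resulting state label."""
    return classify(ping_ok(host), ssh_banner_ok(host))
```

Calling check_once() every few minutes from cron on another host and
logging the result with a timestamp would at least narrow down when
the machine enters the failed state.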

	Thanks,
	doug

---------------------------------------------------------------------------- 
   Doug Johnson                    email: [log in to unmask]        
   B390, Duane Physics             (303)-492-4506 Office                     
   Boulder, CO 80309               (303)-492-5119 FAX                        
                                   http://www.aaccchildren.org               
   If you smile at me, you know I will understand. 
   'Cause that's something everybody, everywhere does in the same language. 
----------------------------------------------------------------------------
