SCIENTIFIC-LINUX-USERS Archives

December 2006

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Stephen John Smoogen <[log in to unmask]>
Reply To:
Stephen John Smoogen <[log in to unmask]>
Date:
Wed, 27 Dec 2006 13:25:12 -0700
On 12/26/06, Doug Johnson <[log in to unmask]> wrote:
> Greetings,
>
> I am at a loss as to how to debug the following problem. I have a
> machine (Dual 800 MHz P3) running SL303 with all the latest available
> patches; kernel version 2.4.21-47.0.1.EL. The machine serves as a
> gateway between 2 networks, runs NIS and is an NFS server. I have also
> installed fallback-reboot. Aside from that it is a pretty vanilla
> machine. (I probably have more than 20 other machines that are identical
> to this machine except that they are not gateways/routers.)
> Unfortunately the machine is a bit unstable. It will run for between 1
> and 4 days after which all access to the machine stops except that it
> continues to pass packets between the 2 networks, i.e., routing works
> and it answers pings, but NFS mounts, ssh, etc all fail. I can not even
> connect to the fallback-reboot client on the machine which should be
> sitting in memory. The machine runs in init level 3 and generally does
> not have a monitor connected to it. But even if there is a monitor, once
> it goes into this "failed" state the monitor is blank and does not
> respond to key strokes. I also don't see anything unusual in the system
> log files; the machine just stops logging errors at the time that it
> fails. The failures generally (but not always) occur around 4AM, the
> time that cron jobs run. Things that I have already tried:
>
>         1) Replaced Ethernet cards
>         2) Replaced system drive and reinstalled OS (disk was showing
>            wear via smartd utilities).
>         3) Rebooted using non SMP kernel
>         4) Replaced power supply
>         5) All disks in system pass fsck tests
>

OK, this sounds like a bad mobo or a BIOS setting that is putting
the box into a 'sleep' state. I don't know what kind of keyboard your
system has, but I have seen this on a box with a USB keyboard/mouse set.


> I don't know enough about how the router/gateway works to understand how
> that can continue to function when everything else seems to have stopped
> working. Any thoughts would be greatly appreciated.
>

Well, in the old 2.2 days, the kernel would keep passing data over
Ethernet cards even if everything else had stopped. In the 2.4 kernel
that is no longer the case. If the packets are still being passed, then
the system isn't completely crashed, but it may be in an indeterminate
crash state. The keyboard/mouse behavior sounds like a bad APMD/ACPI
mode for the motherboard. I would try getting a BIOS update for the
mobo, and then I would look at booting with apmd and other daemons
turned off. Then I would try booting the kernel with acpi=off.
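
A rough sketch of those last two steps, assuming a Red Hat-style
layout (chkconfig/service, GRUB reading /boot/grub/grub.conf). The
kernel line in the comments is only an illustration; keep whatever is
already on your own kernel line and just append acpi=off. The helper
name has_boot_arg is made up for this example. It only checks whether
an argument appears on a kernel command line, so you can confirm after
the reboot that the option actually took effect.

```shell
#!/bin/sh
# Step 1 (run as root): stop apmd now and keep it from starting at boot.
#   service apmd stop
#   chkconfig apmd off
#
# Step 2: append acpi=off to the kernel line in /boot/grub/grub.conf,
# for example (illustrative; your existing arguments will differ):
#   kernel /vmlinuz-<version> ro root=<your root device> acpi=off
#
# Step 3: after rebooting, confirm the argument reached the kernel.

# has_boot_arg ARG [CMDLINE]
# Returns 0 if ARG appears as a whole word in CMDLINE.
# CMDLINE defaults to the contents of /proc/cmdline.
has_boot_arg() {
    arg=$1
    cmdline=${2-$(cat /proc/cmdline 2>/dev/null)}
    # Pad with spaces so only whole-word matches count.
    case " $cmdline " in
        *" $arg "*) return 0 ;;
    esac
    return 1
}

# Example use after the reboot:
#   has_boot_arg acpi=off && echo "ACPI is disabled for this boot"
```

If the hang still shows up with apmd off and acpi=off confirmed on the
command line, power management becomes a less likely culprit.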

If the problem still occurs but had not occurred previously, I would
suspect that the motherboard is having issues. One of the problems I
had in the past was a system with a bad power supply that affected
the mobo, so that even once we replaced the power supply the mobo still
acted up because it had been damaged.


-- 
Stephen J Smoogen. -- CSIRT/Linux System Administrator
How far that little candle throws his beams! So shines a good deed
in a naughty world. = Shakespeare. "The Merchant of Venice"
