SCIENTIFIC-LINUX-USERS Archives

October 2011

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Vladimir Mosgalin <[log in to unmask]>
Reply To:
Vladimir Mosgalin <[log in to unmask]>
Date:
Thu, 6 Oct 2011 05:39:49 +0400
Hi James Kelly!

 On 2011.10.05 at 22:31:18 +0100, James Kelly wrote next:

> I lost contact with my Scientific Linux 6.1 KVM host earlier today.
> 
> The machine is headless and I don't have any IPMI stuff on the machine so I
> had to plug a monitor into it. However, there was no life from the monitor
> and I pressed the reset button.
> 
> It seems to me that the networking died. The machine is booted first thing
> every morning (so the 9:00am start was missed by two minutes!) and the
> networking error seems to have occurred about 27 minutes after
> the initial boot.

It's unclear to me whether the tg3 driver errors in the second half of
the message are the cause or a symptom of this situation. If they are
the cause, you might be interested in a recent update that Red Hat has
released:
http://rhn.redhat.com/errata/RHEA-2011-1348.html

Try installing kmod-tg3 from the sl-fastbugs repo and rebooting; that
should make your system use a newer version of the network driver
mentioned in these messages. I have no idea if it will really help, but
it probably won't hurt to try.
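
After the reboot you may want to confirm which driver version is
actually loaded. The following is only a minimal sketch: it assumes the
tg3 module exposes a version attribute under /sys/module, and the
function name and its parameters are my own, not part of any tool.

```python
from pathlib import Path


def loaded_module_version(module="tg3", sysfs_root="/sys/module"):
    """Return the version string of a loaded kernel module, or None.

    Assumption: the module publishes /sys/module/<module>/version.
    None is returned if the module is not loaded or has no version file.
    """
    version_file = Path(sysfs_root) / module / "version"
    try:
        return version_file.read_text().strip()
    except OSError:
        return None


# Example use on the affected host (run before and after the update):
#   print(loaded_module_version("tg3"))
```

Comparing the value before and after installing kmod-tg3 shows whether
the newer driver is really the one in use.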


A frequent cause of similar problems with network drivers is interrupt
setup. Network cards generate lots of interrupts under load and use
various advanced features to ease that a bit, and I have seen
situations where kernel panics and warnings appeared due to the
hardware interrupt setup or buggy interrupt code in the network driver
under load. Just in case, you might want to find the eth entries in
/proc/interrupts to make sure that the card uses MSI-X (shown as
PCI-MSI-edge or PCI-MSI-X) and not IO-APIC-level or something like
that. However, I don't think this kind of problem should arise on such
hardware.
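
The /proc/interrupts check can be sketched roughly as follows. This is
an illustration only: it assumes the usual layout of that file (IRQ
number, one counter per CPU, interrupt type, then device names), and the
function name and the "eth" prefix default are my own choices.

```python
import re


def nic_interrupt_modes(interrupts_text, name_prefix="eth"):
    """Map each NIC-related device name to the interrupt type of its IRQ line.

    Assumed line layout: "<irq>: <count per CPU>... <type> <device>[, <device>...]"
    """
    modes = {}
    for line in interrupts_text.splitlines():
        fields = line.split()
        # Only numbered IRQ lines matter; skip the CPU header and
        # summary rows such as NMI: or LOC:.
        if not fields or not re.fullmatch(r"\d+:", fields[0]):
            continue
        rest = fields[1:]
        # Skip the per-CPU counters (plain integers).
        i = 0
        while i < len(rest) and rest[i].isdigit():
            i += 1
        if i >= len(rest):
            continue
        irq_type = rest[i]
        # Shared IRQs list several devices separated by ", ".
        for device in rest[i + 1:]:
            device = device.strip(",")
            if device.startswith(name_prefix):
                modes[device] = irq_type
    return modes


# Example use on the affected host:
#   with open("/proc/interrupts") as f:
#       print(nic_interrupt_modes(f.read()))
```

An entry whose type starts with PCI-MSI is what you want to see; an
IO-APIC type for the NIC would be worth a closer look.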

In the worst case, if these problems keep appearing, consider
installing an add-in Intel-based network card; these work most reliably
under Linux in my experience (and that of some other people). It's kind
of sad, but Marvell, Broadcom and NVIDIA products are a bit of
second-class citizens and don't always work flawlessly under load. It
might be more of a driver problem, who knows, but that's just my
experience from past years.
(Also, I'd definitely stay away from NICs based on other manufacturers'
chips; apart from these four, nothing else should probably be allowed in
the server market. YMMV.)

These messages could also indicate something other than network
problems, but people with deeper kernel knowledge than me should answer
that. All I can say is that the combination of NIC, network driver and
interrupt settings *can* be a real source of problems, up to kernel
panics under some conditions; it's not that rare at all to find out
that such problems are caused by a network driver.

-- 

Vladimir
