SCIENTIFIC-LINUX-USERS Archives

December 2013

SCIENTIFIC-LINUX-USERS@LISTSERV.FNAL.GOV

Subject:
From:
Paul Robert Marino <[log in to unmask]>
Reply To:
Paul Robert Marino <[log in to unmask]>
Date:
Thu, 26 Dec 2013 13:00:37 -0500
Content-Type:
text/plain
Parts/Attachments:
text/plain (134 lines)
This was caused by an internal hardware watchdog built into Intel
network cards. It detected an error and disabled the interface at the
hardware level until you rebooted and the card's memory was cleared. It
looks like the card may have lost clock sync with its neighbor, which
is odd: that basically means it wasn't sending out the 5 volt signal
used for frequency sync. I've worked with Intel cards for probably
over a decade and I've never seen this exact error before.

Try rolling back to the previous kernel version; however, this looks
more like a physical hardware issue.
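
If it helps to check whether this has happened before (for example in
older rotated logs, before and after the yum update), here is a minimal
Python sketch that scans log text for the watchdog message. The function
name find_watchdog_events and the SAMPLE text are my own illustration,
with SAMPLE copied from the dmesg excerpt quoted below. It assumes each
log entry sits on a single line, as it does in the real
/var/log/messages; the copy quoted in this mail is wrapped by the mail
client and would need to be unwrapped first.

```python
import re

# Matches the kernel's transmit-timeout warning as it appears in dmesg, e.g.
# "NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out"
WATCHDOG_RE = re.compile(
    r"NETDEV WATCHDOG: (?P<iface>\S+) \((?P<driver>[^)]+)\): "
    r"transmit queue (?P<queue>\d+) timed out"
)


def find_watchdog_events(log_text):
    """Return a list of (iface, driver, queue) tuples found in log_text.

    Hypothetical helper for illustration; it assumes one log entry per line.
    """
    events = []
    for line in log_text.splitlines():
        match = WATCHDOG_RE.search(line)
        if match:
            events.append(
                (match.group("iface"), match.group("driver"), int(match.group("queue")))
            )
    return events


# Sample input taken from the dmesg portion of the original report.
SAMPLE = """\
eth0: no IPv6 routers present
NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
e1000e 0000:06:00.0: eth0: Reset adapter unexpectedly
e1000e 0000:06:00.0: eth0: Timesync Tx Control register not set as expected
"""

if __name__ == "__main__":
    for iface, driver, queue in find_watchdog_events(SAMPLE):
        print(f"{iface} ({driver}): tx queue {queue} timed out")
```

To run it against a real log, read the file contents and pass them to
find_watchdog_events instead of SAMPLE. If the timeouts only show up
after the kernel update, that would point toward the driver rather than
the hardware.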

On Thu, Dec 26, 2013 at 12:40 PM, Galante, Nicola
<[log in to unmask]> wrote:
> Greetings,
>
> I administer a web server for my institution and last night we had a
> problem.  The server is a 1U Intel Xeon E5620 machine.  The on-board network
> interface is an Intel 82574L Gigabit Controller.  Scientific Linux 6.4,
> kernel 2.6.32-431.1.2.el6.x86_64.  At some point last night the network
> interface stopped working giving a backtrace on dev_watchdog.  I could not
> restart the service network, it complained that the interface eth0 was not
> available.  I tried to reconfigure it with NetworkManager, unsuccessfully.
> A full system reboot fixed the problem, although I couldn't identify the
> problem.  I do not know if this matters, but this problem never occurred
> before the last yum update.  Below is the portion of /var/log/messages
> that relates to the problem:
>
> =================================================================
> Dec 25 20:01:52 veritasm xinetd[1966]: EXIT: nrpe status=0 pid=20943
> duration=0(sec)
> Dec 25 20:02:21 veritasm xinetd[1966]: START: nrpe pid=20947
> from=::ffff:199.104.151.131
> Dec 25 20:02:21 veritasm xinetd[1966]: EXIT: nrpe status=0 pid=20947
> duration=0(sec)
> Dec 26 02:18:37 veritasm kernel: ------------[ cut here ]------------
> Dec 26 02:18:37 veritasm kernel: WARNING: at net/sched/sch_generic.c:261
> dev_watchdog+0x26b/0x280() (Not tainted)
> Dec 26 02:18:37 veritasm kernel: Hardware name: X8DTL
> Dec 26 02:18:37 veritasm kernel: NETDEV WATCHDOG: eth0 (e1000e): transmit
> queue 0 timed out
> Dec 26 02:18:37 veritasm kernel: Modules linked in: autofs4 8021q sunrpc
> garp stp llc cpufreq_ondemand acpi_cpufreq freq_table mperf ipt_REJECT
> nf_conntrack_ipv4 nf_defrag_ipv4 iptable_filter ip_tables ip6t_REJECT
> nf_conntrack_ipv6 nf_defrag_ipv6 xt_state nf_conntrack ip6table_filter
> ip6_tables ipv6 microcode iTCO_wdt iTCO_vendor_support sg i2c_i801 i2c_core
> lpc_ich mfd_core e1000e ptp pps_core ioatdma dca i7core_edac edac_core
> shpchp ext4 jbd2 mbcache raid1 sr_mod cdrom sd_mod crc_t10dif pata_acpi
> ata_generic ata_piix dm_mirror dm_region_hash dm_log dm_mod [last unloaded:
> scsi_wait_scan]
> Dec 26 02:18:37 veritasm kernel: Pid: 130, comm: kipmi0 Not tainted
> 2.6.32-431.1.2.el6.x86_64 #1
> Dec 26 02:18:37 veritasm kernel: Call Trace:
> Dec 26 02:18:37 veritasm kernel: <IRQ>  [<ffffffff81071e27>] ?
> warn_slowpath_common+0x87/0xc0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff81071f16>] ?
> warn_slowpath_fmt+0x46/0x50
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8147b75b>] ?
> dev_watchdog+0x26b/0x280
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8105dd5c>] ?
> scheduler_tick+0xcc/0x260
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8147b4f0>] ?
> dev_watchdog+0x0/0x280
> Dec 26 02:18:37 veritasm kernel: [<ffffffff81084b07>] ?
> run_timer_softirq+0x197/0x340
> Dec 26 02:18:37 veritasm kernel: [<ffffffff810ac905>] ?
> tick_dev_program_event+0x65/0xc0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8107a8e1>] ?
> __do_softirq+0xc1/0x1e0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff810ac9da>] ?
> tick_program_event+0x2a/0x30
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8100c30c>] ?
> call_softirq+0x1c/0x30
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8100fa75>] ? do_softirq+0x65/0xa0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8107a795>] ? irq_exit+0x85/0x90
> Dec 26 02:18:37 veritasm kernel: [<ffffffff815310ba>] ?
> smp_apic_timer_interrupt+0x4a/0x60
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8100bb93>] ?
> apic_timer_interrupt+0x13/0x20
> Dec 26 02:18:37 veritasm kernel: <EOI>  [<ffffffff8152a367>] ?
> _spin_unlock_irqrestore+0x17/0x20
> Dec 26 02:18:37 veritasm kernel: [<ffffffff812e7790>] ?
> ipmi_thread+0x70/0x1c0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff812e7720>] ?
> ipmi_thread+0x0/0x1c0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8109af06>] ? kthread+0x96/0xa0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8100c20a>] ? child_rip+0xa/0x20
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8109ae70>] ? kthread+0x0/0xa0
> Dec 26 02:18:37 veritasm kernel: [<ffffffff8100c200>] ? child_rip+0x0/0x20
> Dec 26 02:18:37 veritasm kernel: ---[ end trace fc057a7fca6eff49 ]---
> Dec 26 02:18:37 veritasm kernel: e1000e 0000:06:00.0: eth0: Reset adapter
> unexpectedly
> Dec 26 02:18:37 veritasm NetworkManager[1724]: <info> (eth0): carrier now
> OFF (device state 8, deferring action for 4 seconds)
> Dec 26 02:18:38 veritasm kernel: e1000e 0000:06:00.0: eth0: Timesync Tx
> Control register not set as expected
> Dec 26 02:18:41 veritasm NetworkManager[1724]: <info> (eth0): device state
> change: 8 -> 2 (reason 40)
> Dec 26 02:18:41 veritasm NetworkManager[1724]: <info> (eth0): deactivating
> device (reason: 40).
> Dec 26 02:18:43 veritasm ntpd[1974]: Deleting interface #5 eth0,
> 199.104.151.141#123, interface stats: received=789, sent=886, dropped=0,
> active_time=232100 secs
> Dec 26 08:58:36 veritasm kernel: fuse init (API version 7.13)
> Dec 26 08:58:36 veritasm rtkit-daemon[2379]: Sucessfully made thread 23005
> of process 23005 (/usr/bin/pulseaudio) owned by '500' high priority at nice
> level -11.
> Dec 26 08:58:37 veritasm rtkit-daemon[2379]: Sucessfully made thread 23039
> of process 23039 (/usr/bin/pulseaudio) owned by '500' high priority at nice
> level -11.
> Dec 26 08:58:37 veritasm pulseaudio[23039]: pid.c: Daemon already running.
> =====================================================================
>
> and the dmesg portion too
>
> ======================================================================
> e1000e 0000:06:00.0: eth0: registered PHC clock
> e1000e 0000:06:00.0: eth0: (PCI Express:2.5GT/s:Width x1) 00:25:90:c2:ec:00
> e1000e 0000:06:00.0: eth0: Intel(R) PRO/1000 Network Connection
> e1000e 0000:06:00.0: eth0: MAC: 3, PHY: 8, PBA No: 0101FF-0FF
> ADDRCONF(NETDEV_UP): eth0: link is not ready
> e1000e: eth0 NIC Link is Up 1000 Mbps Full Duplex, Flow Control: None
> ADDRCONF(NETDEV_CHANGE): eth0: link becomes ready
> 8021q: adding VLAN 0 to HW filter on device eth0
> eth0: no IPv6 routers present
> NETDEV WATCHDOG: eth0 (e1000e): transmit queue 0 timed out
> e1000e 0000:06:00.0: eth0: Reset adapter unexpectedly
> e1000e 0000:06:00.0: eth0: Timesync Tx Control register not set as expected
> =======================================================================
>
> Hope somebody has an idea of where the problem might be.
>
> Regards,
> Nicola Galante
