SCIENTIFIC-LINUX-DEVEL Archives

February 2009

SCIENTIFIC-LINUX-DEVEL@LISTSERV.FNAL.GOV

Subject:
From: Steven Timm <[log in to unmask]>
Reply To: Steven Timm <[log in to unmask]>
Date: Thu, 12 Feb 2009 09:21:14 -0600
A large number of worker nodes in the CDF Grid cluster have been
experiencing intermittent network failures.  Although intermittent, the
failures are catastrophic when they happen: the network card gives out in
the middle of a TCP socket transmission and leaves the central node hung
waiting for the transfer to finish, which never happens.  This renders
the cluster unusable.  The failures are load-related.

This error happened from time to time over the life of these nodes, but
for the three years they have been here they ran more or less stably under
LTS45/x86_64 with the tg3 3.77 driver.  Four racks of identical hardware in
the D0 clusters are still running that version of the kernel and are not
having problems.

The current version is LTS47, kernel 2.6.9-78.0.1.ELsmp, x86_64, which
includes v3.86 of the Broadcom tg3 kernel driver.  Details of the error
message are below.

CDF has 237 of these nodes, D0 has 180 or so.  Neither experiment
can afford to have that many nodes out of production.  We have done
a lot of googling and seen many other people who have the same problem
but nobody with a solution yet.  Worse, it appears that the same problem
is likely to exist in SL5 as well as in several other distros.

NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: DEBUG: MAC_TX_STATUS[00000008] MAC_RX_STATUS[00000000]
tg3: DEBUG: RDMAC_STATUS[00000000] WDMAC_STATUS[00000010]
tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
tg3: tg3_stop_block timed out, ofs=4c00 enable_bit=2
tg3: eth1: Link is down.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
tg3: eth1: Flow control is off for TX and off for RX.
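
With several hundred nodes involved, it may help to scan collected kernel logs for this signature automatically. The following is a minimal sketch, not anything currently run on the cluster: it assumes Python 3 is available and that the log text has already been gathered into a string. The function name and the sample text are hypothetical; the pattern is taken from the tg3 line shown above.

```python
import re
from collections import Counter

# Matches the driver line shown above, for example:
#   "tg3: eth1: transmit timed out, resetting"
# The "NETDEV WATCHDOG" line is deliberately not matched, so each
# event is counted once.
TIMEOUT_RE = re.compile(r"tg3: (\w+): transmit timed out")


def count_tx_timeouts(log_text):
    """Return a Counter mapping interface name to number of tg3 TX timeouts."""
    return Counter(m.group(1) for m in TIMEOUT_RE.finditer(log_text))


# Hypothetical sample, copied from the excerpt in this message.
sample = """\
NETDEV WATCHDOG: eth1: transmit timed out
tg3: eth1: transmit timed out, resetting
tg3: eth1: Link is down.
tg3: eth1: Link is up at 1000 Mbps, full duplex.
"""

if __name__ == "__main__":
    for iface, n in count_tx_timeouts(sample).items():
        print(iface, n)
```

Counting only the `tg3:` line keeps the tally at one per reset, which makes per-node totals easier to compare.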


--------------
From lspci -v
01:05.1 Ethernet controller: Broadcom Corporation NetXtreme BCM5704 Gigabit Ethernet (rev 10)
         Subsystem: Super Micro Computer Inc: Unknown device 1648
         Flags: bus master, 66Mhz, medium devsel, latency 64, IRQ 193
         Memory at fc9e0000 (64-bit, non-prefetchable) [size=64K]
         Expansion ROM at <ignored> [disabled]
         Capabilities: [40] PCI-X non-bridge device.
         Capabilities: [48] Power Management version 2
         Capabilities: [50] Vital Product Data
         Capabilities: [58] Message Signalled Interrupts: 64bit+ Queue=0/3 Enable-

dmesg from the bootup:

tg3.c:v3.86 (November 9, 2007)
ACPI: PCI Interrupt 0000:01:05.0[A] -> GSI 29 (level, low) -> IRQ 185
divert: allocating divert_blk for eth0
eth0: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:30:48:76:ec:4e
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
ACPI: PCI Interrupt 0000:01:05.1[B] -> GSI 30 (level, low) -> IRQ 193
divert: allocating divert_blk for eth1
eth1: Tigon3 [partno(BCM95704A6) rev 2100 PHY(5704)] (PCIX:133MHz:64-bit) 10/100/1000Base-T Ethernet 00:30:48:76:ec:4f
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
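
The bracketed flags in the boot messages differ between the two ports (eth0 reports ASF[1] TSOcap[0], eth1 reports ASF[0] TSOcap[1]). Whether that difference is related to the timeouts is not established here. To compare the flags across many nodes, something like the sketch below could be used; it assumes Python 3 and that the dmesg text is already in a string. The function name is hypothetical and the sample lines are copied from the output above.

```python
import re

# Matches flag lines such as:
#   "eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]"
FLAG_LINE_RE = re.compile(r"^(eth\d+): (RXcsums\[\d\].*)$", re.MULTILINE)
# Matches one "Name[digit]" pair within such a line.
FLAG_RE = re.compile(r"(\w+)\[(\d)\]")


def parse_tg3_flags(dmesg_text):
    """Return {interface: {flag_name: int}} for each tg3 flag line found."""
    flags = {}
    for iface, rest in FLAG_LINE_RE.findall(dmesg_text):
        flags[iface] = {name: int(val) for name, val in FLAG_RE.findall(rest)}
    return flags


# Sample lines copied from the boot messages in this message.
sample_dmesg = """\
eth0: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[1] WireSpeed[1] TSOcap[0]
eth0: dma_rwctrl[769f4000] dma_mask[64-bit]
eth1: RXcsums[1] LinkChgREG[0] MIirq[0] ASF[0] WireSpeed[1] TSOcap[1]
eth1: dma_rwctrl[769f4000] dma_mask[64-bit]
"""

if __name__ == "__main__":
    for iface, f in sorted(parse_tg3_flags(sample_dmesg).items()):
        print(iface, "TSOcap =", f["TSOcap"], "ASF =", f["ASF"])
```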

[root@fcdfcaf853 ~]# ethtool -i eth1
driver: tg3
version: 3.86
firmware-version: 5704-v3.36
bus-info: 0000:01:05.1

It is not clear whether this is the latest network firmware.
--------------------------

The hints we have so far suggest that the new v3.86 driver is
trying to implement one of the TCP offload features on the NIC
which, in the case of the BCM5704, is broken.  There is some talk
that a SCSI controller on the same PCI bus as the BCM5704 can cause
trouble, but these boards have no SCSI controller.
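
One way to test the offload hint would be to turn an offload feature off on a few nodes and see whether the timeouts stop. The sketch below is only an illustration under stated assumptions: the hints do not say which offload feature is at fault, so TCP segmentation offload (TSO) is a guess based on the TSOcap[1] flag reported for eth1. It also assumes Python 3 and root access. The helper names are hypothetical; `ethtool -k` and `ethtool -K <iface> tso off` are the standard ethtool options for showing and changing offload settings.

```python
import subprocess


def tso_commands(iface):
    """Build ethtool argv lists to show offload settings and turn TSO off."""
    return [
        ["ethtool", "-k", iface],                # show current offload settings
        ["ethtool", "-K", iface, "tso", "off"],  # disable TCP segmentation offload
    ]


def apply_tso_off(iface, dry_run=True):
    """Print the commands, or run them when dry_run is False (needs root)."""
    for argv in tso_commands(iface):
        if dry_run:
            print(" ".join(argv))
        else:
            subprocess.run(argv, check=True)


if __name__ == "__main__":
    # Dry run only: prints the commands without touching the interface.
    apply_tso_off("eth1")
```

A setting changed this way does not survive a reboot, so it is suitable for a trial on a handful of nodes before anything is made permanent.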


Any help is appreciated.

Steve Timm







-- 
------------------------------------------------------------------
Steven C. Timm, Ph.D  (630) 840-8525
[log in to unmask]  http://home.fnal.gov/~timm/
Fermilab Computing Division, Scientific Computing Facilities,
Grid Facilities Department, FermiGrid Services Group, Assistant Group Leader.
