Recently we’ve been seeing a lot of errors with the Broadcom BCM570x series of Gigabit Ethernet adapters under SUSE Linux Enterprise Server 9. The symptom is that the interface simply hangs under high traffic and stops passing packets, eventually logging:
Dec 1 01:17:46 dev03 kernel: NETDEV WATCHDOG: eth2: transmit timed out
Dec 1 01:17:46 dev03 kernel: tg3: eth2: transmit timed out, resetting
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=2c00 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=3400 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=2400 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=1800 enable_bit=2
Dec 1 01:17:46 dev03 kernel: tg3: tg3_stop_block timed out, ofs=4800 enable_bit=2
It’s become a very serious issue for us because all of our HP-DL380 servers have Broadcom BCM570x controllers on board. The problem seems to occur more frequently now that we’ve upgraded to an SP2 (and later) SLES9 kernel, although we have had problems dating back several months with older kernels.
Doing some research on the Internet, I’ve found that this is a very common problem out in the field. In a summary document I prepared for management, I wrote the following:
Other customers in the field have reported the same problems running RedHat Enterprise Server 3, Debian GNU Linux, FreeBSD/NetBSD and even Novell Netware (internal communication with Novell PSE). In many of the reported incidents, customers were running the same server hardware (HP/Compaq Proliant DL-3×0 series) as CBC.ca. [See HP IT Resource Centre thread #898761, where customers have reported issues with a variety of HP hardware and operating systems.]
There are a number of root causes, including Linux driver instability (the Tigon3 (tg3) driver was created by reverse-engineering the Broadcom bcm5700 driver due to the low quality of the latter) and manufacturing defects (defects in some Broadcom 5704 chips afflicted Sun’s initial customer shipment of Sun Fire V210 and V240 servers in 2003, leading to Sun Alert #55620. The impact of such defects beyond Sun is unclear because Broadcom refused to provide further details).
Right now, we’re awaiting feedback from HP and Novell on how they plan to resolve this issue. In the meantime, we’re going to stockpile some Intel Gigabit Ethernet cards.
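For anyone who wants to gauge how often their own interfaces are hitting this, here is a minimal sketch that counts tg3 transmit-timeout resets per host and interface. It assumes the syslog lines look exactly like the excerpt above; the function name and the regular expression are illustrative and not part of any vendor tool.

```python
import re
from collections import Counter

# Matches the tg3 reset line from the log excerpt, e.g.
# "Dec 1 01:17:46 dev03 kernel: tg3: eth2: transmit timed out, resetting"
# Assumption: the classic syslog layout (month, day, time, host, "kernel:").
TIMEOUT_RE = re.compile(
    r"^\w{3}\s+\d+\s+[\d:]{8}\s+(?P<host>\S+)\s+kernel:\s+"
    r"tg3:\s+(?P<iface>\w+):\s+transmit timed out"
)


def count_tg3_timeouts(lines):
    """Return a Counter mapping (host, interface) to the number of tg3 resets."""
    counts = Counter()
    for line in lines:
        match = TIMEOUT_RE.match(line)
        if match:
            counts[(match.group("host"), match.group("iface"))] += 1
    return counts
```

The function accepts any iterable of lines, so an open log file can be passed in directly. The log location varies by distribution and syslog configuration, so the path is left to the caller. Only the `tg3: ethN: transmit timed out` line is counted, so each hang is tallied once rather than once per accompanying `NETDEV WATCHDOG` or `tg3_stop_block` message.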