Trouble with Broadcom NetXtreme II and VMware ESXi

Today I ran into a really nasty problem. I have four HP ProLiant DL360 G6 servers in my lab. This server type has two onboard 1 GbE NICs with the Broadcom NetXtreme II BCM5709 chip, which are usually claimed by the bnx2 driver. While applying a host profile to three of the hosts, one host reported an error: supposedly the host had no vmnic0, and because of this the host profile couldn’t be applied. Okay, quick check in the vSphere Web Client: only three NICs. The C# client showed the same result. Now it got interesting:

/var/log # esxcfg-nics -l
Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
vmnic1  0000:02:00.01 bnx2        Up   1000Mbps  Full   00:26:55:7c:da:82 1500   Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter
vmnic2  0000:04:00.00 bnx2        Up   1000Mbps  Full   68:b5:99:bc:6a:8c 1500   Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter
vmnic3  0000:04:00.01 bnx2        Up   1000Mbps  Full   68:b5:99:bc:6a:8e 1500   Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter
/var/log # lspci | grep vmnic
0000:02:00.0 Network controller: Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter [vmnic0]
0000:02:00.1 Network controller: Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter [vmnic1]
0000:04:00.0 Network controller: Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter [vmnic2]
0000:04:00.1 Network controller: Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter [vmnic3]

Okay… lspci shows four NICs, esxcfg-nics only three.
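The comparison above can also be scripted. The following is only a minimal sketch, assuming the two command outputs were saved as text. The function names (`nics_from_esxcfg`, `nics_from_lspci`, `missing_uplinks`) are made up for illustration and are not part of any VMware tooling. The sample data is the output shown above.

```python
import re

# Output of "esxcfg-nics -l" as shown above.
ESXCFG_NICS = """\
Name    PCI           Driver      Link Speed     Duplex MAC Address       MTU    Description
vmnic1  0000:02:00.01 bnx2        Up   1000Mbps  Full   00:26:55:7c:da:82 1500   Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter
vmnic2  0000:04:00.00 bnx2        Up   1000Mbps  Full   68:b5:99:bc:6a:8c 1500   Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter
vmnic3  0000:04:00.01 bnx2        Up   1000Mbps  Full   68:b5:99:bc:6a:8e 1500   Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter
"""

# Output of "lspci | grep vmnic" as shown above.
LSPCI = """\
0000:02:00.0 Network controller: Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter [vmnic0]
0000:02:00.1 Network controller: Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter [vmnic1]
0000:04:00.0 Network controller: Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter [vmnic2]
0000:04:00.1 Network controller: Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter [vmnic3]
"""


def nics_from_esxcfg(text):
    """Return the vmnic names listed in the first column of esxcfg-nics output."""
    return {line.split()[0] for line in text.splitlines() if line.startswith("vmnic")}


def nics_from_lspci(text):
    """Return the vmnic aliases that lspci prints in square brackets."""
    return set(re.findall(r"\[(vmnic\d+)\]", text))


def missing_uplinks(lspci_text, esxcfg_text):
    """Return aliases that exist as PCI devices but are not usable uplinks."""
    return sorted(nics_from_lspci(lspci_text) - nics_from_esxcfg(esxcfg_text))


print(missing_uplinks(LSPCI, ESXCFG_NICS))  # prints ['vmnic0']
```

On a host with more NICs this saves comparing the two lists by eye. Here it only confirms what was already visible.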

/var/log # ethtool -i vmnic0
driver: bnx2
version: 2.2.4f.v55.3
firmware-version: bc 5.2.3 NCSI 2.0.12
bus-info: 0000:02:00.0

Okay, vmnic0 is claimed by a driver. Quick check with another DL360 G6: same firmware and driver. Let’s dig deeper.

/var/log # grep vmnic0 *
shell.log:2014-04-22T12:17:25Z shell[35761]: [root]: grep vmnic0 *
vmkdevmgr.log:2014-04-22T12:14:28Z vmkdevmgr: AddAlias: Not commiting alias vmnic0 for busAddress p0000:02:00.0
vmkdevmgr.log:2014-04-22T12:14:28Z vmkdevmgr: AddAlias: skipping matching alias vmnic0 for pci device p0000:02:00.0 with assigned alias vmnic0
vmkernel.log:2014-04-22T12:14:28.981Z cpu0:33378)PCI: 1095: 0000:02:00.0 named 'vmnic0' (was '')
vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)<6>bnx2 0000:02:00.0: vmnic0: Broadcom NetXtreme II BCM5709 1000Base-T (C0) PCI Express found at mem f2000000, IRQ 17, node addr 00:26:55:7c:da:80
vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)<6>bnx2 0000:02:00.0: vmnic0: NetQueue Ops registered [0]
vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)VMK_PCI: 395: Device 0000:02:00.0 name: vmnic0
vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)Uplink: 6511: Device vmnic0 not yet opened
vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)DMA: 612: DMA Engine 'vmnic0' created using mapper 'DMANull'.
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 8230: Opening device vmnic0
vmkernel.log:2014-04-22T12:14:33.431Z cpu3:32836)IRQ: 540: 0x39 <vmnic0-0> exclusive, flags 0x10
vmkernel.log:2014-04-22T12:14:33.431Z cpu3:32836)Uplink: 8111: Network device open handler failed for 'vmnic0': Failure
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 8260: Device vmnic0 failed to open
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 6807: Device vmnic0 not yet opened
vmkernel.log:2014-04-22T12:14:33.773Z cpu12:33407)<6>bnx2x: Added CNIC device: vmnic0
vmkernel.log:2014-04-22T12:14:33.773Z cpu12:33407)<3>bnx2x: vmnic0 - Num 1G iSCSI licenses = 65535

Ah, okay. That looks interesting:

vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)VMK_PCI: 395: Device 0000:02:00.0 name: vmnic0
vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)Uplink: 6511: Device vmnic0 not yet opened
vmkernel.log:2014-04-22T12:14:33.429Z cpu12:33406)DMA: 612: DMA Engine 'vmnic0' created using mapper 'DMANull'.
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 8230: Opening device vmnic0
vmkernel.log:2014-04-22T12:14:33.431Z cpu3:32836)IRQ: 540: 0x39 <vmnic0-0> exclusive, flags 0x10
vmkernel.log:2014-04-22T12:14:33.431Z cpu3:32836)Uplink: 8111: Network device open handler failed for 'vmnic0': Failure
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 8260: Device vmnic0 failed to open
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 6807: Device vmnic0 not yet opened
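To spot this pattern on other hosts without reading the whole log, a small filter is enough. This is a rough sketch, assuming the "failed to open" message looks exactly like the one in the excerpt above. The line number after `Uplink:` is matched loosely because it presumably differs between ESXi builds. The name `failed_uplinks` is hypothetical.

```python
import re

# Lines copied from the grep output above.
SAMPLE_LOG = """\
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 8230: Opening device vmnic0
vmkernel.log:2014-04-22T12:14:33.431Z cpu3:32836)IRQ: 540: 0x39 <vmnic0-0> exclusive, flags 0x10
vmkernel.log:2014-04-22T12:14:33.431Z cpu3:32836)Uplink: 8111: Network device open handler failed for 'vmnic0': Failure
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 8260: Device vmnic0 failed to open
vmkernel.log:2014-04-22T12:14:33.431Z cpu12:33406)Uplink: 6807: Device vmnic0 not yet opened
"""

# Assumption: the message format matches the excerpt above.
OPEN_FAILED = re.compile(r"Uplink: \d+: Device (vmnic\d+) failed to open")


def failed_uplinks(log_text):
    """Return the vmnic names that failed to open, in order of first appearance."""
    seen = []
    for line in log_text.splitlines():
        match = OPEN_FAILED.search(line)
        if match and match.group(1) not in seen:
            seen.append(match.group(1))
    return seen


print(failed_uplinks(SAMPLE_LOG))  # prints ['vmnic0']
```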

At this point I asked Google and found a discussion in the VMTN in which @VirtuallyMikeB had participated. Unfortunately, the posted solution (power off the server and pull the power cables) didn’t help (it would have surprised me if it had…). That solution comes from this blog article. Although it was not the solution, it prompted me to try something else: a firmware update, because this may reset the NIC as well. I booted the server from a USB stick with the current SPP 2014.02. The automatic firmware update updated the BIOS, the iLO board, the NICs, the Smart Array controller, the whole damn server, every part of it. Okay, the server was a “bit” outdated… To make a long story short: the firmware update did the trick.

EDIT: And it seems that I’m not the only one…

A word of warning: Julian Wood wrote a blog article about a firmware update that kills Broadcom NICs in HP ProLiant G2 up to G7 servers. He also links to a customer advisory from HP. The following NICs are affected:

  • HP NC373T PCIe Multifunction Gig Server Adapter
  • HP NC373F PCIe Multifunction Gig Server Adapter
  • HP NC373i Multifunction Gigabit Server Adapter
  • HP NC374m PCIe Multifunction Adapter
  • HP NC373m Multifunction Gigabit Server Adapter
  • HP NC324i PCIe Dual Port Gigabit Server Adapter
  • HP NC326i PCIe Dual Port Gigabit Server Adapter
  • HP NC326m PCI Express Dual Port Gigabit Server Adapter
  • HP NC325m PCIe Quad Port Gigabit Server Adapter
  • HP NC320i PCIe Gigabit Server Adapter
  • HP NC320m PCI Express Gigabit Server Adapter
  • HP NC382i DP Multifunction Gigabit Server Adapter
  • HP NC382T PCIe DP Multifunction Gigabit Server Adapter
  • HP NC382m DP 1GbE Multifunction BL-c Adapter
  • HP NC105i PCIe Gigabit Server Adapter

Don’t update the affected NICs with the HP Smart Update Manager (HP SUM) or the HP Service Pack for ProLiant (HP SPP) 2014.02.0. If you update one of the affected NICs with the firmware smart component, be sure to avoid updating the Comprehensive Configuration Management (CCM) firmware to version 7.8.21.
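To check whether a host carries one of the listed adapters, the model number can be pulled out of the description column of esxcfg-nics. This is just a sketch, assuming the model token (for example NC382i) appears in the description the same way it does in the output earlier in this post. The name `affected_model` is made up for illustration.

```python
import re

# Model numbers taken from the list of affected NICs above.
AFFECTED_MODELS = {
    "NC373T", "NC373F", "NC373i", "NC374m", "NC373m",
    "NC324i", "NC326i", "NC326m", "NC325m", "NC320i",
    "NC320m", "NC382i", "NC382T", "NC382m", "NC105i",
}

# Assumption: the model appears as "NC" + three digits + one letter.
MODEL_TOKEN = re.compile(r"\bNC\d{3}[A-Za-z]\b")


def affected_model(description):
    """Return the affected model found in a NIC description, or None."""
    for token in MODEL_TOKEN.findall(description):
        if token in AFFECTED_MODELS:
            return token
    return None


# Descriptions as reported by esxcfg-nics on my host.
onboard = "Broadcom Corporation NC382i Integrated Multi Port PCI Express Gigabit Server Adapter"
addon = "Broadcom Corporation NC382T PCI Express Dual Port Multifunction Gigabit Server Adapter"

print(affected_model(onboard))  # prints NC382i
print(affected_model(addon))    # prints NC382T
```

Both NIC types in my DL360 G6 are on the list, so the warning applies to exactly this kind of host.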

EDIT: Hewlett-Packard published HP Service Pack for ProLiant (SPP) Version 2014.02.0(B), which addresses several issues, not only the issue with Broadcom NICs. This is taken from the HP website:

This updated version of the SPP was released to address the OpenSSL issue.  See HPN Customer Notice: OpenSSL HeartBleed Vulnerability.  Additionally for Red Hat Enterprise Linux 6 customers, please reference the Red Hat knowledge base article, OpenSSL CVE-2014-0160.  Products affected:

  • HP Onboard Administrator for Windows and Linux version 4.12 replaced 4.11
  • HP System Management Homepage for Windows and Linux version 7.3.2 replaced 7.3.1.4
  • HP Integrated Lights-Out 2 for Windows and Linux version 2.25 replaced 2.23
  • HP BladeSystem c-Class Virtual Connect Firmware, Ethernet plus 4/8Gb 20-port and 8Gb 24-port FC Edition Component for Windows and Linux version 4.10(b) replaced 4.10
  • HP Smart Update Manager version 6.3.1 replaced 6.2.0

This release also resolves the Broadcom Comprehensive Configuration Management Firmware issue with version 7.8.21 found in the Service Pack for ProLiant 2014.02.0.  See Customer Advisory c04258304 for additional information.

Thanks to Rotem Agmon, who has posted a comment with this information.