During the replacement of some VMware ESXi hosts at a customer site, I discovered a recurring failure of the vSphere Distributed Switch health checks. A VLAN and MTU mismatch was reported. On the physical side, the ESXi hosts were connected to two HPE 5820 switches that were configured as an IRF stack. Inside the VMware bubble, the hosts were sharing a vSphere Distributed Switch.
The switch ports of the old ESXi hosts were configured as Hybrid ports. The switch ports of the new hosts were configured as Trunk ports, to streamline the switch and port configuration.
Some words about port types
Comware knows three different port types:
- Access
- Hybrid
- Trunk
If you are familiar with Cisco, you will know Access and Trunk ports. If you are familiar with HPE ProCurve or Alcatel-Lucent Enterprise, these two port types correspond to untagged and tagged ports.
So what is a Hybrid port? A Hybrid port can belong to multiple VLANs, and each of these VLANs can be either untagged or tagged on that port. Yes, multiple untagged VLANs on a port are possible, but the switch needs additional information to bridge the traffic into the correct untagged VLAN. This additional information can be MAC addresses, IP addresses, LLDP-MED etc. Typically, hybrid ports are used in VoIP deployments.
The benefit of a Hybrid port is that I can put the native VLAN of a specific port, which is often referred to as the Port VLAN identifier (PVID), on that port as a tagged VLAN. This configuration allows all dvPortGroups to have a VLAN tag assigned, even if the VLAN tag represents the native VLAN of a switch port.
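The difference between the two port types can be sketched with a small Python model of the egress tagging decision. This is a simplified illustration, not Comware code: the `Port` class, its fields and the VLAN sets are hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass, field


@dataclass
class Port:
    """Simplified egress model of a switch port (illustrative, not Comware code)."""
    link_type: str                               # "trunk" or "hybrid"
    pvid: int = 1                                # native VLAN of the port
    permitted: set = field(default_factory=set)  # trunk: VLANs allowed on the port
    tagged: set = field(default_factory=set)     # hybrid: VLANs sent with a tag
    untagged: set = field(default_factory=set)   # hybrid: VLANs sent without a tag

    def egress(self, vlan: int) -> str:
        """Return how a frame that belongs to `vlan` leaves the port."""
        if self.link_type == "trunk":
            if vlan not in self.permitted:
                return "blocked"
            # A trunk port sends only its PVID without a tag.
            return "untagged" if vlan == self.pvid else "tagged"
        # A hybrid port follows its per-VLAN tagged/untagged lists.
        if vlan in self.tagged:
            return "tagged"
        if vlan in self.untagged:
            return "untagged"
        return "blocked"


trunk = Port("trunk", pvid=1, permitted={1, 2, 3})
hybrid = Port("hybrid", pvid=1, tagged={1, 2, 3})

print(trunk.egress(1))   # untagged
print(hybrid.egress(1))  # tagged
print(trunk.egress(2))   # tagged
```

With the same PVID of 1, only the hybrid port hands VLAN 1 to the host with a tag, which is what a dvPortGroup with VLAN ID 1 expects.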
Failing health checks
A failed health check raises a vCenter alarm. In my case, a VLAN alarm and an MTU alarm were reported. In both cases, VLAN 1 was causing the error. According to VMware, the three main causes for failed health checks are:
- Mismatched VLAN trunks between a vSphere distributed switch and physical switch
- Mismatched MTU settings between physical network adapters, distributed switches, and physical switch ports
- Mismatched virtual switch teaming policies for the physical switch port-channel settings.
Let’s take a look at the port configuration on the Comware switch:
    #
    interface Ten-GigabitEthernet1/0/9
     port link-mode bridge
     description "ESX-05 NIC1"
     port link-type trunk
     port trunk permit vlan all
     stp edged-port enable
    #
As you can see, this is a normal trunk port. All VLANs will be passed to the host. This is an excerpt from the display interface Ten-GigabitEthernet1/0/9 output:
    PVID: 1
    Mdi type: auto
    Port link-type: trunk
    VLAN passing : 1(default vlan), 2-3, 5-7, 100-109
    VLAN permitted: 1(default vlan), 2-4094
    Trunk port encapsulation: IEEE 802.1q
The native VLAN is 1, which is the default configuration. Traffic that is sent or received on a trunk port is always tagged with the VLAN ID of the originating VLAN, except traffic from the default (native) VLAN! This traffic is sent without a VLAN tag, and if frames for this VLAN are received with a VLAN tag, these frames will be dropped!
If you have a dvPortGroup for the default (native) VLAN, and this dvPortGroup is sending tagged frames, the frames will be dropped if you use a “standard” trunk port. And this is why the health check fails!
Ways to resolve this issue
In my case, the dvPortGroup was configured for VLAN 1, which is the default (native) VLAN on the switch ports.
There are two ways to solve this issue:
- Remove the VLAN tag from the dvPortGroup configuration
- Change the PVID for the trunk port
To change the PVID for a trunk port, you have to enter the following command in the interface context:
[ToR-Ten-GigabitEthernet1/0/9] port trunk pvid vlan 999
You have to change the PVID on all ESXi-facing switch ports. You can use a non-existent VLAN ID for this.
The vSphere Distributed Switch health check will switch to green for VLAN and MTU immediately.
Please note that this is not a solution for all VLAN-related problems. You should make sure that you are not getting any side effects.
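Why the PVID change helps can be sketched in a few lines of Python. This is a simplified model under the assumption that the trunk port permits all VLANs and sends only its PVID without a tag; the function name and the dvPortGroup VLAN IDs are hypothetical example values.

```python
def untagged_portgroup_vlans(portgroup_vlans, pvid):
    """Return the tagged dvPortGroup VLANs that a trunk port would send untagged.

    Simplified model: a trunk port sends every permitted VLAN with a tag,
    except its PVID, which leaves the port without a tag.
    """
    return sorted(v for v in portgroup_vlans if v == pvid)


# VLAN IDs assigned to dvPortGroups (example values, including VLAN 1)
portgroup_vlans = {1, 2, 3}

print(untagged_portgroup_vlans(portgroup_vlans, pvid=1))    # [1]
print(untagged_portgroup_vlans(portgroup_vlans, pvid=999))  # []
```

With a PVID of 1, the dvPortGroup for VLAN 1 collides with the native VLAN. Once the PVID points to a VLAN that no dvPortGroup uses, every dvPortGroup VLAN is carried with a tag.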