Demystifying "Interfaces on which heartbeats are not seen"

By accident, I found a heartbeat/ VLAN issue on a NetScaler cluster at one of my customers. The NetScaler ADC appliances have three interfaces connected to a switch stack. Two of the three interfaces were configured as a channel (LAG). This is a snippet from the config:

set channel LA/1 -tagall ON -throughput 0 -lrMinThroughput 0 -bandwidthHigh 0 -bandwidthNormal 0
...
bind vlan 10 -ifnum 1/3
bind vlan 10 -ifnum LA/1 -tagged
bind vlan 54 -ifnum LA/1 -tagged
bind vlan 55 -ifnum LA/1 -tagged

On the switch stack, the port to which interface 1/3 is connected, is configured as an access port. The ports, to which the channel is connected, is configured as a trunk port with some permitted VLANs. The customer is using HPE Comware based switches. The terminology is the same for Cisco. If you use HPE ProVision or Alcatel Lucent Enterprise, translate “access” to “untagged” and “trunk” to “tagged”. Because the channel is configured as a trunk port on the switch, the tagall option was set.

Issue

While examining the output of  show ha node I saw this:

Interfaces on which heartbeats are not seen : LA/1

Because interface 1/3 was not affected, this had to be a VLAN issue. During the initial troubleshooting, I was able to discover heartbeat packets in VLAN 1 and in VLAN 10.

Solution

The solution was easy: Remove the tagged option for VLAN 10 on LA/1.

bind vlan 10 -ifnum LA/1

instead of

bind vlan 10 -ifnum LA/1 -tagged

Because of the configured tagall  option, all packets sourced by LA/1 are tagged with the corrosponding VLAN ID. But because it’s now explicitly configured without a tag for VLAN 10, VLAN 10 is now also the native VLAN for LA/1.

> show channel

1)      Interface LA/1 (802.3ad Link Aggregate) #14
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
        MTU=1500, **native vlan=10**, MAC=02:e0:ed:38:9d:d2, uptime 1362h58m51s

Now the NetScaler was sending heartbeat packets with a tag for VLAN 10, and the issue was solved.

Explanation

Heartbeat packets are always send without a VLAN tag (untagged). There are two exceptions:

  • The NSVLAN is configured with a specific VLAN ID, or
  • an interface used for hearbeats is configured with the tagall

In this case, the heartbeat packets are tagged with the ID of the native VLAN ID of the interface. A show interface of the channel showed, that the channel was using VLAN 1 as the native VLAN.

> show channel

1)      Interface LA/1 (802.3ad Link Aggregate) #14
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
        MTU=1500, **native vlan=1**, MAC=02:e0:ed:38:9d:d2, uptime 1362h55m13s

How does the NetScaler determine the native VLAN for an interface? The native VLAN is the VLAN, to which an interface is bound untagged. An interface can only be bound untagged to a single VLAN. But it can be bound tagged to multiple VLANs.

If you take a look at the config snippet at the top of this blog post, you might notice, that interface 1/3 is bound untagged to VLAN 10. So this is the native VLAN for interface 1/3. But this interface is not using the tagall  option. Therefore, heartbeat packets are not tagged. The channel LA/1 is bound tagged to VLAN 10. But it was also bound to VLAN 1, without the tagged  option. This caused, that VLAN 1 was used as the native VLAN for channel LA/1. And because LA/1 is configured with the tagall  option, the heartbeats were tagged with a tag for VLAN 1. That’s why I was able to see the heartbeats, that were send over channel LA/1, in VLAN 1.

In the end, the NetScaler appliances were sending heartbearts from interface 1/3 to VLAN 10, and from channel LA/1 to VLAN 1. This caused the message “Interfaces on which heartbeats are not seen: LA/1”.