Tag Archives: troubleshooting

NetScaler Gateway – Cannot complete your request

This posting is ~4 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

A customer reported a weird problem with his NetScaler Gateway. Upon the first load of the website, they got the error “Cannot complete your request”. After clicking OK, the error disappeared and did not occur again when reloading the website – only after closing and re-opening the browser. I got this message in Firefox and Internet Explorer, but not from a remote machine, e.g. my PC at the office.

I found no configuration error or anything else that would have explained this message. Finally, I found something that caught my attention:

HTTP/1.1 412 Precondition Failed

I found this using the Firefox web developer tools (I only had Firefox and IE on the remote machine). With this message I found CTX244520, which explained this error. The issue is caused by a hidden feature for caching website data on the Gateway vServer, called Static Page Caching. If you don’t have the Integrated Caching feature licensed or enabled, this feature fails.
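
You can also check the returned HTTP status code without a browser. A minimal PowerShell sketch, assuming a hypothetical gateway URL (in Windows PowerShell, non-2xx responses are thrown as exceptions, so the code is read from the exception):

# Check which HTTP status code the gateway returns (URL is an assumption)
try {
    $response = Invoke-WebRequest -Uri 'https://gateway.customer.tld/vpn/index.html' -UseBasicParsing
    $response.StatusCode
}
catch {
    # Non-2xx answers (e.g. 412 Precondition Failed) end up here
    [int]$_.Exception.Response.StatusCode
}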

My customer is currently running NS 12.0 60.10; this issue is fixed in 12.0 61.8. The customer is also using a custom theme, which is based on one of the included themes.

If possible, you can enable Integrated Caching. If you can’t enable Integrated Caching, you can simply disable Static Page Caching:

   show aaa parameter
   Configured AAA parameters
           EnableStaticPageCaching: YES
           EnableEnhancedAuthFeedback: NO
           DefaultAuthType: LOCAL  MaxAAAUsers: 1000
           AAAD nat ip: None
           EnableSessionStickiness : NO
           aaaSessionLoglevel : INFORMATIONAL
           AAAD Log Level : INFORMATIONAL
           Dynamic address: OFF
           GUI mode: ON
           Max Saml Deflate Size: 1024
    Done
   set aaa parameter -enableStaticPageCaching NO
    Done

Out of space – first steps when a datastore runs out of space

This posting is ~4 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

This is a situation that should never happen, and I had to deal with it only a couple of times in more than 10 years of working with VMware vSphere/ESXi. In most cases, the reason for this was the usage of thin-provisioned disks together with small datastores. Yes, that’s a bad design. Yes, this should never happen.

There is a nearly 100% chance that this setup will fail one day, either because someone dumps a lot of data into the VMs, or because of VM snapshots. But such a setup WILL FAIL one day.

Yesterday was one of these days, and five VMs stopped working on a small ESXi host in a site of one of my customers. A quick look into the vCenter confirmed my first assumption: the datastore was full. My second thought: Why are there so many VMs on that small ESXi host, and why are they thin-provisioned?

The vCenter showed me the following message on each VM:

There is no more space for virtual disk $VMNAME.vmdk. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.

Okay, what to do? First things first:

  1. Is there any unallocated space left on the RAID group? If yes, expand the VMFS.
  2. Are there any VM snapshots left? If yes, remove them (see the PowerCLI sketch after this list).
  3. Configure 100% memory reservation for the VMs. This removes the VM memory swap files and releases a decent amount of disk space.
  4. Remove ISO files from the datastore.
  5. Remove VMs (if you have a backup and they are not necessary for the business).
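
For the first two checks, PowerCLI can speed things up. A minimal sketch, assuming an existing vCenter connection and a hypothetical datastore name:

# Check capacity and free space of the affected datastore (name is an assumption)
Get-Datastore -Name 'SmallDatastore01' | Select-Object Name, CapacityGB, FreeSpaceGB

# List all snapshots of the VMs on that datastore, largest first
Get-Datastore -Name 'SmallDatastore01' | Get-VM | Get-Snapshot |
    Sort-Object SizeGB -Descending |
    Select-Object VM, Name, Created, SizeGB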

This should allow you to continue the operation of the VMs. To solve the problem permanently:

  1. Add disks to the server and expand the VMFS, or create a new datastore
  2. Add a NFS datastore
  3. Remove unnecessary VMs
  4. Set up working monitoring and alarms, do not overprovision datastores, or switch to eager-zeroed disks

Such an issue should not happen. It is not rude to say it: this is simply bad design and a lack of operational processes.

Windows NPS – Authentication failed with error code 16

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

Today, a customer called me and reported, at first sight, a pretty weird error: only Windows clients were unable to log in to a WPA2-Enterprise wireless network. The setup itself was pretty simple: Cisco Meraki WiFi access points, a Windows Network Policy Server (NPS) on a Windows Server 2016 domain controller, and a Sophos SG 125 acting as DHCP server for the different WiFi networks.


Windows clients failed to authenticate, but Apple iOS, Android, and even Windows 10 Tablets had no problem.

The following error was logged to the Windows Security event log.

Authentication Details:
Connection Request Policy Name: Use Windows authentication for all users
Network Policy Name: Wireless Users
Authentication Provider: Windows
Authentication Server: domaincontroller.domain.tld
Authentication Type: PEAP
EAP Type: -
Account Session Identifier: -
Logging Results: Accounting information was written to the local log file.
Reason Code: 16
Reason: Authentication failed due to a user credentials mismatch. Either the user name provided does not map to an existing user account or the password was incorrect.

The credentials were definitely correct; the customer and I tried different user and password combinations.

I also checked the NPS network policy. When choosing PEAP as the authentication type, the NPS needs a valid server certificate. This is necessary because the EAP session is protected by a TLS tunnel. A valid certificate was in place, in this case a wildcard certificate. A second certificate was also present: a certificate for the domain controller, issued by the internal enterprise CA.
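
To see which server certificates the NPS can choose from, a quick look into the computer certificate store of the NPS server helps. A minimal PowerShell sketch:

# List certificates in the local machine store that are valid for server authentication
Get-ChildItem Cert:\LocalMachine\My |
    Where-Object { $_.EnhancedKeyUsageList.FriendlyName -contains 'Server Authentication' } |
    Select-Object Subject, Issuer, NotAfter, Thumbprint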

It was an educated guess, but I disabled the server certificate check for the WPA2-Enterprise connection, and the client was able to log in to the WiFi. This clearly showed that the certificate was the problem. But it was valid, all necessary CA certificates were in place, and there was no obvious reason why the certificate should be the cause.

The customer told me that they had installed updates on Friday (this was a Monday), and a reboot of the domain controller was issued. This also restarted the NPS service, and with this restart, the wildcard certificate was used for client connections.

I switched to the domain controller certificate, restarted the NPS, and all Windows clients were again able to connect to the WiFi.

Lessons learned

Try to avoid wildcard certificates, or at least check the certificate that is used by the NPS if you get authentication errors with reason code 16.

Client-specific message size limits – or the reason why iOS won’t send emails

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

Last week, a customer complained that he could not send emails with pictures from the native iOS email app. He attached three, four or five pictures to an email, pushed the send button, and an error was displayed instantly.

We checked the different connectors as well as the organizational limit for messages. The test mails were between 10 and 20 MB, and the message size limit was much higher.


The cross-check with Outlook Web Access indicated that the issue was not a configured limit on one of the Exchange connectors. Instead, a quick search directed us towards the client-specific message size limits. This statement in particular caught our attention:

For any message size limit, you need to set a value that’s larger than the actual size you want enforced. This accounts for the Base64 encoding of attachments and other binary data. Base64 encoding increases the size of the message by approximately 33%, so the value you specify should be approximately 33% larger than the actual message size you want enforced. For example, if you specify a maximum message size value of 64 MB, you can expect a realistic maximum message size of approximately 48 MB.

The message size limit for ActiveSync is 10 MB (Source). This is a server limit which can’t be configured using the Exchange Admin Center. Taking the ~33% Base64 overhead into account, the effective message size limit is ~6.5 MB. My customer and I were able to confirm this assumption: a 10 MB mail got stuck in the outbox, a 6 MB mail was sent.

How to change client-specific message size limits?

In this case, my customer and I only changed the ActiveSync limit. You can use the commands below to change the limit. This will raise the limit to ~67 MB; without the Base64 overhead, this value allows message sizes up to ~50 MB. You have to run these commands from an administrative CMD.

%windir%\system32\inetsrv\appcmd.exe set config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering /requestLimits.maxAllowedContentLength:69730304
%windir%\system32\inetsrv\appcmd.exe set config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.web/httpRuntime /maxRequestLength:68096
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering /requestLimits.maxAllowedContentLength:69730304
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:system.web/httpRuntime /maxRequestLength:68096
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:appSettings /[key='MaxDocumentDataSize'].value:69730304

Make sure that you restart IIS after the changes: run iisreset from an administrative CMD.
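
To verify the change, appcmd.exe can also list the configured values. A small sketch, run from an elevated PowerShell (the MB conversion only illustrates why 69730304 bytes roughly correspond to a 50 MB real-world message):

# List the request filtering limits of the frontend ActiveSync virtual directory
& "$env:windir\system32\inetsrv\appcmd.exe" list config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering

# 69730304 bytes are ~66.5 MB; minus the ~33% Base64 overhead this leaves ~50 MB of payload
69730304 / 1MB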

Please note that you have to re-run these commands after installing an Exchange Server Cumulative Update (CU), because the files in which the changes are made will be overwritten by the CU. This statement is from Microsoft:

Any customized Exchange or Internet Information Server (IIS) settings that you made in Exchange XML application configuration files on the Exchange server (for example, web.config files or the EdgeTransport.exe.config file) will be overwritten when you install an Exchange CU. Be sure to save this information so you can easily re-apply the settings after the install. After you install the Exchange CU, you need to re-configure these settings.

The maximum size for a message sent by Exchange Web Services clients is 64 MB, which is much more than the 10 MB for ActiveSync. This might explain why customers that use the Outlook for iOS app might not notice this issue.

EDIT: Today I found a blog post written by Frank Zöchling in June 2018, which addresses this topic.

Veeam and StoreOnce: Wrong FC-HBA driver/ firmware causes Windows BSoD

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

One of my customers bought a very nice new backup solution, which consists of a

  • HPE StoreOnce 5100 with ~ 144 TB usable capacity,
  • and a new HPE ProLiant DL380 Gen10 with Windows Server 2016

as the new backup server. StoreOnce and backup server are connected to the existing network and SAN with 8 Gb Fibre Channel and 10 GbE. Veeam Backup & Replication 9.5 U3a is already in use, as well as VMware vSphere 6.5 Enterprise Plus. The backend storage is an HPE 3PAR 8200.

This setup allows the usage of Catalyst over Fibre Channel together with Veeam storage snapshots, and this is what we intended to use.

I wrote about a similar setup some month ago: Backup from a secondary HPE 3PAR StoreServ array with Veeam Backup & Replication.

The OS on the StoreOnce was up to date (3.16.7), and Windows Server 2016 was installed using HPE Intelligent Provisioning. Afterwards, drivers and firmware were updated using the latest SPP (2018.11). So all drivers and firmware were also up to date.

After doing the zoning and some other configuration tasks, I installed Veeam Backup & Replication 9.5 U3 and configured my Catalyst over Fibre Channel repository. I configured a test backup… and the server failed with a Blue Screen of Death… which has become pretty rare since Server 2008 R2.


I did some tests:

  • backup from 3PAR Storage Snapshots to Catalyst over FC repository – BSoD
  • backup without 3PAR Storage Snapshots to Catalyst over FC repository – BSoD
  • backup from 3PAR Storage Snapshots to Catalyst over LAN repository – works fine
  • backup without 3PAR Storage Snapshots to Catalyst over LAN repository – works fine
  • backup from 3PAR Storage Snapshots to default repository – works fine
  • backup without 3PAR Storage Snapshots to default repository – works fine

So the error had to be caused by the usage of Catalyst over Fibre Channel. I filed a case with HPE, uploaded gigabytes of memory dumps, and heard pretty much nothing during the next week.

HPE StoreOnce Support Matrix FTW!

After a week, I got an email from HPE support with a question about the installed HBA driver and firmware. I told them the version numbers, and a day later I was requested to downgrade (!) drivers and firmware.

The customer has an SN1100Q (P9D93A & P9D94A) HBA in his backup server, and I was requested to downgrade the firmware to version 8.05.61 and the driver to 9.2.5.20. With this firmware and driver version, the backup was running fine (~750 MB/s throughput).
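
To check which HBA driver and firmware a Windows server is actually running, the Fibre Channel WMI classes can be queried. A minimal PowerShell sketch, assuming the HBA driver exposes the standard FC WMI provider:

# Query driver and firmware version of the installed FC HBAs
Get-CimInstance -Namespace root\wmi -ClassName MSFC_FCAdapterHBAAttributes |
    Select-Object Manufacturer, Model, DriverVersion, FirmwareVersion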

I found the HPE StoreOnce Support Matrix on HPE’s SPOCK website. The matrix confirmed the required firmware and driver versions.

Fun fact: None of the listed HBAs (except the Synergy HBAs) is supported with the latest StoreOnce G2 products.

Lessons learned

You should take a look at those support matrices – always! HPE confirmed that the first-level recommendation “Have you tried to update to the latest firmware?” can cause problems like this. The fact that the factory ships the server with the latest firmware does not make this easier.

Out-of-Office replies are dropped due to empty MAIL FROM

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

Today I had an interesting support call. A customer noticed that Out-of-Office replies were not received by recipients, even though the OoO option was enabled for internal and external recipients. Internal recipients got the OoO reply, but none of the external recipients did.


The Message Tracking Log is a good place to start. I quickly discovered that the Exchange server was unable to send the OoO mails. You can use the event ID FAIL to get a list of all failed messages.
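
A minimal sketch for this search in the Exchange Management Shell (the time range is an assumption):

# List all failed messages of the last 24 hours from the message tracking log
Get-MessageTrackingLog -EventId FAIL -Start (Get-Date).AddDays(-1) -ResultSize Unlimited |
    Select-Object Timestamp, Sender, Recipients, MessageSubject, RecipientStatus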

The RecipientStatus of a failed mail was very interesting.

RecipientStatus         : {[{LED=550 Requested action not taken: mailbox unavailable};{MSG=};{FQDN=mailrelay-out.xxxx.de};{IP=213.xxx.xxx.xxx};{LRT=20.12.2018 10:22:39}]}

550 Requested action not taken: mailbox unavailable is a pretty interesting error when sending mails over a mail relay of your ISP, especially when other mails are successfully sent over the same relay.

Next stop: Protocol log of the send connector

I enabled the logging on the send connector using the EAC. This option is disabled by default. Depending on the amount of mail sent over the connector, you should make sure to disable the logging again after your troubleshooting session. To enable the logging, follow these steps (a shell alternative is sketched after the list):

  • Open the EAC and navigate to Mail flow > Send connectors
  • Select the connector you want to configure, and then click Edit
  • On the General tab in the Protocol logging level section, select the Verbose option
  • When you’re finished, click Save
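
The same can be done from the Exchange Management Shell; a short sketch, where the connector name is an assumption:

# Enable verbose protocol logging on the send connector (name is an assumption)
Set-SendConnector -Identity "Internet Send Connector" -ProtocolLoggingLevel Verbose

# Switch it back to None after the troubleshooting session
Set-SendConnector -Identity "Internet Send Connector" -ProtocolLoggingLevel None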

The protocol log can be found under %ExchangeInstallPath%TransportRoles\Logs\Hub\ProtocolLog\SmtpSend.

After enabling the logging and sending another test mail, the log contained the necessary details to find the root cause. This is the interesting part of the SMTP communication:

2018-12-20T10:22:39.313Z,Relay,08D640AAC0AD8811,3,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,220 mailrelay-out.xxx.de ESMTP Postfix (Debian/GNU),
2018-12-20T10:22:39.313Z,Relay,08D640AAC0AD8811,4,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,EHLO mail.domain.local,
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,5,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,250  mailrelay-out2.xxx.de SIZE 52428800 8BITMIME OK,
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,6,192.168.0.212:49986,213.xxx.xxx.xxx:25,*,,sending message with RecordId 22471268892695 and InternetMessageId <b9613be791c141e3b76828228bd6cdb3@exchange.domain.local>
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,7,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,MAIL FROM:<> SIZE=4758,
2018-12-20T10:22:39.331Z,Relay,08D640AAC0AD8811,8,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,550 Requested action not taken: mailbox unavailable,
2018-12-20T10:22:39.332Z,Relay,08D640AAC0AD8811,9,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,QUIT,

The error occurred right after the Exchange server issued MAIL FROM:<>. But why is the MAIL FROM empty?

RFC 2298 is the key

An Out-of-Office reply is a Delivery Status Notification message. And RFC 2298 clearly states:

The envelope sender address (i.e., SMTP MAIL FROM) of the MDN MUST be
null (<>), specifying that no Delivery Status Notification messages
or other messages indicating successful or unsuccessful delivery are
to be sent in response to an MDN.

So the empty MAIL FROM is something that a mail relay should expect. In the case of my customer, the mail relay seems to act differently, maybe because of some kind of spam protection.

Database Availability Group (DAG) witness is in a failed state

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

As part of a maintenance job I had to update a 2-node Exchange Database Availability Group and a file-share witness server.

After the installation of Windows updates on the witness server and the obligatory reboot, the witness was left in a failed state.

[PS] C:\Windows\system32>Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | fl *wit*
WARNING: Database availability group ‘DAG01’ witness is in a failed state. The database
availability group requires the witness server to maintain quorum. Please use the
Set-DatabaseAvailabilityGroup cmdlet to re-create the witness server and the directory.

WitnessServer : fsw.domain.local
WitnessDirectory : C:\DAGFileShareWitnesses\DAG1.domain.local
AlternateWitnessServer :
AlternateWitnessDirectory :
WitnessShareInUse : InvalidConfiguration
DxStoreWitnessServers :

In my opinion, re-creating the witness server and the witness directory cannot be the correct way to solve this; there must be another way. In addition: the server was not dead, it had only been rebooted.

Check the basics

Both DAG nodes were online and working. A good starting point is a check of the cluster resources using PowerShell.

In my case, the cluster resource for the File Share Witness was in a failed state. A simple Start-ClusterResource solved my issue immediately.

[PS] C:\Windows\system32>Get-ClusterResource

Name                                              State                                             OwnerGroup                                        ResourceType
----                                              -----                                             ----------                                        ------------
File Share Witness (\\fsw.domain.local            Failed                                            Cluster Group                                     File Share Witness


[PS] C:\Windows\system32>Get-ClusterResource | Start-ClusterResource

Name                                              State                                             OwnerGroup                                        ResourceType
----                                              -----                                             ----------                                        ------------
File Share Witness (\\fsw.domain.local            Online                                            Cluster Group                                     File Share Witness

It seems that the cluster had marked the file share witness as unreliable, so the resource was not started again after the file share witness was back online. I brought it back online manually by running Start-ClusterResource on one of the DAG members.

“Cannot execute upgrade script on host” during ESXi 6.5 upgrade

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

I was onsite at one of my customers to update a small VMware vSphere 6.0 U3 environment to 6.5 U2c. The environment consists of three hosts: two hosts in a cluster, and a third host that is only used to run an HPE StoreVirtual Failover Manager.

The update of the first host, using the Update Manager and an HPE custom ESXi 6.5 image, was pretty flawless. But the update of the second host failed with “Cannot execute upgrade script on host”.


I checked the host and found it with ESXi 6.5 installed, but one of the five iSCSI datastores was missing. Then I tried to patch the host with the latest patches and hit “Remediate”. The task failed with “Cannot execute upgrade script on host”. So I did a rollback to ESXi 6.0 and tried the update again, this time using iLO and the HPE custom ISO. But the result was the same: the host was running ESXi 6.5 after the update, but the upgrade failed with the “Upgrade Script” error. After this attempt, the host was unable to mount any of the iSCSI datastores. This was because the datastores were mounted ATS-only on the other host, and the failed host was unable to mount the datastores in this mode. Very strange…

I checked the vua.log and found this error message:

2018-11-05T16:35:56.614Z info vua[A3CAB70] [Originator@6876 sub=VUA] Command '/tmp/vuaScript-xMVUfb/precheck.py --ip=172.19.0.14' finished with exit status 1
--> stderr: --------
--> INFO:root:Running esxcfg-info
--> Traceback (most recent call last):
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/subprocess.py", line 385, in run
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/subprocess.py", line 788, in communicate
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/encodings/ascii.py", line 26, in decode
--> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1272423: ordinal not in range(128)

Focus on this part of the error message:

--> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1272423: ordinal not in range(128)

The upgrade script failed due to an illegal character in the output of esxcfg-info. First of all, I had to find out what this 0x80 character was. I checked the UTF-8 and Windows-1252 encodings, and found out that 0x80 is the € (euro) symbol in Windows-1252. I searched the output of esxcfg-info for the € symbol – and found it.

            \==+Heap : 
               |----Name............................................€A
               |----Growable........................................true
               |----Max Size........................................41848 bytes
               |----Max Available...................................40816 bytes
               |----Current Size....................................29560 bytes
               |----Current Size....................................29560 bytes
               |----Current Allocation..............................1032 bytes
               |----Current Available...............................1032 bytes
               |----Current Releasable..............................20400 bytes
               |----Percent Free of Current.........................96 
               |----Percent Free of Max.............................97 
               |----Percent Releasable..............................69

But how to get rid of it? Where does it hide in the ESXi config? I scrolled a bit up and down around the € symbol. A bit above, I found a reference to HPE_SATP_LH. This immediately caught my attention, because the customer is using StoreVirtual VSA and StoreVirtual HW appliances.

Now my second educated guess of the day came into play: I checked the installed VIBs and found the StoreVirtual Multipathing Extension installed on the failed host – but not on the host where the ESXi 6.5 update was successful.

I removed the VIB from the buggy host, did a reboot, and tried to update the host with the latest patches – with success! The cross-check showed that the € symbol was missing in the esxcfg-info output of the host that was upgraded first. I don’t have a clue why the StoreVirtual Multipathing Extension caused this error. The customer and I decided not to install it again.
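
For reference, a hedged PowerCLI sketch of how the VIB check and removal could look; host name, name pattern and VIB name are assumptions, and the host should be in maintenance mode and rebooted afterwards:

# Access the host's esxcli interface through PowerCLI
$esxcli = Get-EsxCli -VMHost 'esx02.domain.local' -V2

# List installed VIBs and look for the StoreVirtual Multipathing Extension (name pattern is an assumption)
$esxcli.software.vib.list.Invoke() | Where-Object { $_.Name -match 'StoreVirtual' }

# Remove the VIB by name (placeholder name, replace with the name from the list above)
$esxcli.software.vib.remove.Invoke(@{ vibname = 'name-of-the-storevirtual-vib' })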

Veeam backups fails because of time differences

This posting is ~5 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

Last week I had an interesting incident at a customer. The customer reported that one of multiple Veeam backup jobs constantly failed.


The backup job included two VMs, and the backup of one of these VMs failed with this error:

Error: Failed to open VDDK disk [[VMDS-SAS-01] VMDC1/VMDC1_1.vmdk] ( is read-only mode - [true] ) 
Failed to open virtual disk Logon attempt with parameters [VC/ESX: [vcenter.domain.tld];Port: 443;Login: 
[AD\Administrator];VMX Spec: [moref=vm-59];Snapshot mor: [snapshot-20226];Transports: [san];Read Only: [true]]
failed because of the following errors: Failed to open virtual disk Logon attempt with parameters 
VC/ESX: [vcenter.domain.tld];Port: 443;Login: [AD\Administrator

We verified the credentials used for that job, but re-entering the password did not solve the issue. I then checked the Veeam backup logs located under %ProgramData%\Veeam\Backup (look for the Agent.Job_Name.Source.VM_Name.vmdk.log) and found VDDK error 3014:

Insufficient permissions in the host operating system

The user that was used to connect to the vCenter was an Active Directory account. The account was granted administrator privileges at the root of the vCenter. Switching from the AD account to administrator@vsphere.local solved the issue. Next stop: the vmware-sts-idmd.log on the vCenter Server Appliance. The errors found in this log confirmed my theory that there was an issue with the authentication itself, not with the AD account.

[2018-07-04T11:59:49.848+02:00 vsphere.local        142f5216-8316-4752-b02c-e02be4154816 INFO ] [VmEventAppender] EventLog: source=[VMware Identity Server], tenant=[vsphere.local], eventid=[USER_NAME_PWD_AUTH_FAILED], level=[ERROR], category=[VMEVENT_CATEGORY_IDM], text=[Failed to authenticate principal [AD\Administrator]. Native platform error [code: 851968][null][null]], detailText=[com.vmware.identity.interop.idm.IdmNativeException: Native platform error [code: 851968][null][null]
[2018-07-04T11:59:49.848+02:00 vsphere.local        142f5216-8316-4752-b02c-e02be4154816 ERROR] [IdentityManager] Failed to authenticate principal [AD\Administrator]. Native platform error [code: 851968][null][null]
com.vmware.identity.interop.idm.IdmNativeException: Native platform error [code: 851968][null][null]

[2018-07-04T12:10:41.603+02:00 vsphere.local        64051ea1-0d7f-453d-8e34-92f0c8c37e77 INFO ] [IdentityManager] Authentication succeeded for user [AD\Administrator] in tenant [vsphere.local] in [37] milliseconds with provider [ad.domain.tld] of type [com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider]

To make a long story short: time differences. The vCenter, the ESXi hosts, and some servers had the wrong time. vCenter and ESXi hosts were using the domain controllers as their time source.

This is the ntpq output of the vCenter. You might notice the offset and jitter values on the right side, both noted in milliseconds.

vcenter:/storage/log/vmware/sso # ntpq
ntpq> peer
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
vmdc2.ad.       192.168.16.11    2 u   53   64  363    0.532  207.553 152007.
vmdc1.ad.       .LOCL.           1 u    2   64  377    0.257  204.559 161964.

After some investigation, the root cause seemed to be a bad DCF77 receiver, which was connected to the domain controller that was hosting the PDC Emulator role. The DCF77 receiver was connected using a USB-to-LAN converter. Instead of using a DCF77 receiver, the customer and I implemented an NTP hierarchy using a valid NTP source on the internet (pool.ntp.org).
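
To spot such time drift quickly, the ESXi host clocks can be compared against a known-good clock via PowerCLI. A minimal sketch, assuming an existing vCenter connection and that the machine running the script has the correct time:

# Compare each ESXi host clock with the local clock (both in UTC)
Get-VMHost | ForEach-Object {
    $dts = Get-View -Id $_.ExtensionData.ConfigManager.DateTimeSystem
    $hostTime = $dts.QueryDateTime()
    [PSCustomObject]@{
        Host          = $_.Name
        HostTimeUtc   = $hostTime
        OffsetSeconds = [math]::Round(($hostTime - (Get-Date).ToUniversalTime()).TotalSeconds, 1)
    }
}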

Demystifying “Interfaces on which heartbeats are not seen”

This posting is ~6 years old. Keep this in mind: IT is a fast-moving business, and this information might be outdated.

By accident, I found a heartbeat/VLAN issue on a NetScaler cluster at one of my customers. The NetScaler ADC appliances have three interfaces connected to a switch stack. Two of the three interfaces were configured as a channel (LAG). This is a snippet from the config:

set channel LA/1 -tagall ON -throughput 0 -lrMinThroughput 0 -bandwidthHigh 0 -bandwidthNormal 0
...
bind vlan 10 -ifnum 1/3
bind vlan 10 -ifnum LA/1 -tagged
bind vlan 54 -ifnum LA/1 -tagged
bind vlan 55 -ifnum LA/1 -tagged

On the switch stack, the port to which interface 1/3 is connected is configured as an access port. The ports to which the channel is connected are configured as a trunk with some permitted VLANs. The customer is using HPE Comware-based switches; the terminology is the same for Cisco. If you use HPE ProVision or Alcatel-Lucent Enterprise, translate “access” to “untagged” and “trunk” to “tagged”. Because the channel is configured as a trunk port on the switch, the tagall option was set on the NetScaler.

Issue

While examining the output of show ha node I saw this:

Interfaces on which heartbeats are not seen : LA/1

Because interface 1/3 was not affected, this had to be a VLAN issue. During the initial troubleshooting, I was able to discover heartbeat packets in VLAN 1 and in VLAN 10.

Solution

The solution was easy: Remove the tagged option for VLAN 10 on LA/1.

bind vlan 10 -ifnum LA/1

instead of

bind vlan 10 -ifnum LA/1 -tagged

Because of the configured tagall option, all packets sourced by LA/1 are tagged with the corresponding VLAN ID. But because the channel is now explicitly bound to VLAN 10 without a tag, VLAN 10 is now also the native VLAN for LA/1.

> show channel

1)      Interface LA/1 (802.3ad Link Aggregate) #14
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
        MTU=1500, native vlan=10, MAC=02:e0:ed:38:9d:d2, uptime 1362h58m51s

Now the NetScaler was sending heartbeat packets with a tag for VLAN 10, and the issue was solved.

Explanation

Heartbeat packets are always sent without a VLAN tag (untagged). There are two exceptions:

  • The NSVLAN is configured with a specific VLAN ID, or
  • an interface used for heartbeats is configured with the tagall option.

In these cases, the heartbeat packets are tagged with the native VLAN ID of the interface. The show channel output showed that the channel was using VLAN 1 as the native VLAN.

> show channel

1)      Interface LA/1 (802.3ad Link Aggregate) #14
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
        MTU=1500, native vlan=1, MAC=02:e0:ed:38:9d:d2, uptime 1362h55m13s

How does the NetScaler determine the native VLAN for an interface? The native VLAN is the VLAN to which an interface is bound untagged. An interface can be bound untagged to only a single VLAN, but it can be bound tagged to multiple VLANs.

If you take a look at the config snippet at the top of this blog post, you might notice that interface 1/3 is bound untagged to VLAN 10, so this is the native VLAN for interface 1/3. But this interface is not using the tagall option, therefore its heartbeat packets are not tagged. The channel LA/1 is bound tagged to VLAN 10, but it was also bound to VLAN 1 without the tagged option. This caused VLAN 1 to be used as the native VLAN for channel LA/1. And because LA/1 is configured with the tagall option, the heartbeats were tagged for VLAN 1. That’s why I was able to see the heartbeats sent over channel LA/1 in VLAN 1.

In the end, the NetScaler appliances were sending heartbeats from interface 1/3 to VLAN 10, and from channel LA/1 to VLAN 1. This caused the message “Interfaces on which heartbeats are not seen: LA/1”.