Tag Archives: troubleshooting

Out-of-Office replies are dropped due to empty MAIL FROM

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Today I had an interesting support call. A customer noticed that Out-of-Office replies were not received by external recipients, even though the OoO option was enabled for internal and external recipients. Internal recipients got the OoO reply, but external recipients did not.


The Message Tracking Log is a good starting point. I quickly discovered that the Exchange server was unable to send the OoO mails. You can use the EventId FAIL to get a list of all failed messages.
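
A query like this in the Exchange Management Shell should list the failed messages together with their RecipientStatus (the time range is only an example):

# List failed messages of the last 24 hours including their recipient status
Get-MessageTrackingLog -EventId FAIL -Start (Get-Date).AddHours(-24) -ResultSize Unlimited | Select-Object Timestamp,Sender,Recipients,RecipientStatus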

The RecipientStatus of a failed mail was very interesting.

RecipientStatus         : {[{LED=550 Requested action not taken: mailbox unavailable};{MSG=};{FQDN=mailrelay-out.xxxx.de};{IP=213.xxx.xxx.xxx};{LRT=20.12.2018 10:22:39}]}

550 Requested action not taken: mailbox unavailable is a pretty interesting error when sending mails over the mail relay of your ISP, especially when other mails were successfully sent over the same relay.

Next stop: Protocol log of the send connector

I enabled the logging on the send connector using the EAC. This option is disabled by default. Depending on the amount of mail sent over the connector, you should make sure to disable the logging after your troubleshooting session. To enable the logging, follow these steps (a PowerShell alternative is shown after the list):

  • Open the EAC and navigate to Mail flow > Send connectors
  • Select the connector you want to configure, and then click Edit
  • On the General tab in the Protocol logging level section, select the Verbose option
  • When you’re finished, click Save
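
Alternatively, the same change can be made in the Exchange Management Shell; the connector name below is a placeholder:

# Enable verbose protocol logging on the send connector; remember to set it back to None afterwards
Set-SendConnector -Identity "Internet Send Connector" -ProtocolLoggingLevel Verbose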

The protocol log can be found under %ExchangeInstallPath%TransportRoles\Logs\Hub\ProtocolLog\SmtpSend.

After enabling the logging and another test mail, the log contained the necessary details to find the root cause. This is the interesting part of the SMTP communication:

2018-12-20T10:22:39.313Z,Relay,08D640AAC0AD8811,3,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,220 mailrelay-out.xxx.de ESMTP Postfix (Debian/GNU),
2018-12-20T10:22:39.313Z,Relay,08D640AAC0AD8811,4,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,EHLO mail.domain.local,
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,5,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,250  mailrelay-out2.xxx.de SIZE 52428800 8BITMIME OK,
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,6,192.168.0.212:49986,213.xxx.xxx.xxx:25,*,,sending message with RecordId 22471268892695 and InternetMessageId <b9613be791c141e3b76828228bd6cdb3@exchange.domain.local>
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,7,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,MAIL FROM:<> SIZE=4758,
2018-12-20T10:22:39.331Z,Relay,08D640AAC0AD8811,8,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,550 Requested action not taken: mailbox unavailable,
2018-12-20T10:22:39.332Z,Relay,08D640AAC0AD8811,9,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,QUIT,

The error occurred right after the Exchange server issued MAIL FROM:<>. But why is the MAIL FROM empty?

RFC 2298 is the key

An Out-of-Office reply is a Delivery Status Notification message. And RFC 2298 clearly states:

The envelope sender address (i.e., SMTP MAIL FROM) of the MDN MUST be
null (<>), specifying that no Delivery Status Notification messages
or other messages indicating successful or unsuccessful delivery are
to be sent in response to an MDN.

So an empty MAIL FROM is something that a mail relay should expect. In the case of my customer, the mail relay seems to act differently, maybe because of some kind of spam protection.

Database Availability Group (DAG) witness is in a failed state

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

As part of a maintenance job I had to update a 2-node Exchange Database Availability Group and a file-share witness server.

After the installation of Windows updates on the witness server and the obligatory reboot, the witness was left in a failed state.

[PS] C:\Windows\system32>Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | fl *wit*
WARNING: Database availability group ‘DAG01’ witness is in a failed state. The database
availability group requires the witness server to maintain quorum. Please use the
Set-DatabaseAvailabilityGroup cmdlet to re-create the witness server and the directory.

WitnessServer : fsw.domain.local
WitnessDirectory : C:\DAGFileShareWitnesses\DAG1.domain.local
AlternateWitnessServer :
AlternateWitnessDirectory :
WitnessShareInUse : InvalidConfiguration
DxStoreWitnessServers :

In my opinion, re-creating the witness server and the witness directory cannot be the correct way to solve this; there must be another way. In addition: the server was not dead, it had only been rebooted.

Check the basics

Both DAG nodes were online and working. A good starting point is a check of the cluster resources using PowerShell.

In my case the cluster resource for the File Share Witness was in a failed state. A simple Start-ClusterResource  solved my issue immediately.

[PS] C:\Windows\system32>Get-ClusterResource

Name                                              State                                             OwnerGroup                                        ResourceType
----                                              -----                                             ----------                                        ------------
File Share Witness (\\fsw.domain.local            Failed                                            Cluster Group                                     File Share Witness


[PS] C:\Windows\system32>Get-ClusterResource | Start-ClusterResource

Name                                              State                                             OwnerGroup                                        ResourceType
----                                              -----                                             ----------                                        ------------
File Share Witness (\\fsw.domain.local            Online                                            Cluster Group                                     File Share Witness

It seems that the cluster had marked the file share witness as unreliable, so the resource was not started after the file share witness was back online. I brought it back online manually by running Start-ClusterResource on one of the DAG members.
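
If you want to limit this to resources that are actually in a failed state, a slightly more targeted sketch:

# Start only the cluster resources that are currently failed
Get-ClusterResource | Where-Object { $_.State -eq 'Failed' } | Start-ClusterResource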

“Cannot execute upgrade script on host” during ESXi 6.5 upgrade

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

I was onsite at one of my customers to update a small VMware vSphere 6.0 U3 environment to 6.5 U2c. The environment consists of three hosts: two hosts in a cluster, and a third host that is only used to run an HPE StoreVirtual Failover Manager.

The update of the first host, using the Update Manager and an HPE custom ESXi 6.5 image, was pretty flawless. But the update of the second host failed with “Cannot execute upgrade script on host”.


I checked the host and found it with ESXi 6.5 installed, but one of the five iSCSI datastores was missing. Then I tried to patch the host with the latest patches and hit “Remediate”. The task failed with “Cannot execute upgrade script on host”. So I did a rollback to ESXi 6.0 and tried the update again, this time using iLO and the HPE custom ISO. But the result was the same: the host was running ESXi 6.5 after the update, but the upgrade failed with the “Upgrade Script” error. After this attempt, the host was unable to mount any of the iSCSI datastores. This was because the datastores were mounted ATS-only on the other host, and the failed host was unable to mount the datastores in this mode. Very strange…

I checked the vua.log and found this error message:

2018-11-05T16:35:56.614Z info vua[A3CAB70] [Originator@6876 sub=VUA] Command '/tmp/vuaScript-xMVUfb/precheck.py --ip=172.19.0.14' finished with exit status 1
--> stderr: --------
--> INFO:root:Running esxcfg-info
--> Traceback (most recent call last):
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/subprocess.py", line 385, in run
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/subprocess.py", line 788, in communicate
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/encodings/ascii.py", line 26, in decode
--> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1272423: ordinal not in range(128)

Focus on this part of the error message:

--> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1272423: ordinal not in range(128)

The upgrade script failed due to an illegal character in the output of esxcfg-info. First of all, I had to find out what this 0x80 character is. I checked the UTF-8 and Windows-1252 encodings and found out that 0x80 is the € (Euro) symbol in Windows-1252. I searched the output of esxcfg-info for the € symbol – and found it.

            \==+Heap : 
               |----Name............................................€A
               |----Growable........................................true
               |----Max Size........................................41848 bytes
               |----Max Available...................................40816 bytes
               |----Current Size....................................29560 bytes
               |----Current Size....................................29560 bytes
               |----Current Allocation..............................1032 bytes
               |----Current Available...............................1032 bytes
               |----Current Releasable..............................20400 bytes
               |----Percent Free of Current.........................96 
               |----Percent Free of Max.............................97 
               |----Percent Releasable..............................69

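If you want to locate such a byte yourself, one way is to dump the esxcfg-info output to a file and search it for the offending byte; a minimal sketch, assuming a Python interpreter is available in the ESXi shell:

# Dump the output and print the position of the first 0x80 byte plus some surrounding context
esxcfg-info > /tmp/esxcfg-info.out
python -c "d = open('/tmp/esxcfg-info.out','rb').read(); p = d.find(b'\x80'); print(p); print(d[max(0,p-200):p+50])"
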
But how to get rid of it? Where does it hide in the ESXi config? I scrolled a bit up and down around the € symbol. A bit above, I found a reference to HPE_SATP_LH. This immediately caught my attention, because the customer is using StoreVirtual VSA and StoreVirtual HW appliances.

Now my second educated guess of the day came into play. I checked the installed VIBs and found the StoreVirtual Multipathing Extension installed on the failed host – but not on the host where the ESXi 6.5 update was successful.

I removed the VIB from the buggy host, did a reboot, and tried to update the host with the latest patches – with success! Cross-checking showed that the € symbol was missing in the esxcfg-info output of the host that was upgraded first. I don’t have a clue why the StoreVirtual Multipathing Extension caused this error. The customer and I decided not to install the StoreVirtual Multipathing Extension again.
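
For reference, listing and removing a VIB can be done from the ESXi shell; a minimal sketch – the exact VIB name differs, so check the output of the list command first:

# List installed VIBs and look for the StoreVirtual Multipathing Extension
esxcli software vib list | grep -i storevirtual
# Remove the VIB (the name is a placeholder, use the name from the list output) and reboot afterwards
esxcli software vib remove -n <name-of-the-mem-vib>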

Veeam backups fails because of time differences

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Last week I had an interesting incident at a customer. The customer reported that one of multiple Veeam backup jobs constantly failed.


The backup job included two VMs, and the backup of one of these VMs failed with this error:

Error: Failed to open VDDK disk [[VMDS-SAS-01] VMDC1/VMDC1_1.vmdk] ( is read-only mode - [true] ) 
Failed to open virtual disk Logon attempt with parameters [VC/ESX: [vcenter.domain.tld];Port: 443;Login: 
[AD\Administrator];VMX Spec: [moref=vm-59];Snapshot mor: [snapshot-20226];Transports: [san];Read Only: [true]]
failed because of the following errors: Failed to open virtual disk Logon attempt with parameters 
VC/ESX: [vcenter.domain.tld];Port: 443;Login: [AD\Administrator

We verified the credentials used for that job, but re-entering the password did not solve the issue. I then checked the Veeam backup logs located under %ProgramData%\Veeam\Backup (look for the Agent.Job_Name.Source.VM_Name.vmdk.log) and found VDDK error 3014:

Insufficient permissions in the host operating system

The user that was used to connect to the vCenter was an Active Directory account. The account was granted administrator privileges at the root of the vCenter. Switching from the AD account to Administrator@vsphere.local solved the issue. Next stop: vmware-sts-idmd.log on the vCenter Server Appliance. The errors found in this log confirmed my theory that there was an issue with the authentication itself, not with the AD account.

[2018-07-04T11:59:49.848+02:00 vsphere.local        142f5216-8316-4752-b02c-e02be4154816 INFO ] [VmEventAppender] EventLog: source=[VMware Identity Server], tenant=[vsphere.local], eventid=[USER_NAME_PWD_AUTH_FAILED], level=[ERROR], category=[VMEVENT_CATEGORY_IDM], text=[Failed to authenticate principal [AD\Administrator]. Native platform error [code: 851968][null][null]], detailText=[com.vmware.identity.interop.idm.IdmNativeException: Native platform error [code: 851968][null][null]
[2018-07-04T11:59:49.848+02:00 vsphere.local        142f5216-8316-4752-b02c-e02be4154816 ERROR] [IdentityManager] Failed to authenticate principal [AD\Administrator]. Native platform error [code: 851968][null][null]
com.vmware.identity.interop.idm.IdmNativeException: Native platform error [code: 851968][null][null]

[2018-07-04T12:10:41.603+02:00 vsphere.local        64051ea1-0d7f-453d-8e34-92f0c8c37e77 INFO ] [IdentityManager] Authentication succeeded for user [AD\Administrator] in tenant [vsphere.local] in [37] milliseconds with provider [ad.domain.tld] of type [com.vmware.identity.idm.server.provider.activedirectory.ActiveDirectoryProvider]

To make a long story short: Time differences. The vCenter, the ESXi hosts and some servers had the wrong time. vCenter and ESXi hosts were using the Domain Controllers as time source.

This is the ntpq output of the vCenter. Note the jitter values on the right side, both given in milliseconds.

vcenter:/storage/log/vmware/sso # ntpq
ntpq> peer
     remote           refid      st t when poll reach   delay   offset  jitter
==============================================================================
vmdc2.ad.       192.168.16.11    2 u   53   64  363    0.532  207.553 152007.
vmdc1.ad.       .LOCL.           1 u    2   64  377    0.257  204.559 161964.

After some investigation, the root cause seemed to be a bad DCF77 receiver, which was connected to the domain controller holding the PDC Emulator role. The DCF77 receiver was connected using a USB-to-LAN converter. Instead of using a DCF77 receiver, the customer and I implemented an NTP hierarchy using a valid NTP source on the internet (pool.ntp.org).
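
On the Windows side, the PDC Emulator can be pointed at external NTP servers with w32tm; a minimal sketch (the server list is only an example):

REM Run on the domain controller holding the PDC Emulator role
w32tm /config /manualpeerlist:"0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org" /syncfromflags:manual /reliable:yes /update
w32tm /resync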

Demystifying “Interfaces on which heartbeats are not seen”

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

By accident, I found a heartbeat/ VLAN issue on a NetScaler cluster at one of my customers. The NetScaler ADC appliances have three interfaces connected to a switch stack. Two of the three interfaces were configured as a channel (LAG). This is a snippet from the config:

set channel LA/1 -tagall ON -throughput 0 -lrMinThroughput 0 -bandwidthHigh 0 -bandwidthNormal 0
...
bind vlan 10 -ifnum 1/3
bind vlan 10 -ifnum LA/1 -tagged
bind vlan 54 -ifnum LA/1 -tagged
bind vlan 55 -ifnum LA/1 -tagged

On the switch stack, the port to which interface 1/3 is connected is configured as an access port. The ports to which the channel is connected are configured as a trunk port with some permitted VLANs. The customer is using HPE Comware based switches; the terminology is the same for Cisco. If you use HPE ProVision or Alcatel-Lucent Enterprise, translate “access” to “untagged” and “trunk” to “tagged”. Because the channel is configured as a trunk port on the switch, the tagall option was set.

Issue

While examining the output of  show ha node I saw this:

Interfaces on which heartbeats are not seen : LA/1

Because interface 1/3 was not affected, this had to be a VLAN issue. During the initial troubleshooting, I was able to discover heartbeat packets in VLAN 1 and in VLAN 10.

Solution

The solution was easy: Remove the tagged option for VLAN 10 on LA/1.

bind vlan 10 -ifnum LA/1

instead of

bind vlan 10 -ifnum LA/1 -tagged
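
To apply this on a running system, the existing tagged binding has to be removed before the channel can be bound untagged; a short sketch (double-check the syntax against your firmware’s CLI reference):

unbind vlan 10 -ifnum LA/1
bind vlan 10 -ifnum LA/1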

Because of the configured tagall option, all packets sourced by LA/1 are tagged with the corresponding VLAN ID. And because the channel is now explicitly bound to VLAN 10 without a tag, VLAN 10 is now also the native VLAN for LA/1.

> show channel

1)      Interface LA/1 (802.3ad Link Aggregate) #14
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
        MTU=1500, native vlan=10, MAC=02:e0:ed:38:9d:d2, uptime 1362h58m51s

Now the NetScaler was sending heartbeat packets with a tag for VLAN 10, and the issue was solved.

Explanation

Heartbeat packets are always sent without a VLAN tag (untagged). There are two exceptions:

  • The NSVLAN is configured with a specific VLAN ID, or
  • an interface used for heartbeats is configured with the tagall option

In these cases, the heartbeat packets are tagged with the native VLAN ID of the interface. The output of show channel showed that the channel was using VLAN 1 as its native VLAN.

> show channel

1)      Interface LA/1 (802.3ad Link Aggregate) #14
        flags=0x4100c020 <ENABLED, UP, AGGREGATE, UP, HAMON, HEARTBEAT, 802.1q, tagall>
        MTU=1500, native vlan=1, MAC=02:e0:ed:38:9d:d2, uptime 1362h55m13s

How does the NetScaler determine the native VLAN for an interface? The native VLAN is the VLAN to which an interface is bound untagged. An interface can only be bound untagged to a single VLAN, but it can be bound tagged to multiple VLANs.

If you take a look at the config snippet at the top of this blog post, you might notice that interface 1/3 is bound untagged to VLAN 10. So this is the native VLAN for interface 1/3. But this interface is not using the tagall option, therefore heartbeat packets are not tagged. The channel LA/1 is bound tagged to VLAN 10. But it was also bound to VLAN 1, without the tagged option. This caused VLAN 1 to be used as the native VLAN for channel LA/1. And because LA/1 is configured with the tagall option, the heartbeats were tagged with a tag for VLAN 1. That’s why I was able to see the heartbeats that were sent over channel LA/1 in VLAN 1.

In the end, the NetScaler appliances were sending heartbeats from interface 1/3 to VLAN 10, and from channel LA/1 to VLAN 1. This caused the message “Interfaces on which heartbeats are not seen: LA/1”.

Unsupported hardware family ‘vmx-06’

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

A customer of mine got an appliance from a software vendor. The appliance was delivered as a ZIP file with a VMDK, an MF, and an OVF file. Unfortunately, the appliance was created with VMware Workstation 6.0 with virtual machine hardware version 6, which is incompatible with VMware ESXi (see Virtual machine hardware versions). During deployment, my customer got this error:

unsupported hardware family 'vmx-06'

The OVF file includes a line with the VM hardware version.

<vssd:VirtualSystemType>vmx-06</vssd:VirtualSystemType>

If you change this line from vmx-06 to vmx-07, the hash of the OVF changes, and you will get an error during the deployment of the appliance because of the wrong file hash.

Solution

You have to update the SHA256 hash of the OVF that is stored in the MF file.

SHA256(appliance-d9-64-bit.ovf)= 46b84a48a03a8b183ff88168ad43e56bd95ffaa78d0a5a8b2b8cf87e0d45f36a
SHA256(appliance-d9-64-bit-file1.iso)= ec78bc48b48d676775b60eda41528ec33c151c2ce7414a12b13d9b73d34de544
SHA256(appliance-d9-64-bit-disk1.vmdk)= 6b4c1bea5706ce554630b5b5407ac31434e4b41a81930e8cc46f36511085fcd9

To create the new SHA256 hash, you can use the PowerShell cmdlet Get-FileHash .

PS C:\Users\p.terlisten\Downloads> Get-FileHash -Algorithm SHA256 .\appliance-d9-64-bit.ovf

Algorithm       Hash                                                                   Path
---------       ----                                                                   ----
SHA256          46B84A48A03A8B183FF88168AD43E56BD95FFAA78D0A5A8B2B8CF87E0D45F36A       C:\Users\p.terlisten\Download...


PS C:\Users\p.terlisten\Downloads> Get-FileHash -Algorithm SHA256 .\endos-d9-64-bit.ovf

Algorithm       Hash                                                                   Path
---------       ----                                                                   ----
SHA256          C14954237907AB45F75C669B5AD2B0A8159096D8526064AC5646F71066DE5C94       C:\Users\p.terlisten\Download...

Replace the hash and save the MF file. Then re-deploy the appliance.
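
If you want to script both steps – bumping the hardware version and fixing the manifest – here is a small PowerShell sketch, using the OVF name from the example above (the MF file name is assumed; adjust both to your appliance):

# Bump the virtual hardware version in the OVF
(Get-Content .\appliance-d9-64-bit.ovf) -replace 'vmx-06','vmx-07' | Set-Content .\appliance-d9-64-bit.ovf

# Recalculate the SHA256 hash and update the matching line in the MF file
$hash = (Get-FileHash -Algorithm SHA256 .\appliance-d9-64-bit.ovf).Hash.ToLower()
(Get-Content .\appliance-d9-64-bit.mf) -replace '^SHA256\(appliance-d9-64-bit\.ovf\)=.*',"SHA256(appliance-d9-64-bit.ovf)= $hash" | Set-Content .\appliance-d9-64-bit.mf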

Andreas Lesslhumer wrote a similar blog post in 2015:
“Unsupported hardware family vmx-10” during OVF import

Exchange DAG member dies during snapshot creation

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Yesterday, a customer called me and told me about a scary observation on one of his Exchange 2016 DAG (Database Availability Group) nodes.

In preparation for a security check, my customer created a snapshot of an Exchange 2016 DAG node. This node is part of a two-node Windows Server 2012 R2/ Exchange 2016 CU7 cluster.

That something went wrong was instantly clear after the first alarm messages were received. My customer opened a console window and saw that the VM was booting.

What went wrong?

Nothing. Everything worked as designed, except that the observed behaviour was not intended.

That a snapshot was created was clearly visible in the logs. Interesting was the amount of time the snapshot creation took: 5 minutes from the start of the snapshot creation until the task finished. During this time, a lot of data was written to the disks.

VMware vSphere Throughput Snapshot Creation

The server eventlog contained an entry that pointed me in the right direction.

Event Type: Information
Event ID: 1001
Source: BugCheck
Description: The computer has rebooted from a bugcheck. The bugcheck was 0x0000009E (0xffffe0001eccf900, 0x000000000000003c, 0x000000000000000a, 0x0000000000000000).

The Ask the Core Team blog wrote a nice post about this STOP error. In short: the failover clustering service incorporates a detection mechanism that may detect unresponsive user-mode processes. If an unresponsive user-mode process is detected, a HangRecoveryAction is called. Since Windows Server 2008, a STOP error (bugcheck) is caused on the cluster node.

Most likely hypothesis

My explanation of the observed behaviour is that my customer accidentally created a snapshot that included the VM memory. Because the Exchange server has 32 GB memory, the snapshot creation took some time and the VM became unresponsive. When the VM was responding again, the HangRecoveryAction did its dirty job.

Check that the checkbox for the VM memory is cleared before you create a snapshot. Otherwise the bugcheck will do its job. Please note that you might see this behaviour in all Microsoft Windows failover clusters, not only with Microsoft Exchange.
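
If snapshots are created via PowerCLI, e.g. as part of a scripted change, you can make sure the memory is not included; a minimal sketch (VM and snapshot names are placeholders):

# Create a snapshot without the VM memory to avoid a long stun of the guest
New-Snapshot -VM "EXCH01" -Name "Pre-SecurityCheck" -Memory:$false -Quiesce:$false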

Exchange receive connector rejects incoming connections

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

As part of a bigger Microsoft Exchange migration, one of my customers moved the in- and outbound mailflow to a newly installed mail relay cluster. We modified MX records to move the mailflow to the new mail relay, because the customer also switched the ISP. While changing the MX records for ~40 domains, with more and more mails being received through the new mail relay cluster, we noticed events from MSExchangeTransport (event ID 1021):

Receive connector Default Frontend EXCHANGE rejected an incoming connection
from IP address 192.168.xxx.xxx. The maximum number of connections per source
(20) for this connector has been reached by this source IP address.

192.168.xxx.xxx is the mail relay cluster, which is used for the in- and outbound mailflow.

This event indicates that the remote server has reached the maximum number of simultaneous incoming connections to the receive connector. This value is specified by the MaxInboundConnectionPerSource  parameter, and the default value is 20. You can easily increase the value using the Set-ReceiveConnector  cmdlet.

Set-ReceiveConnector -Identity "Default Frontend EXCHANGE" -MaxInboundConnectionPerSource 100
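
To verify the configured value before and after the change, a quick check with Get-ReceiveConnector (connector name as used above):

# Show the currently configured connection limit per source
Get-ReceiveConnector "Default Frontend EXCHANGE" | Format-List Name,MaxInboundConnectionPerSource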

Microsoft has decreased this value over time. It was 100 in Exchange 2007, but 20 since Exchange 2010.

Roaming of AppData-Local breaks Windows 10 Start Menu

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

One of my customers has started a project to create a Windows 10 Enterprise (LTSB 2016) master for their VMware Horizon View environment. Beside the fact (okay, it is more a personal feeling) that Windows 10 is a real PITA for VDI, I noticed an interesting issue during tests.

The issue

For convenience, I adopted some settings of the current Persona Management GPO for Windows 7 for the new Windows 10 environment. During the tests, the customer and I noticed a strange behaviour: after login, the Start Menu wouldn’t open. The only solution was to log off and delete the persona folder (most folders are redirected using native folder redirection, not the redirection feature of the View Persona Management). While debugging this issue, I found this error in the event log.

Faulting application name: ShellExperienceHost.exe, version: 10.0.14393.477, time stamp: 0x5819bf85
Faulting module name: Windows.UI.Xaml.dll, version: 10.0.14393.477, time stamp: 0x58ba5c3d
Exception code: 0xc000027b
Fault offset: 0x00000000006d611b
Faulting process id: 0x1548
Faulting application start time: 0x01d1ab8009bce144
Faulting application path: C:\Windows\SystemApps\ShellExperienceHost_cw5n1h2txyewy\ShellExperienceHost.exe
Faulting module path: C:\Windows\System32\Windows.UI.Xaml.dll
Report Id: f38fda11-bd15-46d6-bf4a-d26a348218d5
Faulting package full name: Microsoft.Windows.ShellExperienceHost_10.0.14393.953_neutral_neutral_cw5n1h2txyewy
Faulting package-relative application ID: App

If you google this, you will find many, many threads about it. Most solutions state that you have to delete the profile due to wrong permissions on profile folders and/or registry hives. I used Microsoft’s Procmon to verify this, but I was unable to confirm it. After further investigation, I found hints that the TileDataLayer database could be the problem. The database is located in AppData\Local\TileDataLayer\Database and stores the installed apps, programs, and tiles for the Start Menu. AFAIK it also includes the Start Menu layout.

Windows 10 Tile Data Layer Files

The database is part of the local part of the profile. A quick check proved that it’s sufficient to delete the TileDataLayer folder. It will be recreated during the next logon.
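
A small PowerShell sketch of such a cleanup, e.g. as a logoff script (path taken from above; test carefully before rolling it out):

# Remove the per-user TileDataLayer database; it is recreated at the next logon
Remove-Item -Path "$env:LOCALAPPDATA\TileDataLayer" -Recurse -Force -ErrorAction SilentlyContinue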

The solution

It’s simple: don’t roam AppData\Local. It should not be necessary to roam the local part of a user’s profile. The View Persona Management offers an option to roam the local part of the profile. You can configure this behaviour with a GPO setting.

Horizon View Persona Management Roaming GPO Settings

You can find this setting under Computer Configuration > Administrative Templates > VMware View Agent Configuration > Persona Management > Roaming & Synchronization

I was able to reproduce the observed behavior in my lab with Windows 10 Enterprise (LTSB 2016) and Horizon View 7.0.3. Because of this, I tend to recommend not to roam AppData\Local.

Checking the 3PAR Quorum Witness appliance

This posting is ~5 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Two 3PAR StoreServs running in a Peer Persistence setup lost the connection to the Quorum Witness appliance. The appliance is an important part of a 3PAR Peer Persistence setup, because it acts as a tie-breaker in a split-brain scenario.

While analyzing this issue, I saw this message in the 3PAR Management Console:

3PAR Quorum Witness Status

In addition to that, the customer got e-mails saying that the 3PAR StoreServ arrays had lost the connection to the Quorum Witness appliance. In my case, the CouchDB process had died. A restart of the appliance brought it back online.

How to check the Quorum Witness appliance?

You can check the status of the appliance with a simple web request. The documentation shows a simple test based on curl. You can run this directly from the Bash shell of the appliance.

# curl http://10.0.0.99:8080
{"couchdb":"Welcome","version":"1.0.4"}
#

But you can also use the PowerShell cmdlet Invoke-WebRequest.

PS C:\Users\patrick> Invoke-WebRequest -Uri http://10.0.0.99:8080


StatusCode        : 200
StatusDescription : OK
Content           : {"couchdb":"Welcome","version":"1.0.4"}

RawContent        : HTTP/1.1 200 OK
                    Content-Length: 40
                    Cache-Control: must-revalidate
                    Content-Type: text/plain;charset=utf-8
                    Date: Mon, 30 Jan 2017 08:31:37 GMT
                    Server: CouchDB/1.0.4 (Erlang OTP/R14B04)

                    {"couchdb...
Forms             : {}
Headers           : {[Content-Length, 40], [Cache-Control, must-revalidate], [Content-Type, text/plain;charset=utf-8],
                    [Date, Mon, 30 Jan 2017 08:31:37 GMT]...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 40

If you add /witness to the URL, you can test the access to the database, which is used for Peer Persistence.

PS C:\Users\patrick> Invoke-WebRequest -Uri http://10.0.0.99:8080/witness


StatusCode        : 200
StatusDescription : OK
Content           : {"db_name":"witness","doc_count":5,"doc_del_count":4,"update_seq":149557915,"purge_seq":0,"compact_
                    running":false,"disk_size":48988254,"instance_start_time":"1485763322826940","disk_format_version":
                    5,...
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 234
                    Cache-Control: must-revalidate
                    Content-Type: text/plain;charset=utf-8
                    Date: Mon, 30 Jan 2017 08:36:38 GMT
                    Server: CouchDB/1.0.4 (Erlang OTP/R14B04)

                    {"db_nam...
Forms             : {}
Headers           : {[Content-Length, 234], [Cache-Control, must-revalidate], [Content-Type,
                    text/plain;charset=utf-8], [Date, Mon, 30 Jan 2017 08:36:38 GMT]...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 234

If you get a connection error, check if the beam process is running.

# netstat -tulpen | grep 8080
tcp        0      0 0.0.0.0:8080                0.0.0.0:*                   LISTEN      495        10726      1643/beam
#

If not, reboot the appliance. This can be done without downtime. The appliance only comes into play if a failover occurs.
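
If you want to check the appliance regularly, e.g. from a monitoring host, a small PowerShell sketch based on the requests shown above (IP and port taken from the example):

# Simple availability check of the quorum witness web service and the witness database
foreach ($uri in 'http://10.0.0.99:8080','http://10.0.0.99:8080/witness') {
    try {
        $response = Invoke-WebRequest -Uri $uri -UseBasicParsing -TimeoutSec 10
        Write-Output "$uri -> $($response.StatusCode)"
    } catch {
        Write-Output "$uri -> FAILED: $($_.Exception.Message)"
    }
}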