Tag Archives: vsphere

Fan health sensors report false alarms on HPE Gen10 Servers with ESXi 6.7

I’ve got several mails and comments about this topic. It looks like the latest ESXi 6.7 updates are causing some trouble on HPE ProLiant Gen10 servers.

I blogged about recurring host hardware sensor state alarm messages a few weeks ago. A customer noticed them after an update. Last week, I got the first comments under this blog post about fan failure messages after applying the latest ESXi 6.7 updates. Then more and more customers asked me about this, because they were seeing these messages in their environments after applying the latest updates.

Last Saturday I tweeted my blog post to give a hint to my followers who may be experiencing the same problem.

Fortunately, one of my followers (thanks Markus!) pointed me to a VMware KB article with a workaround: Fan health sensors report false alarms on HPE Gen10 Servers with ESXi 6.7 (78989).

This is NOT a solution, but a workaround. Keep that in mind.

Thanks again to Markus. Make sure to visit his awesome blog (MY CLOUD-(R)EVOLUTION), especially if you are interested in vSphere, Veeam and automation!

Space reclamation of VMFS 5 Datastores using esxcli

It was a bit quiet here in January because of a new “private project” that has consumed some resources, and will consume more in the future.

But this will not stop me from documenting useful stuff. This one is nothing new, but it is a question customers ask me quite often: How do I get my storage capacity back after deleting VMs?!

The outlined steps are all done using esxcli. You need to execute them on a single ESXi host, not on each host in the cluster.

Connect to one of your ESXi hosts using SSH. You can use this small PowerCLI command to enable SSH on a specific host.
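A minimal sketch of such a one-liner, assuming an existing Connect-VIServer session and a placeholder host name:

    # Start the SSH service (TSM-SSH) on a single ESXi host
    Get-VMHost -Name "esx01.lab.local" | Get-VMHostService | Where-Object { $_.Key -eq "TSM-SSH" } | Start-VMHostService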

The first step is to identify the datastore(s) from which you want to reclaim storage.
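A good starting point is the extent list, which maps each datastore to its backing device:

    # List all VMFS datastores with their VMFS UUIDs and backing device names
    esxcli storage vmfs extent list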

We will need the device name, and later the UUID. The next step is to check whether the device is detected as a thin-provisioned disk, and whether it is VAAI-capable. I’ve shortened the esxcli output to the necessary lines.
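Using the device name from the extent list (the naa identifier below is just a placeholder), the device details can be queried like this:

    # Check whether the device is thin-provisioned and VAAI-capable; look for
    # "Thin Provisioning Status: yes" and "VAAI Status: supported" in the output
    esxcli storage core device list -d naa.600508b1001c4d41af0d6a6a3fd62bc9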

Now we have to verify that all necessary VAAI options are supported.
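Again with the placeholder device name:

    # List the VAAI primitive support for the device; the output shows
    # the ATS, Clone, Zero, and Delete status
    esxcli storage core device vaai status get -d naa.600508b1001c4d41af0d6a6a3fd62bc9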

The important primitive for us is “Delete”. If it is supported, we can use UNMAP to reclaim storage.
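The reclamation itself is started with the unmap command, either by datastore label or by UUID (both values below are placeholders):

    # Reclaim free space, addressed by the datastore label...
    esxcli storage vmfs unmap -l DATASTORE01
    # ...or by the VMFS UUID from the extent list
    esxcli storage vmfs unmap -u 5241bb21-63a6a2a5-3b3d-ac162d75c6d6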

This process will take some time, depending on the amount of storage that has to be reclaimed. And it will put some load on your storage, so you might want to run it outside business hours.

Why we need vSAN licensing for SMB customers

Not every customer is running full-blown vSphere Enterprise Plus licensing. To be honest, when I look at the number of licenses sold, most of my customers are running vSphere Essentials Plus. Not Essentials, not Standard, not Enterprise (Plus), but two or three hosts with Essentials Plus. And that’s perfectly fine!

Two or three hosts with 10 GbE and pretty often 12G SAS. Some of them with Fibre Channel, nearly no one with iSCSI. My colleagues and I have developed a pretty rock-solid setup over the last years, which we sell as a kind of building block: HPE ProLiant, HPE MSA, Aruba switches, vSphere Essentials Plus. A perfect setup for most of our customers, who run something between 10 and 30 VMs on it. Some of them also add Horizon View (as an add-on).

But requirements change. More customers ask for more hosts. When customers break out of the Essentials Plus licensing, it is often because of the host limitation. Fewer of them do it because they need DRS or even Storage vMotion.

Some of my customers have heard about vSAN and they like the idea behind it, especially when you take into account that hardware costs are decreasing and flash storage is getting cheaper. But when you discuss the idea of combining vSAN and Essentials licensing, you will hit the host limitation early.

VMware itself states in the vSAN licensing guide:

The 2-node vSAN deployment model is not restricted to a specific vSAN license edition. In other words, any of the licensing editions can be used with a 2-host configuration. vSphere Essentials Kit or vSphere Essentials Plus Kit licensing limits the number of hosts managed by vCenter Server Essentials to three. The vSAN witness host – virtual appliance or physical – is considered a host in these Essentials licensing bundles.

Source: VMware vSAN Licensing Guide

When you take a look at the Horizon desktop licensing, or at the RoBo licensing, you will see another kind of limitation: a limit on the number of VMs, not on the number of hosts. This is pretty interesting when you think about combining vSAN and Essentials licensing.

Why not offer an “HCI Essentials Kit”, limited to 25 VMs, with the features of Essentials Plus and vSAN Standard? This would allow customers to run four or five hosts with vSAN. By limiting the number of VMs instead of hosts, customers could scale out their infrastructure in terms of capacity.

Hey VMware, you might think about this over the Christmas holiday. ;) There is a customer segment that is not yet sufficiently addressed by your sales team. This is a chance for more YoY growth. ;)

vCenter Migration from 6.0 to 6.7 fails due to missing user role

Yesterday should have been the day on which I migrated one of the last physical Windows vCenter servers in my customer base. Well… the migration failed twice. And each time I had to roll back, power on the old physical server, reset the computer account, etc.

The update was from VMware vCenter Server 6.0 Update 3d (7462484) on a Windows 2012 R2 server to vCenter Server 6.7 Update 3 (Appliance). The migration failed at 62% with the following message:

I found the same error in the content-library-firstboot.py_9150_stderr.log file of the downloaded log bundle.

Okay, that’s a pretty long error message, and I had no idea where to start searching. But it seemed to be related to the Content Library of the vCenter, and it looked like a privileges problem.

A forum post led me to the content library administrator role. The author had to deal with a failed migration (6.5 to 6.7), but in his case the content library administrator role was missing. In my case, the role existed.

Sorry for the German translation. As you can see, the role existed… obviously. I tried to add a new role with the name com.vmware.Content.Admin, as mentioned in the forum post, and… a new role appeared.

You might notice the “Beispiel” (German for “Example”). That’s the difference. Whatever the other role is or what it looks like, it is definitely not the original content library administrator role.

And to make a long story short: The migration was successful after this small change.

User vdcs does not have the expected uid 1006

This posting is ~1 year old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Sorry for the long delay since my last blog post – busy times, but with lots of vSphere. :) Today, I did an upgrade of a standalone vCenter Server Appliance at one of my healthcare customers. The vCenter was on 6.0 U3 and I had to upgrade it to 6.7 U2. It was only a small deployment with three hosts, so nothing fancy. And as with many other vSphere upgrades, I came across this warning message:

Warning User vdcs does not have the expected uid 1006
Resolution Please refer to the corresponding KB article.

I have seen this message multiple times, but in the past there was no KB article about it, only a VMTN thread. And this thread mentioned that you can safely ignore the message if you don’t use a Content Library. Confirmation enough to proceed with the upgrade. :)

Meanwhile, there is a KB article:

Uploading content to the library fails with error: Content Library Service does not have write permission on this storage backing (52559)

This is a statement from the KB article:

Note: You can safely ignore this message if you are not using Content Library Service before the upgrade, or using it only for libraries not backed by NFS storage.

Currently, I don’t have customers with NFS-backed Content Libraries, but if you do, you might want to take a look at it. Especially if you have done an upgrade from 6.0 to 6.5 or 6.7 and you want to start using Content Libraries now.

Poor performance with Windows 10/2019 1809 on VMFS 6

This posting is ~1 year old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

THIS IS FIXED in ESXi 6.5 U3 and 6.7 U3.

See KB67426 (Performance issues with Windows 10 version 1809 VMs running on snapshots) for more information.

TL;DR: This bug is still current and has not been fixed yet! Some users in the VMTN thread mentioned a hotpatch from VMware, which seems to have been pulled. A fix for this issue will be available with ESXi 6.5 U3 and 6.7 U3. The only workaround is to place VMs on VMFS 5 datastores, or to avoid the use of snapshots if you have to use VMFS 6. I can confirm that Windows 10 1903 is also affected.

One of my customers told me that they had massive performance problems with a Horizon View deployment at one of their customers. We talked about this issue and they mentioned that it was related to Windows 10 1809 and VMFS 6. A short investigation showed that this issue is well known, and that even VMware is working on it. In their case, another IT company had installed the Cisco HyperFlex solution and the engineer was unaware of this issue.


What do we know so far? In October 2018 (!), shortly after the release of Windows 10 1809, a thread came up in the VMTN (windows 10 1809 slow). According to the posted test results, the issue occurs under the following conditions.

  • Windows 10 1809
  • VMware ESXi 6.5 or 6.7 (regardless of build level)
  • VM has at least one snapshot
  • VM is placed on a VMFS 6 datastore
  • Space reclamation enabled or disabled (makes no difference)

The “official” statement from VMware support is:

The issue is identified to be due to some guest OS behavior change in this version of windows 10, 1809 w.r.t thin provisioned disks and snapshots, this has been confirmed as a bug and will be fixed in the following releases – 6.5 U3 and 6.7U3, which will be released within End of this year (2019).

https://communities.vmware.com/message/2848206#2848206

I don’t care whether the root cause is VMFS 6 or Windows 10, but VMware and Microsoft need to get this fixed fast! Just to make this clear: You will face the same issue regardless of whether you run Windows 10 in a VM, use Windows 10 with Horizon View, or Windows 10 with Citrix. When VMFS 6 and snapshots come into play, you will run into this performance issue.

I will update this blog post when I get some news.

“Cannot execute upgrade script on host” during ESXi 6.5 upgrade

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

I was onsite at one of my customers to update a small VMware vSphere 6.0 U3 environment to 6.5 U2c. The environment consists of three hosts: two hosts in a cluster, and a third host that is only used to run an HPE StoreVirtual Failover Manager.

The update of the first host, using the Update Manager and an HPE custom ESXi 6.5 image, was pretty flawless. But the update of the second host failed with “Cannot execute upgrade script on host”.


I checked the host and found it with ESXi 6.5 installed, but one of the five iSCSI datastores was missing. Then I tried to patch the host with the latest patches and hit “Remediate”. The task failed with “Cannot execute upgrade script on host”. So I did a rollback to ESXi 6.0 and tried the update again, this time using iLO and the HPE custom ISO. But the result was the same: The host was running ESXi 6.5 after the update, but the upgrade failed with the “upgrade script” error. After this attempt, the host was unable to mount any of the iSCSI datastores. This was because the datastores were mounted ATS-only on the other host, and the failed host was unable to mount the datastores in this mode. Very strange…

I checked the vua.log and found this error message:

Focus on this part of the error message:

The upgrade script failed due to an illegal character in the output of esxcfg-info. First of all, I had to find out what this 0x80 character is. I checked the UTF-8 and Windows-1252 encodings, and found out that 0x80 is the € (Euro) symbol in Windows-1252. I searched the output of esxcfg-info for the € symbol – and found it.
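You can verify this mapping yourself. In Windows PowerShell, decoding byte 0x80 with the Windows-1252 code page returns the Euro sign:

    # Decode byte 0x80 using the Windows-1252 code page
    [System.Text.Encoding]::GetEncoding(1252).GetString([byte[]](0x80))
    # Output: €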

But how to get rid of it? Where does it hide in the ESXi config? I scrolled a bit up and down around the € symbol. A bit above, I found a reference to HPE_SATP_LH. This immediately got my attention, because the customer is using StoreVirtual VSA and StoreVirtual hardware appliances.

Now my second educated guess of the day came into play. I checked the installed VIBs and found the StoreVirtual Multipathing Extension installed on the failed host – but not on the host where the ESXi 6.5 update was successful.
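Listing and removing VIBs is done with esxcli. Note that the VIB name below is only my assumption of how the extension shows up – take the exact name from the list output on your own host:

    # List all installed VIBs and filter for the StoreVirtual extension
    esxcli software vib list | grep -i lh
    # Remove the VIB by name (placeholder name!), then reboot the host
    esxcli software vib remove -n hpe-storevirtual-mem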

I removed the VIB from the buggy host, did a reboot, and tried to update the host with the latest patches – with success! Cross-checking showed that the € symbol was missing in the esxcfg-info output of the host that was upgraded first. I don’t have a clue why the StoreVirtual Multipathing Extension caused this error. The customer and I decided not to install the StoreVirtual Multipathing Extension again.

Powering on a VM with shared VMDK fails after extending an EagerZeroedThick VMDK

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

I hope that you are not reading this blog post while searching for a solution for a failed cluster. If so, feel free to leave a comment if this blog post saved your evening or weekend. :)

Last Friday, a change at one of my customers went horribly wrong. I was not onsite, but they contacted me during the night from Friday to Saturday, because their most important Windows Server Failover Cluster was unable to start after extending a shared VMDK.


They tried something pretty simple: extending a virtual disk of a VM. That is something most of us do pretty often. The customer had also done this pretty often. It was a well-known task… except for the fact that the VM was part of a Windows Server Failover Cluster. With shared VMDKs. And the disks were EagerZeroedThick, because this is a requirement for shared VMDKs.

They extended the disk using the vSphere Web Client. And at this point, the change was doomed to fail. They tried to power on the VMs, but all they got was this error:

VMware ESX cannot open the virtual disk, “/vmfs/volumes/4c549ecd-66066010-e610-002354a2261b/VMNAME/VMDKNAME.vmdk” for clustering. Please verify that the virtual disk was created using the ‘thick’ option.

A shared VMDK is a VMDK in multi-writer mode. This VMDK has to be created as Thick Provision Eager Zeroed. And if you wish to extend this VMDK, you must use vmkfstools with the option -d eagerzeroedthick. If you extend the VMDK using the Web Client, the extended portion of the disk will become lazy-zeroed!
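Done from the ESXi shell, the extend would have looked roughly like this (size and path are placeholders):

    # Extend the shared VMDK and eager-zero the newly added blocks
    vmkfstools -X 200G -d eagerzeroedthick /vmfs/volumes/DATASTORE/VMNAME/VMDKNAME.vmdk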

VMware has described this behaviour in KB1033570 (Powering on the virtual machine fails with the error: Thin/TBZ disks cannot be opened in multiwriter mode). There is also a blog post by Cormac Hogan at VMware, who described this behaviour.

That’s a screenshot from the failed cluster. Check out the type of the disk (Thick-Provision Lazy-Zeroed).


You must use vmkfstools to extend a shared VMDK – but vmkfstools is also the solution if you have fallen into this pitfall: clone the VMDK with the option -d eagerzeroedthick.
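A sketch of the repair, with placeholder paths:

    # Clone the broken VMDK into a new, fully eager-zeroed copy
    vmkfstools -i /vmfs/volumes/DATASTORE/VMNAME/broken.vmdk /vmfs/volumes/DATASTORE/VMNAME/fixed.vmdk -d eagerzeroedthick

Afterwards, attach the cloned VMDK to the VMs (or rename it) and delete the broken one.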

Another solution, which was new to me, is to use Storage vMotion. You can migrate the “broken” VMDK to another datastore and change the disk format during the Storage vMotion. This solution is described in the “Notes” section of KB1033570.

Both ways will fix the problem. The result will be a Thick Provision Eager Zeroed VMDK, which will allow the VMs to be successfully powered on.

Veeam backups fail because of time differences

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Last week I had an interesting incident at a customer. The customer reported that one of multiple Veeam backup jobs constantly failed.


The backup job included two VMs, and the backup of one of these VMs failed with this error:

They verified the credentials used for that job, but re-entering the password did not solve the issue. I then checked the Veeam backup logs located under %ProgramData%\Veeam\Backup (look for the Agent.Job_Name.Source.VM_Name.vmdk.log) and found VDDK Error 3014:

The user that was used to connect to the vCenter was an account located in Active Directory. The account was granted administrator privileges at the root of the vCenter. Switching from the AD account to Administrator@vsphere.local solved the issue. Next stop: vmware-sts-idmd.log on the vCenter Server Appliance. The error found in this log confirmed my theory that there was an issue with the authentication itself, not an issue with the AD account.

To make a long story short: Time differences. The vCenter, the ESXi hosts and some servers had the wrong time. vCenter and ESXi hosts were using the Domain Controllers as time source.

This is the ntpq  output of the vCenter. You might notice the jitter values on the right side, both noted in milliseconds.

After some investigation, the root cause seemed to be a bad DCF77 receiver, which was connected via a USB-to-LAN converter to the domain controller hosting the PDC emulator role. Instead of using a DCF77 receiver, the customer and I implemented an NTP hierarchy using a valid NTP source on the internet (pool.ntp.org).
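On the Windows side, such a hierarchy typically starts at the domain controller holding the PDC emulator role. A sketch, run in an elevated PowerShell on that DC:

    # Point the PDC emulator at external NTP servers from pool.ntp.org
    w32tm /config /manualpeerlist:"0.pool.ntp.org 1.pool.ntp.org 2.pool.ntp.org" /syncfromflags:manual /reliable:yes /update
    # Restart the time service and force a resync
    net stop w32time
    net start w32time
    w32tm /resync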

Unsupported hardware family ‘vmx-06’

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

A customer of mine got an appliance from a software vendor. The appliance was delivered as a ZIP file with a VMDK, an MF, and an OVF file. Unfortunately, the appliance was created with VMware Workstation 6.0 with virtual machine hardware version 6, which is incompatible with VMware ESXi (see Virtual machine hardware versions). During deployment, my customer got this error:

The OVF file includes a line with the VM hardware version.
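The hardware version hides in the VirtualHardwareSection of the OVF; the relevant line looks something like this:

    <vssd:VirtualSystemType>vmx-06</vssd:VirtualSystemType>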

If you change this line from vmx-06 to vmx-07, the hash of the OVF changes, and you will get an error during the deployment of the appliance because of the wrong file hash.

Solution

You have to update the SHA256 hash of the OVF file, which is stored in the MF file.

To create the new SHA256 hash, you can use the PowerShell cmdlet Get-FileHash.
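For example (the file name is a placeholder):

    # Compute the new SHA256 hash of the modified OVF file
    Get-FileHash -Algorithm SHA256 .\appliance.ovf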

Replace the hash and save the MF file. Then re-deploy the appliance.

Andreas Lesslhumer wrote a similar blog post in 2015:
“Unsupported hardware family vmx-10” during OVF import