Tag Archives: vsphere

VMware vCenter 7.0 U2 deployment fails at stage 2

Today I had to deploy a new vCenter appliance. Nothing fancy, new deployment. Stage 1 was easy, but stage 2 failed several times. I re-deployed the vCenter appliance two times, but as the deployment failed for the third time, I took a look into the logs.

The deployment failed without any error, but it didn’t finished. It stopped during the start of different services without any error.

First of all: Log into the appliance using SSH or the console. Use the root account and the root password you have entered during the setup.

A good point to start are the logs under /var/log/firstboot. I used ls -lt to get the last written logs. Most services will write two logs: One log ends with _stdout.log, and the second one will end with _stderr.log. The _stdout.log contails the log messages of the service. The _stderr.log contains the errors. I searched for a service that has written to a _stderr.log – and I found it: scafirstboot.py_10507_stderr.log.

And this log gave me a hint what the root cause was. One of the last log entries was:

ssl.SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: certificate is not yet valid

What what? A certificate not only has an end date, but also a date before which it is not valid – a start date. And this is often indicates a problem with – NTP. And it was NTP. I have configured NTP for the vCenter, but not for the ESXi on which I deployed the vCenter. -.- If it is not DNS, it’s NTP. Or a invalid certificate. Or both.

vCenter Server migration to 6.7 fails with “Failed to check VMware STS. The SSL certificate of STS service cannot be verified”

There are still customers out there that are running vCenter Server on a Windows host. This year, despite the fact that most customers have set project on hold, I managed some of them to migrate to a vCenter Server Appliance.

Some days ago I had an meeting with one of my favorite customers to migrate their vCenter Server 6.5 to a vCenter Server Appliance 6.7 U3l. They were still on 6.5 because of some legacy ESXi 5.5 hosts, but they managed it to remove them from their vCenter and we were able to start the migration.

Healthcheck & Stage 1

We did a healthcheck the day before, so we pretty sure that everything should will went smooth. Stage 1 was easy. We deployed the appliance (X-Large… *cough* *cough*) and moved to stage 2. We had ~20 GB of data to migrate. IMHO nothing fancy, I had vCenter withs more data to migrate.

Stage 2… and FAIL

To make a long story short: Stage 2 failed pretty hard with an unrecoverable error.

Encountered an internal error. Traceback (most recent call last): File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 1641, in main vmidentityFB.boot() File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 345, in boot self.checkSTS(self.__stsRetryCount, self.__stsRetryInterval) File "/usr/lib/vmidentity/firstboot/vmidentity-firstboot.py", line 1179, in checkSTS raise Exception('Failed to initialize Secure Token Server.') Exception: Failed to initialize Secure Token Server.

Deep dive into the log files

The messaged indicated a problem with the Secure Token Server (STS). Without any troubleshooting we did a second try, which also fails with the same message. Time to dive into the logs…

The log file of vmidentity-firstboot.py was pretty helpful, because the end of the log file pointed us to the right direction.

Failed to check VMware STS.
com.vmware.vim.sso.client.exception.CertificateValidationException: The SSL certificate of STS service cannot be verified

This message led us to VMware KB76144 (“Failed to check VMware STS. The SSL certificate of STS service cannot be verified” while upgrading VCSA from 6.5 to 6.7/7.0)

According to the KB article, the cause for our problem is a certificate in STS_INTERNAL_SSL_CERT store which is used by the STS. For sure: This vCenter was upgraded from 5.5 at some time in the past.

So we checked the certificate stores and found further evidence, that a certificate seemed to be our main problem. As you can see, this certificate from the STS_INTERNAL_SSL_CERT store was expired some days ago.

Fortunately, KB76144 offers a simple solution to this problem. In short:

  • remove certificates from the STS_INTERNAL_SSL_CERT store, and
  • re-import the certificate from the MACHINE_SSL_CERT store

It’s DNS… or NTP… or a expired certificate

Because we had a Windows-based vCenter, we had to modify the commands from the KB article for Windows.

C:\Program Files\VMware\vCenter Server\vmafdd>vecs-cli entry getcert --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output c:\temp\machine_ssl.crt

C:\Program Files\VMware\vCenter Server\vmafdd>vecs-cli entry getkey --store MACHINE_SSL_CERT --alias __MACHINE_CERT --output c:\temp\machine_ssl.key

C:\Program Files\VMware\vCenter Server\vmafdd>vecs-cli entry getcert --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT --output c:\temp\STS_INTERNAL_SSL_CERT-__MACHINE_CERT.crt

C:\Program Files\VMware\vCenter Server\vmafdd>vecs-cli entry getkey --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT --output c:\temp\STS_INTERNAL_SSL_CERT-__MACHINE_CERT.key

C:\Program Files\VMware\vCenter Server\vmafdd>vecs-cli entry delete --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT -y
Deleted entry with alias [__MACHINE_CERT] in store [STS_INTERNAL_SSL_CERT] successfully

C:\Program Files\VMware\vCenter Server\vmafdd>vecs-cli entry create --store STS_INTERNAL_SSL_CERT --alias __MACHINE_CERT --cert c:\temp\machine_ssl.crt --key c:\temp\machine_ssl.key
Entry with alias [__MACHINE_CERT] in store [STS_INTERNAL_SSL_CERT] was created successfully

The last step was to restart the VMware STS service. After this, we tried the migration again and the migration went smooth.

Adobe Flash will die and how does this affects VMware

December 31, 2020 will not only be the end of the miserable year 2020, it will also be the end of an era – the era of Adobe Flash! Adobe has announced that they will stop supporting Adobe Flash after December 31, 2020. Furthermore, Adobe will block Flash from running in Flash Player on January 12, 2021. Adobe strongly recommends that all users immediately uninstall Flash Player. I got a popup a couple of times, asking me if I want to uninstall Adobe Flash. It’s still installed… :/

Adobe Flash isn’t a big thing in web development anymore, but there is a reason why I still have Adobe Flash installed – Admin Interfaces!

Source: 9GAG

We all had to deal with Flash after VMware started with the vSphere Web Client. It was slow and partially painful buggy. New newer HTML5 based Web Client was much better, but not feature complete until vSphere 6.7.

But the vSphere Web Client was not the only admin interface based on Flash used in a VMware product. The Horizon Administrator, which was the main administration interface until Horizon 7.8, is also based on Flash. Or vRealize Operations uses Flash until version 6.6.

Update now!

If you want to remove Adobe Flash from your computer, you have to update your whole, or at least parts, of your VMware infrastructure.

The simple rule is: Update to the latest release and everything will be fine. If you are running vSphere 6.7 U3, the HTML5 based Web Client is feature complete. The same applies to Horizon View. If you are running 7.10 or a newer release, everything is fine.

VMware has published and KB article which summarizes the update paths: VMware Flash End of Life and Supportability (78589).

But what if I can’t/ or I’m unwilling to update?

In this case, there is an easy approach: Disconnect your systems from the internet or at least block the internet access for them. The alternative approach is not recommended! Stop the automatic updates on your web browser and use the Flash-based User Interfaces on a browser which still supports Flash. Again: This is really not recommended!

Fan health sensors report false alarms on HPE Gen10 Servers with ESXi 6.7

This posting is ~1 year years old. You should keep this in mind. IT is a short living business. This information might be outdated.

I’ve got several mails and comments about this topic. It looks like that the latest ESXi 6.7 updates are causing some trouble on HPE ProLiant Gen10 servers.

I’ve blogged about recurring host hardware sensor state alarm messages some weeks ago. A customer noticed them after an update. Last week, I got the first comments under this blog post abot fan failure messages after applying the latest ESXi 6.7 updates. Then more and more customers asked me about this, because they got these messages too in their environment after applying the latest updates.

Last Saturday I tweeted my blog post to give a hint to my followers who may be experiencing the same problem.

Fortunately one of my followers (Thanks Markus!) pointed me to a VMware KB article with a workaround: Fan health sensors report false alarms on HPE Gen10 Servers with ESXi 6.7 (78989).

This is NOT a solution, but a workaround. Keep that in Mind.

Thanks again to Markus. Make sure to visit his awesome blog (MY CLOUD-(R)EVOLUTION) , especially if you are interested in vSphere, Veeam and automation!

Space reclamation of VMFS 5 Datastores using esxcli

This posting is ~1 year years old. You should keep this in mind. IT is a short living business. This information might be outdated.

It was a bit quiet here in January caused by a new “private project” which has attracted some resources, and will pull more resources in the future.

But this will not stop me from documenting useful stuff. This one is nothing new, but commonly asked by some customers: How do I get my storage capacity back after deleting VMs?!

The outlined steps are all done using esxcli. You need to execute them on a single ESXi host, not on each host in the cluster.

Connect to one of your ESXi hosts using SSH. You can use this small PowerCLI command to enable SSH on a specific host.

Get-VMHost esx1.lab.local | Get-VMHostService | Where Key -EQ "TSM-SSH" | Start-VMHostService 

The first step is to identify the datastore(s) from which you want to reclaim storage.

[[email protected]:~] esxcli storage vmfs extent list
 Volume Name    VMFS UUID                            Extent Number  Device Name                           Partition
 -------------  -----------------------------------  -------------  ------------------------------------  ---------
 VMDS01         55dc0522-c72eebec-3780-d89d672d7a3c              0  naa.60030d90eca17602ce5c5a54a083e31c          1

We will need the device name, and later the UUID. The next step is to identify if the device is detected as a thin-provisioned disk, and if it is VAAI-capable. I’ve shortened the output of the esxcli output to the necessary output.

[[email protected]:~] esxcli storage core device list -d naa.60030d90eca17602ce5c5a54a083e31c
    Thin Provisioning Status: yes
    VAAI Status: supported

No we have to verify if all necessary VAAI options are supported.

[[email protected]:~] esxcli storage core device vaai status get -d naa.60030d90eca17602ce5c5a54a083e31c
 naa.60030d90eca17602ce5c5a54a083e31c
    VAAI Plugin Name:
    ATS Status: supported
    Clone Status: supported
    Zero Status: supported
    Delete Status: supported

Important for us is the “Delete” primitive. If this is supported, we can use UNMAP to reclaim storage.

[[email protected]:~] esxcli storage vmfs unmap -u 55dc0522-c72eebec-3780-d89d672d7a3c

This process will take some time depending on the amount of storage that has to be reclaimed. And it will put some load on your storage, so you might want to run this in a less productive time.

Why we need a vSAN licensing for SMB customers

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Not every customer is running a full-blown vSphere Enterprise Plus licensing. To be honest, when I look at the number of sold licenses, most of my customers are running vSphere Essentials Plus. Not Essentials, nor Standard or Enterprise (Plus), but two or three hosts with Essentials Plus. And that’s perfectly fine!

Two or three hosts with 10 GbE and pretty often 12G SAS. Some of them with Fibre-Channel, nearly no one with iSCSI. My colleagues and I developed a pretty rock solid setup over the last years, which we sell like some kind of building block: HPE ProLiant, HPE MSA, Aruba Switches, vSphere Essentials Plus. A perfect setup for most of our customers, which run something between 10 and 30 VMs on it. Some of them also add Horizon View (Add-On) to it.

But requirements change. More customers ask for more hosts. When customers break out of the Essentials Plus licensing, then often because of the host limitation. Less of them do this because they need DRS or even Storage vMotion.

Some of my customers have heard about vSAN and they like the idea behind it. Especially when you take into account, that hardware costs decrease and flash storage is getting cheaper. But when you discuss the idea of combining vSAN and Essentials licensing, you will hit the host limitation early.

VMware itself states in the vSAN licensing guide:

The 2-node vSAN deployment model is not restricted to a specific vSAN license edition. In other words, any of the licensing editions can be used with a 2-host configuration. vSphere Essentials Kit or vSphere Essentials Plus Kit licensing limits the number of hosts managed by
vCenter Server Essentials to three. The vSAN witness host – virtual appliance or physical – is considered a host in these Essentials licensing bundles.

Source: VMware vSAN Licensing Guide

When you take a look at the Horizon Desktop licensing, or at the RoBo licensing, you will see another kind of limitation: Limiting the number of VMs, not the number of hosts. This is pretty interesting when you think about combining vSAN and Essentials licensing.

Why not offering a “HCI Essentials Kit” limitied to 25 VMs, and the features offered by Essentials Plus and vSAN Standard? This would allow customers to run four or five hosts with vSAN. By limiting the number of VMs, customers can scale-out their infrastructure in terms of capacity.

Hey VMware, you might think about this over the Christmas holiday. ;) There is a customer segment that is not yet sufficiently addressed by your sales team. This is a chance for more YoY growth. ;)

vCenter Migration from 6.0 to 6.7 fails due to missing user role

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Actually, yesterday should be the day at which I migrate one of the last physical Windows vCenter servers installed in my customer base. Actually… the migration failed twice. And each time I had to rollback, power-on the old physical server, reset the computer account etc.

The update was from VMware vCenter Server 6.0 Update 3d (7462484) on a Windows 2012 R2 server to vCenter Server 6.7 Update 3 (Appliance). The migration failed at 62% with the following message:

Traceback (most recent call last):
  File "/usr/lib/vmware-content-library/firstboot/content-library-firstboot.py", line 219, in Main
    vdc_fb.register_cis()
  File "/usr/lib/vmware-content-library/firstboot/content-library-firstboot.py", line 77, in register_cis
    self._reg_info.registerAll(self.get_soluser_id(), self.get_soluser_ownerId())
  File "/usr/lib/vmware-content-library/install_lib/cis_register.py", line 368, in registerAll
    self.registerUserAndService(user_name, user_id, service)
  File "/usr/lib/vmware-content-library/install_lib/cis_register.py", line 395, in registerUserAndService
    add_vmtx_privileges(self.vdc_cfg_dir)
  File "/usr/lib/vmware-content-library/install_lib/add_vmtx_privileges_after_fb.py", line 105, in add_vmtx_privileges
    log("Adding privileges [%s] to role %s" % (' '.join(VMTX_SYNC_PRIVILEGES), cls_admin_role.name))
AttributeError: 'NoneType' object has no attribute 'name'

I found the same error in the content-library-firstboot.py_9150_stderr.log file of the downloaded log bundle.

Okay, that’s a pretty long error message and I had no idea where I should start searching. But it seems related to the Content Library of the vCenter. And it looks like it is related to the privileges.

log("Adding privileges [%s] to role %s" % (' '.join(VMTX_SYNC_PRIVILEGES), cls_admin_role.name))

A forum post led me to the content library administrator role. The author had to deal with a failed migration (6.5 to 6.7), but his conten administrator role was missing. In my case, the role was existent.

Sorry for the german translation. As you can see, the role was existent… Obviously. I tried to add a new role with the name com.vmware.Content.Admin, as mentioned in the forum post, and… a new role appeared.

You might notice the “Beispiel” or “Example”. That’s the difference. Whatever the other role is or what its look like, it is definitely not the original content library administrator role.

And to make a long story short: The migration was successful after this small change.

User vdcs does not have the expected uid 1006

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Sorry for the long delay since my last blog post – busy times, but with lots of vSphere. :) Today, I did an upgrade of a standalone vCenter Server Appliance at one of my healthcare customers. The vCenter was on 6.0 U3 and I had to upgrade it to 6.7 U2. It was only a small deployment with three hosts, so nothing fancy. And as with in many other vSphere upgrades, I came across this warning message:

Warning User vdcs does not have the expected uid 1006
Resolution Please refer to the corresponding KB article.

I saw this message multiple times, but in the past, there was no KB article about this, only a VMTN thread. And this thread mentioned, that you can safely ignore this message, if you don’t use a Content Library. Confirmation enough to proceed with the upgrade. :)

Meanwhile, there is a KB article:

Uploading content to the library fails with error: Content Library Service does not have write permission on this storage backing (52559)

This is a statement from the KB article:

Note: You can safely ignore this message if you are not using Content Library Service before the upgrade, or using it only for libraries not backed by NFS storage.

Currently, I don’t have cusomters with NFS backed Content Libraries, but if you do, you might want to take a look at it. Especially if you have done an upgrade from 6.0 to 6.5 or 6.7 and you want to start using Content Libraries now.

Poor performance with Windows 10/ 2019 1809 on VMFS 6

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

THIS IS FIXED in ESXi 6.5 U3 and 6.7 U3.

See KB67426 (Performance issues with Windows 10 version 1809 VMs running on snapshots) for more information.

TL;DR: This bug is still up to date and has not been fixed yet! Some user in the VMTN thread mentioned a hotpatch from VMware, which seems to be pulled. A fix for this issue will be available with ESXi 6.5 U3 and 6.7 U3. The only workaround is to place VMs on VMFS 5 datastores, or avoid the use of snapshots if you have to use VMFS 6. I can confirm, that Windows 1903 is also affected.

One of my customers told me that they have massive performance problems with a Horizon View deployment at one of their customers. We talked about this issue and they mentioned, that this was related to Windows 10 1809 and VMFS 6. A short investigation showed, that this issue was well known, and even VMware is working on this. In their case, another IT company installed the Cisco HyperFlex solution and the engineer was unaware of this issue.

Image by Manfred Antranias Zimmer from Pixabay

What do we know so far? In October 2018 (!), shortly after the release of Windows 10 1809, a thread came up in the VMTN (windows 10 1809 slow). According to the posted test results, the issue occurs under the following conditions.

  • Windows 10 1809
  • VMware ESXi 6.5 or 6.7 (regardless from build level)
  • VM has at least one snapshot
  • VM is placed on a VMFS 6 datastore
  • Space reclamation is enabled or disabled

The “official” statement of the VMware support is:

The issue is identified to be due to some guest OS behavior change in this version of windows 10, 1809 w.r.t thin provisioned disks and snapshots, this has been confirmed as a bug and will be fixed in the following releases – 6.5 U3 and 6.7U3, which will be released within End of this year (2019).

https://communities.vmware.com/message/2848206#2848206

I don’t care if the root cause is VMFS 6 or Windows 10. But VMware and Microsoft needs to get this fixed fast! Just to make this clear: You will face the same issues, regardless if you run Windows 10 in a VM, use Windows 10 with Horizon View, or Windows 10 with Citrix. When VMFS 6 and Snapshots comes into play, you will ran into this performance issue.

I will update this blog post when I get some news.

“Cannot execute upgrade script on host” during ESXi 6.5 upgrade

This posting is ~3 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

I was onsite at one of my customers to update a small VMware vSphere 6.0 U3 environment to 6.5 U2c. The environment consists of three hosts. Two hosts in a cluster, and a third host is only used to run a HPE StoreVirtual Failover Manager.

The update of the first host, using the Update Manager and a HPE custom ESX 6.5 image, was pretty flawless. But the update of the second host failed with “Cannot execute upgrade script on host”

typographyimages/ pixabay.com/ Creative Commons CC0

I checked the host and found it with ESXi 6.5 installed. But I was missing one of the five iSCSI datastores. Then I tried to patch the host with the latest patches and hit “Remidiate”. The task failed with “Cannot execute upgrade script on host”. So I did a rollback to ESXi 6.0 and tried the update again, but this time using ILO and the HPE custom ISO. But the result was the same: The host was running ESXi 6.5 after the update, but the upgrade failed with the “Upgrade Script” error. After this attempt, the host was unable to mount any of the iSCSI datastores. This was because the datastores were mounted ATS-only on the other host, and the failed host was unable to mount the datastores in this mode. Very strange…

I checked the vua.log and found this error message:

2018-11-05T16:35:56.614Z info vua[A3CAB70] [[email protected] sub=VUA] Command '/tmp/vuaScript-xMVUfb/precheck.py --ip=172.19.0.14' finished with exit status 1
--> stderr: --------
--> INFO:root:Running esxcfg-info
--> Traceback (most recent call last):
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/subprocess.py", line 385, in run
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/subprocess.py", line 788, in communicate
-->   File "/build/mts/release/bora-9298722/bora/build/esx/release/vmvisor/sys-boot/lib64/python3.5/encodings/ascii.py", line 26, in decode
--> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1272423: ordinal not in range(128)

Focus on this part of the error message:

--> UnicodeDecodeError: 'ascii' codec can't decode byte 0x80 in position 1272423: ordinal not in range(128)

The upgrade script failed due to an illegal character in the output of esxcfg-info. First of all, I had to find out what this 0x80 character is. I checked UTF-8 and the windows1252 encoding, and found out, that 0x80 is the € (Euro) symbol in the windows-1252 encoding. I searched the output of esxcfg-info for the € symbol – and found it.

            \==+Heap : 
               |----Name............................................€A
               |----Growable........................................true
               |----Max Size........................................41848 bytes
               |----Max Available...................................40816 bytes
               |----Current Size....................................29560 bytes
               |----Current Size....................................29560 bytes
               |----Current Allocation..............................1032 bytes
               |----Current Available...............................1032 bytes
               |----Current Releasable..............................20400 bytes
               |----Percent Free of Current.........................96 
               |----Percent Free of Max.............................97 
               |----Percent Releasable..............................69

But how to get rid of it? Where does it hide in the ESXi config? I scrolled a bit up and down around the € symbol. A bit above, I found a reference to HPE_SATP_LH . This took immidiately my attention, because the customer is using StoreVirtual VSA and StoreVirtual HW appliances.

Now, my second educated guess of the day came into play. I checked the installed VIBs, and found the StoreVirtual Multipathing Extension installed on the failed host – but not on the host, where the ESXi 6.5 update was successful.

I removed the VIB from the buggy host, did a reboot, tried to update the host with the latest patches – with success! The cross-checking showed, that the € symbol was missing in the esxcfg-info  output of the host that was upgraded first. I don’t have a clue why the StoreVirtual Multipathing Extension caused this error. The customer and I decided to not install the StoreVirtual Multipathing Extension again.