Tag Archives: vExpert

Wrong iovDisableIR setting on ProLiant Gen8 might cause a PSOD

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

TL;DR: There’s a script at the bottom of the page that fixes the issue.

Some days ago, this HPE customer advisory caught my attention:

Advisory: (Revision) VMware – HPE ProLiant Gen8 Servers running VMware ESXi 5.5 Patch 10, VMware ESXi 6.0 Patch 4, Or VMware ESXi 6.5 May Experience Purple Screen Of Death (PSOD): LINT1 Motherboard Interrupt

And there is also a corrosponding VMware KB article:

ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers

It isn’t clear WHY this setting was changed, but in VMware ESXi 5.5 patch 10, 6.0  patch 4, 6.0 U3 and, 6.5 the Intel IOMMU’s interrupt remapper functionality was disabled. So if you are running these ESXi versions on a HPE ProLiant Gen8, you might want to check if you are affected.

To make it clear again, only HPE ProLiant Gen8 models are affected. No newer (Gen9) or older (G6, G7) models.

Currently there is no resolution, only a workaround. The iovDisableIR setting must set to FALSE. If it’s set to TRUE, the Intel IOMMU’s interrupt remapper functionality is disabled.

To check this setting, you have to SSH to each host, and use esxcli  to check the current setting:

I have written a small PowerCLI script that uses the Get-EsxCli cmdlet to check all hosts in a cluster. The script only checks the setting, it doesn’t change the iovDisableIR setting.

Here’s another script, that analyzes and fixes the issue.

Creating console screenshots with Get-ScreenshotFromVM.ps1

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Today, I had a very interesting discussion. As part of an ongoing troubleshooting process, console screenshots of virtual machines should be created.

The colleagues, who were working on the problem, already found a PowerCLI script that was able to create screenshots using the Managed Object Reference (MoRef). But unfortunately all they got were black screens and/ or login prompts. Latter were the reason why they were unable to run the script unattended. They used the Get-VMScreenshot script, which was written by Martin Pugh.

I had some time to take a look at his script and I created my own script, which is based on his idea and some parts of his code.

This file is also available on GitHub.

One important note: If you want to take console screenshots of VMs, please make sure that display power saving settings are disabled! Windows VMs are showing a black screen after some minutes. Please disable this using the energy options, or better using a GPO. Otherwise you will capture a black screen!

vExpert 2017 – My 2 cents about the increasing number of vExperts

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Last Wednesday, VMware has published a list with the vExperts for 2017.

I’m on this list. I’m on this list for the fourth time, which makes me very happy and proud. I was surprised that I’m on this list. I have written only a few blog posts last year. I sometimes tweet about VMware, and I am active in some forums. The focus of this blog has shifted.

Are there too many vExperts?

Eric Siebert from vsphere-land.com wrote a blog post about the vExpert announcement (vExpert 2017 announced and there are still too many vExperts & vExpert class of 2016 announced – are there too many vExperts). Eric thinks that VMware makes it to easy to be a vExpert. There is no definition what it means to significantly contribute to the community.

Yes, Eric is right. The criteria are very “spongy”. And that is a problem. But it is not the problem. When the VMware community grows, the number of vExperts increase automatically.

Betteridge’s law of headlines – the answer is always no

Look at other community programs (Veeam Vanguard, Cisco Champion, PernixPro, Microsoft MVP etc.). These community programs were designed to reward individuals that have highly contributed to the community. These awards motivate individuals to contribute to the community. And if individuals contribute significantly, they are awarded for this. The increasing number of awarded individuals will motivate more community members to contribute. More individuals will be awarded. It’s a self-sustaining process. A process that will help your community to grow.

When you design such a community program, you don’t want to have a small elite group of individuals. You want that your community grows. But you must use the right criterias, and you must held the level high enough. You must not reward anyone, that missed the criteria. Otherwise the title becomes inflationary and loses its importance and reputation.

VMware should work on these criterias. They should rise the bar. But they should make the criterias and the election process transparent for everyone.

Horizon View: Server certificate does not match the external url

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.
Horizon View Certificate Error

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

Certificates are always fun… or should I say PITA?  Whatever… During a small Horizon View PoC, I noticed an error message for the View Connection Server.

That’s right, Mr. Connection Server. The certificate subject name does not match the servers external URL, as this screenshot clearly shows.

Horizon View Secure Tunnel Settings

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

But both settings are unused, because a VMware Access Point appliance is in place. If I remove the certificate, that was issued from a public certificate authority, I get an error message because of an invalid, self signed certificate.

I want to use the certificate on the Horizon View Connection Server, but I also want to get rid of the error message, caused by the wrong subject name. The customer uses split DNS, so he is using the same URL internally and externally, and the certificate uses the external URL as subject name.

Change the URLs

The solution is easy:

  1. Enable the checkboxes for the Secure Tunnel connection and the Blast Secure Gateway
  2. Change the hostname to the name, that matches the subject name of the certificate
  3. Uncheck the checkboxes again, and apply the settings
Horizon View Secure Tunnel Settings

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

After a couple of secons, and a refresh of the dashboard, the error for the Connection Server should be gone.

Horizon View Connection Server Details

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

VMware EUC Access Point appliance – Name resolution not working after deployment

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

As part of a project, I had to deploy a VMware EUC Access Point appliance. Nothing fancy, because the awesome VMware Access Point Deployment Utility makes it easy to deploy.

Unfortunately, the deployed Access Point appliance was not working as expected. When I tried to access my Horizon View infrastructure behind the Access Point appliance, I got a HTTP 504 error. The REST API interface was working. I was able to exclude invalid certificates, routing, or firewall policies. I re-deployed the appliance using the the IP address of the connection server, instead of the FQDN. And this worked… I checked the name resolution with nslookup and the name resolution failed. So that was probably the problem.

One per line

To make a long story short: The DNS server, I entered in the VMware Access Point Deployment Utility, were added in a single line to the /etc/resolv.conf

This is wrong, even if the VMware Access Point Deployment Utility claims something different.

euc_deployment_dns

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

There must be a single “nameserver” entry for each DNS server.

You can easily change this after the deployment. Add only one DNS server during the deployment, and then add the second DNS server after the deployment.

I would like to highlight, that Chris Halstead mentioned this behaviour a year ago in his blog post “VMware Access Point Deployment Utility“. Chris is the author of the Deployment Utility.

Why I moved from NFS to vSAN… and why it went wrong

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

I wanted to retire my Synology DS414slim, and switch completely to vSAN. Okay, no big deal. Many folks use vSAN in their lab. But I’d like to explain why I moved to vSAN and why this move failed. I think some of my thoughts are also applicable for customer environments.

So far, I used a Synology DS414slim with three Crucial M550 480 GB SSDs (RAID 5) as my main lab storage. The Synology was connected with two 1 GbE uplinks (LAG) to my  network, and each host was connected with 4x 1 GbE uplinks (single distributed vSwitch). The Synology was okay from the capacity perspective, but the performance was horrible. RAID 5, SSDs and NFS were not the best team, or to be precise, the  CPU of the Synology was the main bottleneck.

nas_ds414slim

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

1,2 GHz is not enough, if you want to use NFS or iSCSI. I never got more than 60 MB/s (sequential). The random IO performance was okay, but as soon as the IO increased, the latencies went through the roof. Not because the SSDs were to slow, but because the CPU of the Synology was not powerful enough to handle the NFS requests.

Workaround: Add more flash storage

The workaround for the poor random IO performance was adding more flash storage. This time, the flash storage was added to the hosts. I used PernixData FVP to boost my lab. FVP was a quite cool product (unfortunately it was a cool product.) PernixData granted me, as a PernixPro, some licenses for my lab.

End of an era

The acquisition of PernixData by Nutanix, the missing support für vSphere 6.5, and the end of availability of all PernixData products led to the decision to remove PernixData FVP from my lab. Without PernixData FVP, my lab was again a slow train crawling up a hill. Four HPE ProLiant, with enough CPU (40 cores) and memory resources (384 GB RAM) were tied down by slow IO.

Redistribution of resources

I had

  • three 480 GB SSDs, and
  • three 40 GB SSDs

in stock. The 40 GB SSDs were to small and slow, so I replaced them with 120 GB SSDs. I was able to equip three of my four hosts with SSDs. Three hosts with flash storage were enough to try VMware vSAN.

Fortunately, not all hosts have to add capacity to a vSAN cluster. Hosts can also only consume storage from a vSAN cluster. With this in mind, vSAN appeared to be a way out of my IO dilemma. In addition, using the 480 GB SSDs as capacity tier, a vSAN all-flash config was possible.

Migration

It took me a little time to move around VMs to temporary locations, while keeping my DC and my VCSA available. I had to remove my datastore on the Synology to free up the 480 GB SSDs. The necessary vSAN licenses were granted by VMware (vExpert licenses).

The creation of the vSAN cluster itself was easy. Fortunately, wiping partitions from disks is easy. You can use the vSphere Web Client to do this.

vsan_wipe_partitions

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

The initial performance was quite good, much better than expected and much better than the NFS performance of the old Synology NAS. I enabled deduplication and compression, but as soon as I moved VMs to the vSAN datastore, the throughput dropped and latencies went through the roof. It was totally unusable. Furthermore, I got health alarms:

vsan_congestion_error

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

As the load increased, the errors became more severe.

vsan_performance_error

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0


I was able to solve this with a blog post of Cormac Hogan (VSAN 6.1 New Feature – Handling of Problematic Disks). Even without compression and deduplication, the performance was not as expected and most times to low to work with. At this point, I got an idea what was causing my vSAN problems.

Do not use consumer-grade hardware with vSAN

To be honest: The budget is the problem. I had to take consumer-grade SSDs.

This is a screenshot from the vSAN Observer. esx1 to esx3 are equipped with SSDs, esx4 is only consuming storage from the vSAN cluster.

vsan_observer_perf

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

Red is not the color to highlight good things…

An explanation attempt

This blog post of Duncan Epping (Why Queue Depth matters!) is a bit older, but still valid in my case. The controller I use  (HPE Smart Array P410i) has a a deep queue (1011), the RAID device has a queue length of 1024, but the SATA SSDs have only a queue length of 32. Here’s the disk adapter and disk device view of ESXTOP.

The consumer-grade SSDs drowned in IOs, unable to handle parallel read and write operations. There’s nothing much that I can do. Currently there are two options:

  • Replacing the SSDs with devices, that have a deeper queue depth
  • Replace the Synology NAS with a more powerful NAS and move back to NFS

I don’t know which way I will go. To get this clear:

  • This is my lab, not a customer environment
  • It is not a vSAN related problem
  • It is because of consumer-grade hardware

Do not try this at production kids. Go vSAN, but please use the right hardware.

Replacing an expired lookup service SSL certificate on a vSphere PSC

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

A few days ago, I ran into a very nasty problem. Fortunately, it was in my lab. Some months ago, I replaced the certificates of my vCenter Server Appliance (VCSA), and I’ve chosen to use the VMware Certificate Authority (VMCA) as a subordinate of my AD-based enterprise CA. The VMCA was used as intermediate CA. The certificates were replaced using the  vSphere 6.0 Certificate Manager (/usr/lib/vmware-vmca/bin/certificate-manager), and I followed the instructions of KB2112016 (Configuring VMware vSphere 6.0 VMware Certificate Authority as a subordinate Certificate Authority).

The VCSA was migrated from vSphere 5.5, and with vSphere 5.5 I was also using custom certificates. These certificates were also issued by my AD-based enterprise CA, and these certificates were migration during the vSphere 5.5 > 6.0 migration. So at the end, I replaced custom certificates with VMCA (as an intermediate CA) certificates.

Everything was fine, until a power outage. After powering-on my VMs, I noticed several errors. After logging into the vSphere Web Client, I got an error message at the top of the page:

While searching for the cause, I checked the URL of the Platform Services Controller (https://vcsa1.lab.local/psc/login) and got this:

psc_error_1

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0


This error led me to KB2144086 (Updating certificates using certificate manager on vCenter Server or PSC 6.0 Update 1b fails), but was able to proof, that I have used different subject names for the different solution user certificates.

While digging in the PSC logs, I found this error in the /var/log/vmware/psc-client/psc-client.log:

Finally, I found Aaron Smiths blog post “Troubleshooting Expired PSC Certificates with vSphere 6“, who had the same problem. I checked the certificate of the Lookup Service and there it was:

psc_error_2

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

This was the original custom certificate, issued by my AD-based enterprise CA, and installed on my vSphere 5.5 VCSA.

Aaron also offered the solution by referencing KB2118939 (Replacing the Lookup Service SSL certificate on a Platform Services Controller 6.0). I followed the instructions in KB2118939 and replaced the certificate of the Lookup Service with a certificate of the VMCA.

Take care of your certificates

With vSphere 6.0, the Lookup Service should be accessed through the HTTP Reverse Proxy. This proxy uses the machine certificate. Therefore, an expired Lookup Certificate is not obvious. If you connect directly to the Lookup Service using port 7444, you will see the expired certificate. The Lookup Service certificate is not replaced with a custom certificate, if you replace the different solution user certificates.

If you have a vSphere 6.0 VCSA, which was migrated from vSphere 5.5, and you have replaced the certificates on that vSphere 5.5 VCSA with custom certificates, you should check your Lookup Service certificate immidiately! Follow KB2118939 for further instructions.

Credit to Aaron Smith for this blog post. Thank you!

HPE 3PAR OS updates that fix VMware VAAI ATS Heartbeat issue

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Customers that use HPE 3PAR StoreServs with 3PAR OS 3.2.1 or 3.2.2 and VMware ESXi 5.5 U2 or later, might notice one or more of the following symptoms:

  • hosts lose connectivity to a VMFS5 datastore
  • hosts disconnect from the vCenter
  • VMs hang during I/O operations
  • you see the messages like these in the vobd.log or vCenter Events tab

  • you see the following messages in the vmkernel.log

Interestingly, not only HPE is affected by this. Multiple vendors have the same issue. VMware described this issue in KB2113956. HPE has published a customer advisory about this.

Workaround

If you have trouble and you can update, you can use this workaround. Disable ATS heartbeat for VMFS5 datastores. VMFS3 datastores are not affected by this issue. To disable ATS heartbeat, you can use this PowerCLI one-liner:

Solution

But there is also a solution. Most vendors have published firwmare updates for their products. HPE has released

  • 3PAR OS 3.2.2 MU3
  • 3PAR OS 3.2.2 EMU2 P33, and
  • 3PAR OS 3.2.1 EMU3 P45

All three releases of 3PAR OS include enhancements to improve ATS heartbeat. Because 3PAR OS 3.2.2 has also some nice enhancements for Adaptive Optimization, I recommend to update to 3PAR OS 3.2.2.

Monitoring hardware status with Python and vSphere API calls

This posting is ~3 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Apparently it’s “how to monitor hardware status” week on vcloudnine.de. Some days ago, I wrote an article about using SNMP for hardware monitoring. You can also use the vSphere Web Client to get the status of the host hardware. A third way is through the vSphere API. I just want to share a short example how to use vSphere API calls and pyVmomi. pyVmomi is the Python SDK for the VMware vSphere API.

Get hardware status with vSphere API calls

I just want to share a small example, that shows the basic principle. The script gathers the temperature sensor data of a ProLiant DL360 G7 running ESXi 6.0 U2 using vSphere API calls.

The output of the script looks like this:

Nothing fancy. You can easily loop through numericSensorInfo to gather data from other sensors. Use the Managed Object Browser (MOB) to navigate through the API. This is handy if you search for specific sensors. If you need accurate data, the vSphere API is the way to go. If you focus on something lightweight, try SNMP.

Missing hardware status tab in the vSphere Client

This posting is ~3 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

I thought, everyone knows it, but I’m always being asked “Where’s the hardware status tab?” after an update from vSphere 5.x to 6. Many customers still use the vSphere Client (C # client), and steer clear of the vSphere Web Client. To be honest: Me too. I often use the C# client, especially if I do mass operations, or for a quick look at something.

This is really nothing new, the answer is clear. But I think it’s a good idea to write it down. At least for myself. As a reminder to use the vSphere Web Client.

The hardware status tab

Many customers used the hardware status tab to get a quick overview about the health of the ESXi host hardware.

vsphere_client_hw_status_tab

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

But after an update to vSphere 6 the hardware status tab is missing in the vSphere Client. This is an expected behaviour! VMware has published an knowledge base article about this (The Hardware Status Tab is no longer available in the vSphere Client after upgrading to vCenter Server 6.0). The only solution is to use the vSphere Web Client.

Use the vSphere Web Client

Meanwhile, the old vSphere Client has many downsides. All features introduced in vSphere 5.5 and later are only available through the vSphere Web Client. This also applies to the hardware status tab.

vsphere_web_client_hw_status_tab

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

You find the hardware status on the “Monitor” tab of a host. It offers the same information as the legacy hardware status tab from the vSphere Client.

Do yourself a favor and use the vSphere Web Client. Always, any time.