Tag Archives: vsphere

Storage vMotion stuck at 100% – cleaning up migration state

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Moving VMs from an old cluster with old ESXi hosts to a new cluster with new hosts can be very easy, even if the clusters don’t share any storage. A PowerCLI one-liner or the Web Client allows you to migrate VMs between hosts and datastores while the VMs are running. This enhancement was added with vSphere 5.1. I’m often surprised how many customers don’t know about this feature, just because they are still using the old vSphere C# client.
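The exact one-liner is not preserved here, but a minimal PowerCLI sketch of such a combined host and datastore migration could look like this (VM, host and datastore names are placeholders; adjust them to your environment):

    # Cross-host and cross-datastore vMotion of a running VM (requires vSphere 5.1 or later)
    # "VM01", "esx-new-01.lab.local" and "DS-NEW-01" are hypothetical names
    Get-VM -Name "VM01" |
        Move-VM -Destination (Get-VMHost "esx-new-01.lab.local") -Datastore (Get-Datastore "DS-NEW-01")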

Some days ago, I had to move some VMs to a new cluster, and I used this well-known feature to move the running VMs to the new hosts. Everything was fine until I realized that the vMotion process was stuck at 100%. Usually, the cleanup finishes very quickly. But this time, the vMotion stalled.

Poking around in the fog

After waiting 15 minutes, I started investigating what was going on. One of the first things I discovered were large vmware.log files in the working directories of all running VMs. Each VM had a vmware.log of up to 8 GB. With lsof, I found a vpxa-worker process that was accessing the vmware.log of the currently moved VM. This corresponded with a slow file transfer (~5 MB/s) that had been running for more than 15 minutes between the ESXi hosts. The vmware.log itself was full of VMware Tools debug messages. I checked the guest OS (Windows Server 2008 R2) and found a tools.conf (C:\ProgramData\VMware\VMware Tools\tools.conf) with this content:
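The original file content is not reproduced here, but a tools.conf with VMware Tools debug logging routed to the vmware.log on the host typically looks similar to this (the exact lines in the customer’s file may have differed):

    [logging]
    log = true
    # "debug" is the most verbose level; the "vmx" handler sends the messages
    # to the VM's vmware.log on the ESXi host
    vmsvc.level = debug
    vmsvc.handler = vmx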

Looks like someone had enabled the debugging mode for the VMware Tools… more than 18 months ago. I changed the log level to “error”, copied the tools.conf to all VMs and restarted the VMware Tools service on each VM. This stopped the growth of the vmware.log files immediately.
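The adjusted file looked roughly like this (again a sketch rather than the exact content):

    [logging]
    log = true
    # only log errors, which keeps the vmware.log small
    vmsvc.level = error
    vmsvc.handler = vmx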

But how do you deal with the already bloated vmware.log files? You can instantly “shrink” a vmware.log file. All you need is SSH access to the ESXi host that is running the VM.
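The original listing is not preserved here; the commands probably looked similar to this (the datastore and VM directory are examples):

    # On the ESXi host that runs the VM - overwrite the log with an empty file
    cp /dev/null /vmfs/volumes/datastore1/VM01/vmware.log
    # Verify the result: the file should now have a size of 0 bytes
    ls -lh /vmfs/volumes/datastore1/VM01/vmware.log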

Use the cp command to overwrite the vmware.log file. As you can see, the log file then has a size of 0 bytes. You don’t have to shut down the VM for this. But you lose all the data stored in the vmware.log file!

Last words

You should ALWAYS disable debug logging after you have collected the data you need. Never enable debug logging permanently!

Consider the Veeam Network transport mode if you use NFS datastores

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

I’m using Veeam Backup & Replication (currently 8.0 Update 3) in my lab environment to back up some of my VMs to an HP StoreOnce VSA. The VMs reside on an NFS datastore on a Synology DS414slim NAS, and the StoreOnce VSA is located on a local datastore (RAID 5 with SAS disks) on one of my ESXi hosts. The Veeam backup server is a VM and it’s also the Veeam Backup Proxy. The transport mode selection is set to “Automatic selection”.

Veeam Backup & Replication offers three different backup proxy transport modes:

  • Direct SAN Access
  • Virtual Appliance
  • Network

The Direct SAN Access transport mode is the recommended mode if the VMs are located on shared datastores (connected via FC or iSCSI). The Veeam Backup Proxy needs access to the LUNs, so the backup proxy is usually a physical machine. The data is read directly by the backup proxy from the LUNs. The Virtual Appliance mode uses the SCSI hot-add feature, which allows the attachment of disks to a running VM. In this case, the data is read by the backup proxy VM from the directly attached SCSI disk. In contrast to the Direct SAN Access mode, the Virtual Appliance mode can only be used if the backup proxy is a VM. The third transport mode is the Network transport mode. It can be used in any setup, regardless of whether the backup proxy is a VM or a physical machine. In this mode, the data is retrieved via the ESXi management network and travels over the network using the Network Block Device protocol (NBD or NBDSSL, the latter is encrypted). This is a screenshot of the transport mode selection dialog of the backup proxy configuration.

[Image: Veeam transport mode selection dialog (Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0)]

As you can see, the transport mode selection happens automatically if you don’t select a specific transport mode. The selection occurs in the following order: Direct SAN Access > Virtual Appliance > Network. So if you have a physical backup proxy without direct access to the VMFS datastore LUNs, Veeam Backup & Replication will use the Network transport mode. A virtual backup proxy will use the Virtual Appliance transport mode. This explains why Veeam uses the Virtual Appliance transport mode in my lab environment.

Some days ago, I configured e-mail notifications for some vCenter alarms. During the following nights I got alarm messages: a host had been disconnected from the vCenter, but it reconnected some seconds later. Another observation was that a running vSphere Client lost the connection to the vCenter Update Manager during the night. After some troubleshooting, I found indications that some of my VMs had become unresponsive. With this information, I quickly found the VMware KB article “Virtual machines residing on NFS storage become unresponsive during a snapshot removal operation (2010953)“. Therefore I switched the transport mode from Virtual Appliance to Network.

I recommend using the Network transport mode instead of the Virtual Appliance transport mode if you have a virtual Veeam Backup Proxy and NFS datastores. I really can’t say that it runs slower than the Virtual Appliance transport mode. It just works.

Important note for PernixData FVP customers

Remember to exclude the Veeam Backup Proxy VM from acceleration if you use the Virtual Appliance or NBD transport mode. If you use datastore policies, blacklist the VM or configure it as a VADP appliance. If you use VM policies, simply don’t configure a policy for the Veeam Backup Proxy VM. If you use Direct SAN Access, you need a pre- and a post-backup script to suspend the cache population during the backup. Check Frank Denneman’s blog post about “PernixData FVP I/O Profiling PowerCLI commands“.

ESXi 5.5 U3b and later are no longer manageable without vCenter 5.5 U3b

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

On December 8, 2015, VMware released VMware ESXi 5.5 patch ESXi550-201512001 (2135410). This patch is known as U3b and it contains general and security fixes, nothing special. Usually, you would install such an update without hesitation. But this time, you should take a look at the release notes of ESXi 5.5 U3b before you install it. This is taken from the release notes:

Note: In your vSphere environment, you need to update vCenter Server to vCenter Server 5.5 Update 3b before updating ESXi to ESXi 5.5 Update 3b. vCenter Server will not be able to manage ESXi 5.5 Update 3b, if you update ESXi before updating vCenter Server to version 5.5 Update 3b. For more information about the sequence in which vSphere environments need to be updated, refer, KB 2057795.

To avoid problems, you have to update your vCenter first and then update your hosts. This sequence is recommended by VMware, and I recommend taking a look at KB 2057795 (Update sequence for VMware vSphere 5.5 and its compatible VMware products).

PowerCLI: Get-LunPathState

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Careful preparation is a key element of success. If you restart a storage controller, or even the whole storage system, you should be very sure that all ESXi hosts have enough paths to every datastore. Sure, you can use the VMware vSphere C# client or the Web Client to check every host and every datastore. But if you have a large cluster with a dozen datastores and some Raw Device Mappings (RDMs), this can take a looooong time. Checking the path state of each LUN is a task that can be perfectly automated: get a list of all hosts, loop through every host and every LUN, and output a list of all hosts with all LUNs and all paths for each LUN. Sounds easy, right?

For a long time, I used this PowerCLI script for checking the LUN path state. But now I decided to give something back and I tweaked it a bit for my needs.
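The script itself did not survive here; a minimal sketch of such a function, assuming an existing connection to vCenter and a hypothetical cluster name, could look like this:

    function Get-LunPathState {
        # Loop through all hosts of the given cluster and report the path state of every LUN.
        # "Production" is just an example cluster name.
        param (
            [string]$Cluster = "Production"
        )

        foreach ($vmHost in (Get-Cluster -Name $Cluster | Get-VMHost | Sort-Object Name)) {
            foreach ($lun in (Get-ScsiLun -VmHost $vmHost -LunType disk)) {
                $paths = Get-ScsiLunPath -ScsiLun $lun
                [PSCustomObject]@{
                    VMHost        = $vmHost.Name
                    CanonicalName = $lun.CanonicalName
                    Paths         = $paths.Count
                    ActivePaths   = ($paths | Where-Object { $_.State -eq "Active" }).Count
                    DeadPaths     = ($paths | Where-Object { $_.State -eq "Dead" }).Count
                }
            }
        }
    }

    # Example call:
    Get-LunPathState -Cluster "Production" | Format-Table -AutoSize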

Feel free to use and/or modify it.

Reset the HP iLO Administrator password with hponcfg on ESXi

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Sometimes you need to reset the iLO Administrator password. Sure, you can reboot the server, press F8 and then reset the Administrator password. But if you have installed an HP-customized ESXi image, there is a much better way to reset the password: HPONCFG.

Check the /opt/hp/tools directory. You will find a binary called hponcfg.

All you need is a simple XML file. You can use the VI editor, or you can copy the necessary file with WinSCP to the root home directory on your ESXi host. I prefer VI. Change the directory to /opt/hp/tools. Then open pwreset.xml.

Press i to switch to the insert mode. Then paste this content into the file. You don’t have to know the current password!
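The file content was originally shown as a listing; a typical RIBCL snippet for setting a new Administrator password looks like this (when hponcfg is run locally, the PASSWORD attribute of the LOGIN tag is ignored, so any dummy value works):

    <RIBCL VERSION="2.0">
      <LOGIN USER_LOGIN="Administrator" PASSWORD="unknown">
        <USER_INFO MODE="write">
          <!-- sets the iLO Administrator password to "password" -->
          <MOD_USER USER_LOGIN="Administrator">
            <PASSWORD value="password"/>
          </MOD_USER>
        </USER_INFO>
      </LOGIN>
    </RIBCL>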

Press ESC and then :wq<ENTER> to save the file and leave VI. Now use HPONCFG together with the XML file to reset the password.
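The call is simply hponcfg with the XML file as input:

    # Apply the XML file - this resets the iLO Administrator password without a reboot
    /opt/hp/tools/hponcfg -f pwreset.xml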

That’s it! You can now login with “Administrator” and “password”.

Screen resolution scaling has stopped working after Horizon View agent update

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Another inconvenience that I noticed during the update process from VMware Horizon View 6.1.1 to 6.2 was that the automatic screen resizing stopped working. When I connected to a desktop pool with the VMware Horizon client, I only got the screen resolution of the VM (the resolution that is used when connecting to the VM with the vSphere console), not 1920×1200 as expected. This issue only occurred with PCoIP, not with RDP. I had this issue with a static desktop pool and a dynamic desktop pool, and it occurred after updating the Horizon View agent. The resolution scaling worked with a Windows 2012 R2 RDS host when I connected to an RDS session with PCoIP.

VMware KB1018158 (Configuring PCoIP for use with View Manager) did not solve the problem. I checked the VMX version, the video RAM config etc. Nothing had changed, everything was configured as expected. At this point it was clear to me that this had to be an issue with the Horizon View agent. I took some snapshots and tried to reinstall the Horizon View agent. I removed the Horizon View agent and the VMware Tools from one of my static desktops. After a reboot, I installed the VMware Tools and then the Horizon View agent. To my surprise, this first attempt solved the problem. I tried the same with my second static desktop pool VM and with the master VM of my dynamic desktop pool (don’t forget to recompose the VMs…). This workaround fixed the problem in each case.

I don’t know if this is a bug. I haven’t found any hints in the VMware Community forum or blogs. Maybe someone knows the answer.

VMware Horizon View agent update on RDS host fails with “Internal Error 25030”

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

I’m running a small VMware Horizon View environment in my lab. Nothing fancy, but all you need to show what Horizon View can do for you. This environment includes a Windows Server 2012 R2 RDS host. During the update process from Horizon View 6.1.1 to 6.2, I had to update the View agent on this RDS host. This update installation failed with an “Internal Error 25030”, followed by a rollback. Fortunately I had a snapshot, so I went back to the previous state and tried the update again. This attempt also went awry.

To make a long story short: Read the fscking release notes! This quote is taken from the Horizon View 6.2 release notes:

When you upgrade View Agent 6.1.1 to View Agent 6.2 on an RDS host running on Windows Server 2012 or 2012 R2, the upgrade fails with an “Internal Error 25030” message.
Workaround: Uninstall View Agent 6.1.1, restart the RDS host, and install View Agent 6.2.

And this is not the first time that this error has occurred. I found this quote in the Horizon View 6.1.1 release notes:

When you upgrade View Agent 6.1 to View Agent 6.1.1 on an RDS host running on Windows Server 2012 or 2012 R2, the upgrade fails with an “Internal Error 25030” message.
Workaround: Uninstall View Agent 6.1, restart the RDS host, and install View Agent 6.1.1

If you take a closer look at these two statements, you might notice some similarities… But I do not want to be spiteful. The workaround did the trick: simply uninstall the View agent (if it’s still installed after the rollback… that was not the case for me), reboot and reinstall the View agent.

PernixData Architect Software

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

With the general availability of PernixData FVP 3.1, PernixData released the first version of PernixData Architect.

One of the biggest problems today is that management tools are often focused on the deployment and monitoring of applications or infrastructure. This doesn’t lead to a holistic view of applications and the related data center infrastructure. You have to monitor at several points within the application stack, and even then you won’t get a holistic view. Without proper information, you can’t make proper decisions. At this point, PernixData Architect comes into play.

PernixData Architect is a software platform that supports the complete IT life cycle, from design and deployment to operation and optimization. It supports the decision-making process with data gathering and big data analytics. PernixData Architect continuously generates information and recommendations based on data gathered from VMs, storage devices, vCenter, the network etc. This information pool can be analysed with big data techniques: data is gathered, data is set into context (this is what information is), and information is linked and combined with recommendations. Here are some examples of what PernixData Architect can do for you (source):

  • Descriptive Analytics – Identify and profile the top 10 VMs on latency, throughput and IOPS.
  • Predictive Analytics – Calculate server-side resources needed to run a VM in Write Through versus Write Back mode, ensuring optimal hardware is allocated before a problem arises.
  • Prescriptive Analytics – Recommend ideal server-side resources based on application patterns.

PernixData Architect is a software-only solution and can be deployed with or without PernixData FVP. Without FVP, Architect can be used as a monitoring tool and gives you visibility, management and recommendations. Architect works with any server and storage platform that is compatible with VMware vSphere!

I’ve installed the latest PernixData FVP 3.1 release in my lab and enabled the 30-day trial period for PernixData Architect. You can access Architect through the web UI.

[Image: PernixData Architect web UI (Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0)]

As you can see, I have two clusters in my lab and both are accelerated using PernixData FVP. One cluster uses Distributed Fault Tolerant Memory (DFTM), the other cluster uses SSDs as acceleration resources. If Architect is enabled, FVP doesn’t display any stats and refers to the Architect UI. Below is a screenshot of the summary screen, which gives you a good overview at first glance.

[Image: PernixData Architect summary screen (Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0)]

Architect includes much more stats than FVP.

[Image: PernixData Architect statistics (Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0)]

On the “Intelligence” page, you get values for the working set of each ESXi host in the cluster. This is an important value for right-sizing your acceleration resources.

[Image: PernixData Architect working set estimation (Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0)]

As mentioned, PernixData Architect uses the gathered data to give you recommendations in real time. Even in my lab cluster, there are things to improve. ;)

[Image: PernixData Architect recommendations (Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0)]

This is only a short overview of PernixData Architect, but you might now see what insight Architect can give you. If you are curious to see what PernixData FVP and Architect can do for you, you can simply install both products as part of a proof-of-concept and test them for 30 days. Even if you don’t want to install FVP, Architect can be used without it. And FVP can even be used without acceleration resources, in a monitoring mode.

Using VCSA as remote syslog – Don’t forget the log rotation!

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.
Important note: It seems that vCenter Server Appliance updates revert the changes. Please check the settings after each update!

The VMware vCenter Server Appliance (VCSA) can act as a remote syslog destination for ESXi hosts. This is very handy for troubleshooting and I really recommend using this feature. But VMware ESXi hosts can be really chatty, so it’s a good idea to keep an eye on the free disk space of the VCSA.

Yesterday, a colleague had an interesting support case. A customer reported that his Veeam Backup & Replication jobs failed and that he was unable to log in to the vCenter with the vSphere Client and the vSphere Web Client. My colleague checked the VCSA VM and noticed that the VPXD failed to start (“Waiting for vpxd to initialize: ….failed”). Together we checked the appliance and the log files. The vpxd.log (/var/log/vmware/vpx) had last been updated weeks ago, but the last entry was interesting: no space left on device. Yet there was free disk space on /storage/log. I immediately checked the inode count with df -i and there it was: no free inodes.

Why is this a problem? Each name entry in the file system consumes an inode. If there are no free inodes, no new directories and files can be created, and the error message is the same as for missing disk space. Something had to have created a lot of files on /storage/log. Because /var/log/vmware is a symbolic link to /storage/log/vmware, it had to be something on the /storage/log partition. We checked the remote syslog location under /storage/log/remote and found gigabytes of logs and an incredible number of files. After removing the logs, the VPXD was able to start and the inode count returned to a normal level.
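If you run into the same symptoms, checking free space and free inodes on the log partition is quick (a sketch; /storage/log is the default log mount point of the VCSA 5.5):

    # Free disk space on the log partition
    df -h /storage/log
    # Free inodes - if IFree shows 0, no new files or directories can be created
    df -i /storage/log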

But why were there so many logs? We checked the logrotate config and found a faulty config for the remote syslog files. Instead of rotating logs and removing old ones, this config rotated all logs every day and thereby multiplied the number of log files. Please note that by default there is no logrotate config that rotates remote syslog files! This one had been added manually.

This is the default config for the remote syslog-collector of the VCSA:
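The config was originally shown as a listing. On the VCSA 5.5, the remote syslog collector is based on syslog-ng (/etc/syslog-ng/syslog-ng.conf); the relevant destination looks roughly like this. Treat it as a sketch, the exact file template may differ between builds:

    # Default remote destination: one directory per host and per month, one file per day
    destination remote_syslog {
        file("/var/log/remote/$HOST/$YEAR-$MONTH/messages-$YEAR-$MONTH-$DAY"
             create_dirs(yes));
    };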

As you can see, with these settings a folder for each host and each month is created. According to this VMTN posting, we changed the syslog-collector config a bit:
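The adjusted destination, reduced to a single file per host (again a sketch):

    # One file per host, rotated by logrotate instead of new files per day
    destination remote_syslog {
        file("/var/log/remote/$HOST/messages" create_dirs(yes));
    };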

With these settings, only a single file per host is created. We also made a change to /etc/logrotate.d/syslog and added this at the end:
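The appended stanza might have looked similar to this (path and rotation interval correspond to the values described below):

    /var/log/remote/*/messages {
        weekly
        rotate 30
        compress
        missingok
        notifempty
    }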

With this configuration 30 log files will be preserved. The number of log files or how often log rotation should happen (weekly or daily) can easily be adjusted. But these settings should be sufficient for small environments.

It’s important to understand that the VCSA has several disks and that these disks are mounted at different mount points within the root filesystem (this applies to a vSphere 5.5 VCSA).
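You can check the layout on your own appliance like this (a sketch; on a 5.5 VCSA, /storage/core, /storage/log and /storage/db are typically separate filesystems):

    # List all mounted filesystems and their free space
    df -h
    # The log directories are symlinks into /storage/log
    ls -ld /var/log/vmware /var/log/remote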

/var/log/vmware and /var/log/remote are links to /storage/log/vmware and /storage/log/remote. Make sure that there is always enough free disk space on ALL disks! I also want to highlight VMware KB2092127 (After upgrading to vCenter Server Appliance 5.5 Update 2, pg_log file reports this error: WARNING: there is already a transaction in progress). This error hit me a couple of times…

Automating ESXi configuration for DataCore SANsymphony-V

This posting is ~4 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

DataCore describes in their Host Configuration Guide for VMware ESXi some settings that must be adjusted before storage from DataCore SANsymphony-V storage servers can be assigned to the ESXi hosts. Today, for ESXi 5.x and 6.0, you have to add a custom SATP rule and adjust the advanced setting DiskMaxIOSize. For ESX(i) 4, more parameters had to be adjusted, but I will focus on ESXi 5.x and 6.0. You need to adjust these settings on each host that should get storage mapped from a DataCore storage server. If you have more than one host, you may wish to automate the necessary steps. To check the current value of DiskMaxIOSize, you can use these lines of PowerCLI code.
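The original snippet is not preserved here; a minimal sketch, assuming the hosts are grouped in a cluster whose name is stored in $esxCluster (an example value), could look like this:

    # Report the current Disk.DiskMaxIOSize value for every host in the cluster
    $esxCluster = "Production"   # example cluster name - adjust it

    Get-Cluster -Name $esxCluster | Get-VMHost | ForEach-Object {
        Get-AdvancedSetting -Entity $_ -Name "Disk.DiskMaxIOSize" |
            Select-Object @{N="VMHost";E={$_.Entity.Name}}, Name, Value
    }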

To set DiskMaxIOSize to the recommended value of 512 and to create the custom SATP rule, use these lines:
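And a sketch for that second part, using the esxcli interface of PowerCLI (Get-EsxCli -V2 requires a newer PowerCLI release; the SATP/PSP combination below is only an example, always follow the current DataCore host configuration guide):

    $esxCluster = "Production"   # example cluster name - adjust it

    foreach ($vmHost in (Get-Cluster -Name $esxCluster | Get-VMHost)) {
        # Set Disk.DiskMaxIOSize to the recommended value of 512
        Get-AdvancedSetting -Entity $vmHost -Name "Disk.DiskMaxIOSize" |
            Set-AdvancedSetting -Value 512 -Confirm:$false

        # Add a custom SATP claim rule for DataCore virtual disks
        $esxcli = Get-EsxCli -VMHost $vmHost -V2
        $satpArgs = $esxcli.storage.nmp.satp.rule.add.CreateArgs()
        $satpArgs.satp        = "VMW_SATP_ALUA"   # example SATP - check the DataCore guide
        $satpArgs.psp         = "VMW_PSP_RR"      # example PSP  - check the DataCore guide
        $satpArgs.vendor      = "DataCore"
        $satpArgs.model       = "Virtual Disk"
        $satpArgs.claimoption = "tpgs_on"
        $satpArgs.description = "DataCore SANsymphony-V"
        $esxcli.storage.nmp.satp.rule.add.Invoke($satpArgs)
    }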

Please make sure that you adjust the $esxCluster variable. Nothing fancy, but it can save some time, even if you only have two hosts. ;)