Storage vMotion stuck at 100% - cleaning up migration state

Moving VMs from an old cluster with old ESXi hosts to a new cluster with new hosts can be very easy, even if the clusters don’t share any storage. With a PowerCLI one-liner or the Web Client, you can migrate VMs between hosts and datastores while the VMs are running. This enhancement was added with vSphere 5.1. I’m often surprised how many customers don’t know this feature, simply because they are still using the old vSphere C# client.

A few days ago, I had to move some VMs to a new cluster, and I used this well-known feature to move the running VMs to the new hosts. Everything was fine until I realized that the vMotion process had stopped at 100%. Usually, the cleanup finishes very quickly. But this time, the vMotion hung.

Poking around in the fog

After waiting 15 minutes, I started investigating what was going on. One of the first things I discovered was large vmware.log files in the working directories of all running VMs. Each VM had a vmware.log of up to 8 GB. With lsof, I found a vpxa-worker process that was accessing the vmware.log of the VM being moved. This corresponded with a slow file transfer (~5 MB/s) between the ESXi hosts that had been running for more than 15 minutes. The vmware.log itself was full of VMware Tools debug messages. I checked the guest OS (Windows Server 2008 R2) and found a tools.conf (C:\ProgramData\VMware\VMware Tools\tools.conf) with this content:

[logging]
log = true

# Enable tools service logging to vmware.log
vmsvc.level = debug
vmsvc.handler = vmx

# Enable new "vmusr" service logging to vmware.log
vmusr.level = error
vmusr.handler = vmx

# Enable "Volume Shadow Copy" service logging to vmware.log
vmvss.level = debug
vmvss.handler = vmx

It looks like someone enabled debug mode for the VMware Tools… more than 18 months ago. I changed the log level to “error”, copied the tools.conf to all VMs and restarted the VMware Tools service on each VM. This stopped the growth of the vmware.log files immediately.
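
The log level change can also be scripted before the file is copied to the VMs. The following is only a minimal sketch, assuming a local copy of the tools.conf on a machine with a POSIX shell and sed. The function name quiet_tools_conf is made up for this example and is not part of the VMware Tools. It rewrites every "<service>.level = debug" line to "error" and leaves everything else, including Windows line endings, untouched.

```shell
# quiet_tools_conf FILE
# Print FILE with every "<service>.level = debug" entry changed to "error".
# Hypothetical helper: it works on a local copy of tools.conf, not inside the guest.
quiet_tools_conf() {
    # \1 keeps the key and the "=", \2 keeps trailing whitespace (e.g. a CR).
    sed 's/^\([[:space:]]*[A-Za-z]*\.level[[:space:]]*=[[:space:]]*\)debug\([[:space:]]*\)$/\1error\2/' "$1"
}

# Usage: quiet_tools_conf tools.conf > tools.conf.new
```

Comment lines and handler entries do not match the pattern, so only the log levels change.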

But how do you deal with the vmware.log files? You can instantly “shrink” a vmware.log file. All you need is SSH access to the ESXi host that is running the VM.

/vmfs/volumes/f222c18e-31aad6e9/veeam # ls -l vmware.log
-rwxrwxrwx    1 1024     users       118531 Jan 14 08:07 vmware.log
/vmfs/volumes/f222c18e-31aad6e9/veeam # ls -lh vmware.log
-rwxrwxrwx    1 1024     users     115.8K Jan 14 08:07 vmware.log
/vmfs/volumes/f222c18e-31aad6e9/veeam # cp /dev/null vmware.log
/vmfs/volumes/f222c18e-31aad6e9/veeam # ls -lh vmware.log
-rwxrwxrwx    1 1024     users          0 Jan 14  2016 vmware.log
/vmfs/volumes/f222c18e-31aad6e9/veeam #

Use the cp command to overwrite the vmware.log file with /dev/null. As you can see, the log file then has a size of 0 bytes. You don’t have to shut down the VM for this. But you lose all the data stored in the vmware.log file!
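
If the most recent log entries might still be needed, the end of the log can be saved before the file is emptied. The sketch below is one possible way to do this, assuming a POSIX shell with tail available. The function name truncate_vm_log and the .tail suffix are made up for this example. It uses ": > file", which empties a regular file in place, just like copying /dev/null over it.

```shell
# truncate_vm_log FILE [KEEP_LINES]
# Save the last KEEP_LINES lines (default 1000) to FILE.tail, then empty FILE.
# Hypothetical helper; everything that is not saved is lost, as with cp /dev/null.
truncate_vm_log() {
    tvl_log="$1"
    tvl_keep="${2:-1000}"
    # Abort without truncating if the tail could not be saved.
    tail -n "$tvl_keep" "$tvl_log" > "$tvl_log.tail" || return 1
    # Truncate in place; the file keeps its name, owner and permissions.
    : > "$tvl_log"
}

# Usage: truncate_vm_log vmware.log 500
```

The saved copy ends up next to the log, so it is worth checking the free space on the datastore before choosing a large line count.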

Last words

You should ALWAYS disable debug logging once you have the data you want. Never enable debug logging permanently!
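
To catch forgotten debug logging before it gets in the way of a migration, the datastores can be checked for oversized log files from time to time. This is a minimal sketch, assuming the find on the host supports -size with a k suffix (GNU and BusyBox builds usually do). The name find_big_vm_logs is made up, and the default threshold of 1 GB is an arbitrary choice.

```shell
# find_big_vm_logs ROOT [MIN_KB]
# Print every vmware*.log below ROOT that is larger than MIN_KB kilobytes.
# Hypothetical helper; the pattern also matches rotated logs like vmware-1.log.
find_big_vm_logs() {
    fbl_root="$1"
    fbl_min_kb="${2:-1048576}"   # default: 1 GiB, expressed in KiB
    find "$fbl_root" -type f -name 'vmware*.log' -size +"${fbl_min_kb}k"
}

# Usage: find_big_vm_logs /vmfs/volumes
```

Any file this prints is a hint to check the tools.conf inside the corresponding guest.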