Tag Archives: troubleshooting

VMware Update Manager reports “error code 99” during scan operation

This posting is ~4 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

After updating my lab to VMware vSphere 6.0 U2, one of my hosts kept throwing an error during an update scan.

The first thing I checked was the esxupdate.log on the affected ESXi host. This is the output that was logged during a scan operation.

You might notice the “Unrecognized file vendor-index.xml in Metadata file” error. I also found this error message on the other hosts, so I excluded it from further research. It was unlikely that this error was related to the observed problem. I started looking for differences between the hosts and found that the output of “esxcli software vib list” was different on the faulty host.

This is the output on the faulty host:

This is the output on a working host. You see the difference?

Doesn’t look right… I investigated further, still searching for differences. And then I found two empty directories under /var/db/esximg.

The same directories were populated on other, working hosts.
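The check for empty directories can also be scripted. The following is only a minimal sketch in Python, assuming a Python interpreter is available where you run it; the function name `empty_subdirs` and the directory names in the demo are made up for illustration and are not taken from an ESXi host.

```python
import os
import tempfile


def empty_subdirs(root):
    """Return the sorted names of direct subdirectories of root that have no entries."""
    empty = []
    for name in os.listdir(root):
        path = os.path.join(root, name)
        if os.path.isdir(path) and not os.listdir(path):
            empty.append(name)
    return sorted(empty)


if __name__ == "__main__":
    # Throwaway tree for demonstration; "populated" and "hollow" are made-up names.
    with tempfile.TemporaryDirectory() as root:
        os.mkdir(os.path.join(root, "populated"))
        os.mkdir(os.path.join(root, "hollow"))
        with open(os.path.join(root, "populated", "example.xml"), "w") as handle:
            handle.write("<placeholder/>")
        print(empty_subdirs(root))
```

Pointing the function at /var/db/esximg on a faulty and on a working host would make the difference visible at a glance.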

One possible solution was therefore to copy the missing files to the faulty host. I used SCP for this. To get SCP working, you have to enable the SSH Client in the ESXi firewall.

After that, I copied the files from a working host to the faulty host. Please make sure that the hosts have the same build! In my case, both hosts had the same build. Don’t try to copy files from an older or newer build to the host!

And because we are pros, we disable the SSH Client after using it.

As expected, “esxcli software vib list” was working again.

A rescan operation in the vSphere Client was also successful. It seems that the root cause of the problem was missing files under /var/db/esximg.

Please don’t ask why this has happened. I really have no idea. But VMware KB2043170 (Initializing the VMware vCenter Update Manager database without reinstalling it) isn’t always the solution for “error code 99”, as is sometimes claimed on the internet. Always analyze the problem first and separate the likely causes from the unlikely ones.

Data Protector: Exchange backup fails because of database lock

Today I had a customer call where an Exchange 2010 backup repeatedly failed. HPE Data Protector was unable to create a differential or incremental backup. For each database, the following error was logged:

Interestingly, there was no other backup session running. But the night before, the backup jobs failed because of a network failure.

The solution is easy. This error is caused by incorrect information in the Data Protector database. To remove it, open an administrative command prompt on the Data Protector Cell Manager and run this omnidbutil command:

This command will free up the locked resources in the Data Protector database. Then, run the job again.

ALE OmniSwitch stack does not form due to incompatible licenses

Today I saw an interesting behaviour of two Alcatel-Lucent Enterprise OmniSwitch 6450. Both switches had been configured as a stack, but one of the switches showed a flashing ID after startup, and the stack did not form. While I checked the logs and the status of the stack, I noticed that the slot number was incorrect. Furthermore, the status showed “INC-LIC”.

According to the stack status and the switch logs, there seemed to be a problem with the licenses. So I checked the installed licenses on both switches. One switch showed a Metro license:

The other switch did not:

Don’t be confused by the slot numbering. I had pulled the stacking cable.

The solution was easy: I removed the Metro license from the switch that had it, and after a reboot of that switch the stack formed properly.

Using VCSA as remote syslog – Don’t forget the log rotation!

Important note: It seems that vCenter Server Appliance updates revert the changes. Please check the settings after each update!

The VMware vCenter Server Appliance (VCSA) can act as a remote syslog destination for ESXi hosts. This is very handy for troubleshooting and I really recommend using this feature. But VMware ESXi hosts can be really chatty and therefore it’s a good idea to keep an eye on the free disk space of the VCSA.

Yesterday, a colleague had an interesting support case. A customer reported that his Veeam Backup & Replication jobs failed and that he was unable to log in to the vCenter with the vSphere Client and vSphere Web Client. My colleague checked the VCSA VM and noticed that the VPXD failed to start (“Waiting for vpxd to initialize: ….failed”).

Together we checked the appliance and the log files. The vpxd.log (/var/log/vmware/vpx) had last been updated weeks ago, but the last entry was interesting: No space left on device. But there was free disk space on /storage/log. I immediately checked the inode count with df -i and there it was: No free inodes.

Why is this a problem? Each file and directory in the file system consumes an inode. If there are no free inodes, no new directories and files can be created. The error message is the same as for missing disk space.

Something had to have created a lot of files on /storage/log. Because /var/log/vmware is a symbolic link to /storage/log/vmware, it had to be something on the /storage/log partition. We checked the remote syslog location under /storage/log/remote and found gigabytes of data and an incredible number of log files. After removing the logs, the VPXD was able to start and the inode count was back at a normal level.

But why were there so many logs? We checked the logrotate config and found a faulty config for the remote syslog files. Instead of rotating logs and removing old ones, this config rotated all logs every day and multiplied the number of log files. Please note that there is no logrotate config to rotate remote syslog files by default! This one was added manually.
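The inode check that df -i performs can also be done from a script, which is handy for monitoring. This is a minimal sketch in Python under a few assumptions: it uses `os.statvfs` (available on Unix-like systems only), the helper names are made up for illustration, and the percentage is computed from the total and free inode counts, which may differ slightly from what df -i prints.

```python
import os


def inode_usage_percent(total, free):
    """Percentage of inodes in use, or None if the file system reports no inode total."""
    if total <= 0:
        return None
    return 100.0 * (total - free) / total


def check_mount(path):
    """Read the inode counters of the file system that contains path."""
    stats = os.statvfs(path)
    return inode_usage_percent(stats.f_files, stats.f_ffree)


if __name__ == "__main__":
    # "/storage/log" is the mount point discussed in the text; it is skipped if absent.
    for mount in ("/", "/storage/log"):
        if hasattr(os, "statvfs") and os.path.isdir(mount):
            print(mount, check_mount(mount))
```

A value close to 100 on /storage/log would have pointed to the problem before the VPXD refused to start.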

This is the default config for the remote syslog-collector of the VCSA:

As you can see, with these settings a folder for each host and each month is created. According to this VMTN posting, we changed the syslog-collector config a bit:

With these settings, only a single file per host is created. We also changed /etc/logrotate.d/syslog and added this at the end:

With this configuration, 30 log files will be preserved. The number of log files or how often log rotation should happen (weekly or daily) can easily be adjusted. But these settings should be sufficient for small environments.

It’s important to understand that the VCSA has different disks and that the disks are mounted to different mount points within the root filesystem. This is from a vSphere 5.5 VCSA:

/var/log/vmware and /var/log/remote are links to /storage/log/vmware and /storage/log/remote. Make sure that there is always enough free disk space on ALL disks! I also want to highlight VMware KB2092127 (After upgrading to vCenter Server Appliance 5.5 Update 2, pg_log file reports this error: WARNING: there is already a transaction in progress). This error hit me a couple of times…

Chicken-and-egg problem: 3PAR VSP 4.3 MU1 & 3PAR OS 3.2.1 MU3

Since Monday I’m helping a customer to put two HP 3PAR StoreServ 7200c into operation. Both StoreServs came factory-installed with 3PAR OS 3.2.1 MU3, which is available since July 2015. Usually, the first thing you do is to deploy the 3PAR Service Processor (SP). These days this is (in most cases) a Virtual Service Processor (VSP). The SP is used to initialize the storage system. Later, the SP reports to HP and it’s used for maintenance tasks like shutting down the StoreServ and installing updates and patches. There are only a few cases in which you start the Out-of-the-Box (OOTB) procedure of the StoreServ without having a VSP. I deployed two VSPs (one for each StoreServ), started the Service Processor Setup Wizard, entered the StoreServ serial number and got this message:

[Screenshot: 3par_vsp_error. Credit: Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0]

“No uninitialized storage system with the specified serial number could be found”. I double checked the network setup, VLANs, switch ports etc. The error occurred with BOTH VSPs and BOTH StoreServs. I started the OOTB on both StoreServs using the serial console. My plan was to import the StoreServs later into the VSPs. To do this, I tried to set up the VSP using the console interface. I logged in as root (no password) and tried the third option: Setup SP with original SP ID.

[Screenshot: 3par_vsp_error_console. Credit: Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0]

Not the worst idea, but unsuccessful. I entered the SP ID, the SP networking details, a lot of other stuff, the serial number of the StoreServ, the IP address and the credentials, and finally got this message:

Hmm… I knew that P003 was mandatory for the VSP 4.3 MU1 and 3PAR OS 3.2.1 MU3. But could the missing patch cause this behaviour? I called HP and explained my guess. After a short remote session this morning, the support case was escalated to the 2nd level.

While waiting for the 2nd level support, I was thinking about a solution. I knew that earlier releases of the VSP don’t check the serial number of the StoreServ or the version of the 3PAR OS. So I grabbed a copy of the VSP 4.1 MU2 with P009 and deployed the VSP. This time, I was able to finish the “Moment of Birth” (MOB). This release also asked for the serial number, the IP address and login credentials, but it didn’t check the version of the 3PAR OS (or it doesn’t care if it’s unknown). At this point I had a functional SP running software release 4.1 MU2. I upgraded the SP to 4.3 MU1 with the physical SP ISO image and installed P003 afterwards. Now I was able to import the StoreServ 7200c with 3PAR OS 3.2.1 MU3.

I don’t know how HP covers this during the installation service. AFAIK there is no VSP 4.3 MU1 with P003 available and I guess HP ships all new StoreServs with 3PAR OS 3.2.1 MU3. If you upgrade from an earlier 3PAR OS release, please make sure that you install P003 before you update the 3PAR OS. The StoreServ Refresh matrix clearly says that P003 is mandatory. The release notes for the HP 3PAR Service Processor (SP) Software SP-4.3.0 MU1 P003 also indicate this:

SP-4.3.0.GA-24 P003 is a mandatory patch for SP-4.3.0.GA-24 and 3.2.1.MU3.

I’m excited to hear from the HP 2nd level support. I will update this blog post if I have more information.

EDIT

Together with the StoreServ 8000 series, HP released a new version of the 3PAR Service Processor. The new version 4.4 is necessary for the new StoreServ models, but it also supports 3PAR OS < 3.2.2 (which is the GA release for the new StoreServ models). So if you get a new StoreServ 7000 with 3PAR OS 3.2.1 MU3, simply deploy an SP version 4.4.

Microsoft Exchange 2013 shows blank ECP & OWA after changes to SSL certificates

EDIT
This issue is described in KB2971270 and is fixed in CU6.

I have run into this error a couple of times. After applying changes to SSL certificates (adding, replacing or deleting an SSL certificate) and rebooting the server, the event log is flooded with events from source “HttpEvent” and event id 15021. The message says:

If you try to access the Exchange Control Panel (ECP) or Outlook Web Access (OWA), you will get a blank website. To solve this issue, open up an elevated command prompt on your Exchange 2013 server.

Check the certificate hash and application ID for 0.0.0.0:443, 0.0.0.0:444 and 127.0.0.1:443. You will notice that the application ID for these three entries is the same, but the certificate hash for 0.0.0.0:444 differs from the other two entries. And that’s the point. Remove the certificate for 0.0.0.0:444.

Now add it again with the correct certificate hash and application ID.

That’s it. Reboot the Exchange 2013 server and everything should be up and running again.
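The comparison of certificate hashes across the three bindings can be expressed as a small check. The following Python sketch only illustrates the logic: it assumes the bindings and hashes have already been collected by hand into a dictionary, the function name `mismatched_bindings` is made up, and the hash values in the demo are obvious placeholders, not real certificate hashes.

```python
def mismatched_bindings(bindings, reference="0.0.0.0:443"):
    """Return the bindings whose certificate hash differs from the reference binding."""
    expected = bindings[reference].lower()
    return sorted(
        binding
        for binding, cert_hash in bindings.items()
        if cert_hash.lower() != expected
    )


if __name__ == "__main__":
    # Placeholder hashes for illustration only.
    example = {
        "0.0.0.0:443": "AAAA1111",
        "0.0.0.0:444": "BBBB2222",
        "127.0.0.1:443": "aaaa1111",
    }
    print(mismatched_bindings(example))
```

The comparison is case-insensitive because the same hash may be displayed in upper or lower case depending on the tool.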

DataCore mirrored virtual disks full recovery fails repeatedly

Last Sunday a customer suffered a power outage for a few hours. Unfortunately the DataCore Storage Server in the affected datacenter wasn’t shut down and therefore crashed. After the power was back, the Storage Server was started and the recoveries for the mirrored virtual disks started. Hours later, three mirrored virtual disks were still running full recoveries and the recovery for each of them failed repeatedly.

[Screenshot: virtual_disk_error_ds10_mirror. Credit: Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0]

The recovery ran until a specific point, failed and started again. When the recovery failed, several events were logged on the Storage Server in the other datacenter (the Storage Server that wasn’t affected by the power outage):

Source: DcsPool, Event ID: 29

Source: disk, Event ID: 7

Source: Cissesrv, Event ID: 24606

The DataCore support quickly confirmed what we already knew: We had trouble with the backend storage on the DataCore Storage Server that was serving the full recoveries for the recovering Storage Server. The full recoveries ran until the point at which a non-readable block was hit. Clearly a problem with the backend storage.

Summary

To summarize this very painful situation:

  • VMFS datastore with productive VMs on DataCore mirrored virtual disks with no redundancy
  • Trouble with the backend storage on the DataCore Storage Server that was serving the mirrored virtual disks with no redundancy

Next steps

The customer and I decided to evacuate the VMs from the three affected datastores (each mirrored virtual disk represents a VMFS datastore). To avoid more trouble, we decided to split the unhealthy mirrors. So we had three single virtual disks. After the shutdown of the VMs on the affected datastores, we started one storage vMotion at a time to move the VMs to other datastores. This worked until the storage vMotion hit the non-readable blocks. The storage vMotions failed and the single virtual disks also went into the status “Failed”.

After that, we mounted the single virtual disks from the other DataCore Storage Server (the one that was affected by the power outage and which was running the full recoveries). We expected that the VMFS on the single virtual disks was broken, but to our surprise we were able to mount the datastores. We moved the VMs from the datastores to other datastores. This process was flawless.

Just to make this clear: We were able to mount the VMFS on virtual disks that were in the status “Full Recovery pending”. I was quite sure that there was garbage on the disks, especially if you consider that there was a full recovery running that never finished.

The only way to remove the logical block errors is to rebuild the logical drive on the RAID controller. This means:

  • Pray for good luck
  • Break all mirrored virtual disks
  • Remove the resulting single virtual disks
  • Remove the disks from the DataCore disk pool
  • Remove the DataCore disk pool
  • Remove the logical drives on the RAID controller
  • Remove the arrays on the RAID controller
  • Replace the faulty physical disks
  • Rebuild the arrays
  • Rebuild the logical drives
  • Create a new DataCore disk pool
  • Add disks to the DataCore disk pool
  • Add mirrors to the single virtual disks
  • Wait until the full recoveries have finished
  • Treat yourself to a beer

Final words

This was very, very painful and, unfortunately, not the first time I had to do this for this customer. The customer is in close contact with the vendor of the backend storage to identify the root cause.

Windows guest customization fails after cloning a VM

Last week I got a call from a customer. The customer had tried to deploy new Citrix XenApp servers, and because the VMware template was a bit outdated, he tried to clone a provisioned and running Citrix XenApp VM. During this, the customer applied a guest customization specification to customize the guest OS (IP address, hostname etc). Until this point everything was fine. But after the clone process, the guest customization started and never finished.

[Screenshot: vm_deployment_1. Credit: Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0]

Using the VMware template, deployment and customization were successful. So the main problem was that the customer was unable to use a provisioned and running Windows guest to deploy new Windows guests. I checked the logs and found these error messages in the setupact.log (you can find this log under C:\windows\system32\sysprep\panther):

I checked the rearm count with slmgr.vbs /dlv and saw that the remaining count was 1.

Cloning and customizing a Windows VM with a rearm count of 1 leads to the observed behaviour. After the cloning and the start of the customization, the rearm count is 0. Microsoft describes this behaviour in KB929828.

CAUSE

This error may occur if the Windows Software Licensing Rearm program has run more than three times in a single Windows image.

RESOLUTION
To resolve this issue, you must rebuild the Windows image.

vExpert Maish Saidel-Keesing wrote about this in his blog in 2011. He explained it very well; make sure that you read his three blog posts!

In my case, rebuilding the template wasn’t an option. Therefore I had to reset the rearm count. I searched a while and found a solution that has worked for me. I’m quite sure that Microsoft doesn’t allow this, therefore I will not describe this procedure in detail. You will find it easily on the web…

The main task is to remove the WPA registry key. This key is protected under normal operation, so you have to do this using WinRE (Windows Recovery Environment) or WinPE (Windows Preinstallation Environment). After the removal of the WPA registry key, reboot the VM, add a new key using slmgr.vbs /ipk and activate the Windows installation. You can check the rearm counter using slmgr.vbs /dlv and you will notice that the rearm counter has been reset.

Always keep in mind that you can’t use sysprep with a Windows installation an infinite number of times.

HP StoreOnce: Avoid special characters in NAS share description

This posting is ~5 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

While I was playing with my shiny, new HP StoreOnce VSA in my lab, I noticed a curious behavior. I created a NAS share for some tests with Veeam Backup & Replication. Creating a new share is nothing fancy. You can create a share in two ways:

  • using the GUI, or
  • using the CLI

So I created a new share:

[Screenshot: storeonce_create_share_gui_01. Credit: Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0]

Nothing special, as you can see. I opened up an Explorer window, typed in the IP address of my StoreOnce VSA and… saw no share.

[Screenshot: storeonce_access_share_01. Credit: Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0]

I repeated this process a couple of times, always with the same result. Then I went to the CLI and checked the newly created share:

So far, so good. I removed the share and tried to create the share using the CLI:

The command failed, no share was created. I verified the syntax, but the syntax of the command was correct. I started to simplify the command and removed the description.

The share was added with the default description. I removed the share and tried it again with my description. The command failed again. After removing the ampersand (&) from the description, the share could be added. I tried the same from the GUI. Using the GUI, a share with an ampersand (&) in the description field could be added, but it wasn’t accessible, even after I removed the ampersand (&) from the share description. I had to remove and re-create the share with a valid description. Unfortunately the GUI allows you to create the share, even if the CLI command fails with the same settings. The GUI also doesn’t allow you to create the share with an empty description.

At this point, I can’t say if this is a bug or a known behaviour. I’m in contact with HP to clarify this. But you should avoid the usage of special characters in the NAS share description.

EDIT

Today, I got an e-mail from the HP StoreOnce Engineering. They informed me that it’s not only the ampersand (&) you should avoid. You should avoid a set of special characters:

  • `
  • *
  • &
  • %
  • +
  • multiple spaces in a row

These characters can cause minor issues with Windows tools, like the Explorer. As a result, these special characters were banned in the latest 3.12.x CIFS server code. However, this ban was not communicated in the GUI. As a fix, the ban will be lifted from the 3.12.2 software onwards to allow the use of the above mentioned special characters.
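If you create shares with scripts, you can check a description against this list before sending it to the StoreOnce. This is a minimal sketch in Python; the function name `description_problems` and the wording of its messages are made up for illustration, and the rule set is only the list given above, not an official validation routine.

```python
import re

# Characters named in the e-mail from HP StoreOnce Engineering.
FORBIDDEN_CHARS = "`*&%+"


def description_problems(description):
    """Return a list of reasons why a share description may be rejected."""
    problems = [
        f"forbidden character {char!r}"
        for char in FORBIDDEN_CHARS
        if char in description
    ]
    if re.search(r" {2,}", description):
        problems.append("multiple spaces in a row")
    return problems


if __name__ == "__main__":
    print(description_problems("Veeam Backup & Replication"))
    print(description_problems("Veeam Backup and Replication"))
```

An empty result list means the description contains none of the characters listed above.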

vCenter Server Appliance: Troubleshooting full database partition

A customer of mine had a full database partition on a VMware vCenter Server Appliance twice within 6 months. After the first outage, the customer increased the size of the partition which is mounted to /storage/db. Some months later, a few days ago, the vCSA became unresponsive again, once more because of a filled up database partition. The customer increased the size of the database partition again (~ 200 GB!!) and today I had time to take a look at this nasty vCSA.

The situation

[Screenshot: vcsa_overview. Credit: Patrick Terlisten / www.vcloudnine.de / Creative Commons CC0]

Within 2 days, the storage usage of the database increased from 75% to 77%. First, I checked the size of the database:

As you can see, the database was only 2 GB in size. The pg_log directory was more interesting:

The directory was full of log files. The log files contained only one message:

The solution

This led me to VMware KB2092127 (After upgrading to vCenter Server Appliance 5.5 Update 2, pg_log file reports this error: WARNING: there is already a transaction in progress). And yes, this appliance had very likely been upgraded to U2. The workaround is described in KB2092127 and is really easy to implement. Please note that this is only a workaround. There’s currently no solution, as mentioned in the article.
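To see quickly whether a log directory such as pg_log is the one eating the partition, you can count its files and sum up their sizes. The following Python sketch is only an illustration: the function name `directory_summary` is made up, the path has to be supplied by you, and the demo runs against a temporary directory rather than a real vCSA.

```python
import os
import tempfile


def directory_summary(path):
    """Return (file_count, total_bytes) for all regular files below path."""
    count = 0
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for name in filenames:
            full = os.path.join(dirpath, name)
            # Skip symbolic links so that linked files are not counted twice.
            if os.path.isfile(full) and not os.path.islink(full):
                count += 1
                total += os.path.getsize(full)
    return count, total


if __name__ == "__main__":
    # Demonstration on a throwaway directory with two small files.
    with tempfile.TemporaryDirectory() as root:
        for name, content in (("a.log", b"abc"), ("b.log", b"defgh")):
            with open(os.path.join(root, name), "wb") as handle:
                handle.write(content)
        print(directory_summary(root))
```

Comparing the result for the database directory with the result for the log directory shows at a glance where the space went.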