Tag Archives: troubleshooting

Exchange DAG member dies during snapshot creation

This posting is ~5 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Yesterday, a customer called me and told me about a scary observation on one of his Exchange 2016 DAG (Database Availability Group) nodes.

In preparation for a security check, my customer created a snapshot of an Exchange 2016 DAG node. This node is part of a two-node Windows Server 2012 R2/ Exchange 2016 CU7 cluster.

It was instantly clear that something had gone wrong when the first alarm messages were received. My customer opened a console window and saw that the VM was booting.

What went wrong?

Nothing. Everything worked as designed, except that the observed behaviour was not intended.

The logs clearly showed that a snapshot was created. The interesting part was the amount of time the snapshot creation took: 5 minutes from the start of the snapshot creation until the task finished. During this time, quite a lot of data was written to the disks.

VMware vSphere Throughput Snapshot Creation

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

The server eventlog contained an entry that pointed me in the right direction.

Event Type: Information
Event ID: 1001
Source: BugCheck
Description: The computer has rebooted from a bugcheck. The bugcheck was 0x0000009E (0xffffe0001eccf900, 0x000000000000003c, 0x000000000000000a, 0x0000000000000000).

The Ask the Core Team blog has a nice post about this STOP error. In short: The failover clustering service incorporates a detection mechanism that may detect unresponsive user-mode processes. If an unresponsive user-mode process is detected, a HangRecoveryAction is called. Since Windows Server 2008, this action causes a STOP error (bugcheck) on the cluster node.
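To make the event entry above easier to read, here is a minimal Python sketch that pulls the bugcheck code and its four parameters out of the event description. The parameter names are my reading of Microsoft's bug check reference for 0x9E (USER_MODE_HEALTH_MONITOR) and should be treated as an assumption; the function and field names are made up for this example.

```python
import re

# Assumed meaning of the four parameters of bugcheck 0x9E
# (USER_MODE_HEALTH_MONITOR); the names are chosen for this sketch.
PARAM_NAMES = ("process", "timeout_seconds", "watchdog_source", "reserved")


def parse_bugcheck(description: str) -> dict:
    """Extract the bugcheck code and its parameters from an event 1001 text."""
    match = re.search(r"bugcheck was (0x[0-9a-fA-F]+) \(([^)]*)\)", description)
    if match is None:
        raise ValueError("no bugcheck code found in description")
    code = int(match.group(1), 16)
    params = [int(p.strip(), 16) for p in match.group(2).split(",")]
    return {"code": code, **dict(zip(PARAM_NAMES, params))}


# Event text as logged on the affected DAG node.
event = (
    "The computer has rebooted from a bugcheck. The bugcheck was 0x0000009E "
    "(0xffffe0001eccf900, 0x000000000000003c, 0x000000000000000a, "
    "0x0000000000000000)."
)
info = parse_bugcheck(event)
print(hex(info["code"]), info["timeout_seconds"])
```

If the assumed parameter meaning is right, the second parameter (0x3c) is a health monitoring timeout of 60 seconds, which fits a VM that was frozen for much longer during the snapshot.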

Most likely hypothesis

My explanation of the observed behaviour is that my customer accidentally created a snapshot that included the VM memory. Because the Exchange server has 32 GB of memory, the snapshot creation took some time, and the VM became unresponsive. As soon as the VM was responding again, the HangRecoveryAction did its dirty job.

Make sure that the checkbox for the VM memory is unchecked before you create a snapshot. Otherwise, the bugcheck will do its job. Please note that you might see this behaviour in all Microsoft Windows Failover Clusters, not only with Microsoft Exchange.

Exchange receive connector rejects incoming connections

As part of a bigger Microsoft Exchange migration, one of my customers moved the in- and outbound mailflow to a newly installed mail relay cluster. We modified the MX records to move the mailflow to the new mail relay, because the customer also switched the ISP. While we changed the MX records for ~40 domains, and more and more mails were received through the new mail relay cluster, we noticed events from MSExchangeTransport (event id 1021):

Receive connector Default Frontend EXCHANGE rejected an incoming connection
from IP address 192.168.xxx.xxx. The maximum number of connections per source
(20) for this connector has been reached by this source IP address.

192.168.xxx.xxx is the mail relay cluster, which is used for the in- and outbound mailflow.

This event indicates that the remote server has reached the maximum number of simultaneous incoming connections to the receive connector. This value is specified by the MaxInboundConnectionPerSource parameter, and the default value is 20. You can easily increase the value using the Set-ReceiveConnector cmdlet.

Set-ReceiveConnector -Identity "Default Frontend EXCHANGE" -MaxInboundConnectionPerSource 100

Microsoft has decreased this value over time. It was 100 in Exchange 2007, but 20 since Exchange 2010.
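To illustrate why a single relay in front of Exchange runs into this limit so quickly, here is a toy Python model of a per-source connection cap. It is only a sketch of the general idea, not how Exchange implements it; the class name and the relay address are made up for this example.

```python
from collections import Counter


class PerSourceLimiter:
    """Toy model of a per-source connection cap (not Exchange's actual code)."""

    def __init__(self, max_per_source: int = 20) -> None:
        self.max_per_source = max_per_source
        self.active = Counter()  # open connections per source IP

    def connect(self, source_ip: str) -> bool:
        """Accept the connection unless the source already holds the maximum."""
        if self.active[source_ip] >= self.max_per_source:
            return False
        self.active[source_ip] += 1
        return True

    def disconnect(self, source_ip: str) -> None:
        """Release one connection slot of the source."""
        if self.active[source_ip] > 0:
            self.active[source_ip] -= 1


relay = "192.0.2.10"  # hypothetical address of the mail relay cluster
limiter = PerSourceLimiter(max_per_source=20)
results = [limiter.connect(relay) for _ in range(25)]
print(results.count(True), "accepted,", results.count(False), "rejected")
```

Because all inbound mail arrives from the same relay address, every connection counts against the same bucket. That is why the limit that is fine for many independent senders becomes a bottleneck behind a relay.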

Roaming of AppData-Local breaks Windows 10 Start Menu

This posting is ~6 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

One of my customers has started a project to create a Windows 10 Enterprise (LTSB 2016) master for their VMware Horizon View environment. Besides the fact (okay, it is more of a personal feeling) that Windows 10 is a real PITA for VDI, I noticed an interesting issue during tests.

The issue

For convenience, I adopted some settings of the current Persona Management GPO for Windows 7 for the new Windows 10 environment. During the tests, the customer and I noticed a strange behaviour: After login, the start menu wouldn’t open. The only solution was to log off and delete the persona folder (most folders are redirected using native Folder Redirection, not the redirection feature of View Persona Management). While debugging this issue, I found this error in the eventlog.

Faulting application name: ShellExperienceHost.exe, version: 10.0.14393.477, time stamp: 0x5819bf85
Faulting module name: Windows.UI.Xaml.dll, version: 10.0.14393.477, time stamp: 0x58ba5c3d
Exception code: 0xc000027b
Fault offset: 0x00000000006d611b
Faulting process id: 0x1548
Faulting application start time: 0x01d1ab8009bce144
Faulting application path: C:\Windows\SystemApps\ShellExperienceHost_cw5n1h2txyewy\ShellExperienceHost.exe
Faulting module path: C:\Windows\System32\Windows.UI.Xaml.dll
Report Id: f38fda11-bd15-46d6-bf4a-d26a348218d5
Faulting package full name: Microsoft.Windows.ShellExperienceHost_10.0.14393.953_neutral_neutral_cw5n1h2txyewy
Faulting package-relative application ID: App

If you google this, you will find many, many threads about it. Most solutions say that you have to delete the profile due to wrong permissions on profile folders and/ or registry hives. I used Microsoft’s Procmon to verify this, but I was unable to confirm it. After further investigation, I found hints that the TileDataLayer database could be the problem. The database is located in AppData\Local\TileDataLayer\Database and stores the installed apps, programs, and tiles for the Start Menu. AFAIK it also includes the Start Menu layout.

Windows 10 Tile Data Layer Files

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

The database is part of the local part of the profile. A quick check proved that it’s sufficient to delete the TileDataLayer folder. It will be recreated during the next logon.

The solution

It’s simple: Don’t roam AppData\Local. It should not be necessary to roam the local part of a user’s profile. View Persona Management offers an option to roam the local part of the profile. You can configure this behaviour with a GPO setting.

Horizon View Persona Management Roaming GPO Settings

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

You can find this setting under Computer Configuration > Administrative Templates > VMware View Agent Configuration > Persona Management > Roaming & Synchronization

I was able to reproduce the observed behavior in my lab with Windows 10 Enterprise (LTSB 2016) and Horizon View 7.0.3. Because of this, I tend to recommend not to roam AppData\Local.

Checking the 3PAR Quorum Witness appliance

Two 3PAR StoreServs running in a Peer Persistence setup lost the connection to the Quorum Witness appliance. The appliance is an important part of a 3PAR Peer Persistence setup, because it acts as a tie-breaker in a split-brain scenario.

While analyzing this issue, I saw this message in the 3PAR Management Console:

3PAR Quorum Witness Status

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

In addition to that, the customer got e-mails that the 3PAR StoreServ arrays lost the connection to the Quorum Witness appliance. In my case, the CouchDB process died. A restart of the appliance brought it back online.

How to check the Quorum Witness appliance?

You can check the status of the appliance with a simple web request. The documentation shows a simple test based on curl. You can run this directly from the Bash shell of the appliance.

[[email protected] ~]# curl http://10.0.0.99:8080
{"couchdb":"Welcome","version":"1.0.4"}
[[email protected] ~]#

But you can also use the PowerShell cmdlet Invoke-WebRequest.

PS C:\Users\patrick> Invoke-WebRequest -Uri http://10.0.0.99:8080


StatusCode        : 200
StatusDescription : OK
Content           : {"couchdb":"Welcome","version":"1.0.4"}

RawContent        : HTTP/1.1 200 OK
                    Content-Length: 40
                    Cache-Control: must-revalidate
                    Content-Type: text/plain;charset=utf-8
                    Date: Mon, 30 Jan 2017 08:31:37 GMT
                    Server: CouchDB/1.0.4 (Erlang OTP/R14B04)

                    {"couchdb...
Forms             : {}
Headers           : {[Content-Length, 40], [Cache-Control, must-revalidate], [Content-Type, text/plain;charset=utf-8],
                    [Date, Mon, 30 Jan 2017 08:31:37 GMT]...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 40

If you add /witness to the URL, you can test the access to the database, which is used for Peer Persistence.

PS C:\Users\patrick> Invoke-WebRequest -Uri http://10.0.0.99:8080/witness


StatusCode        : 200
StatusDescription : OK
Content           : {"db_name":"witness","doc_count":5,"doc_del_count":4,"update_seq":149557915,"purge_seq":0,"compact_
                    running":false,"disk_size":48988254,"instance_start_time":"1485763322826940","disk_format_version":
                    5,...
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 234
                    Cache-Control: must-revalidate
                    Content-Type: text/plain;charset=utf-8
                    Date: Mon, 30 Jan 2017 08:36:38 GMT
                    Server: CouchDB/1.0.4 (Erlang OTP/R14B04)

                    {"db_nam...
Forms             : {}
Headers           : {[Content-Length, 234], [Cache-Control, must-revalidate], [Content-Type,
                    text/plain;charset=utf-8], [Date, Mon, 30 Jan 2017 08:36:38 GMT]...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 234

If you get a connection error, check if the beam process is running.

[[email protected] ~]# netstat -tulpen |grep 8080
tcp        0      0 0.0.0.0:8080                0.0.0.0:*                   LISTEN      495        10726      1643/beam
[[email protected] ~]#

If not, reboot the appliance. This can be done without downtime. The appliance only comes into play if a failover occurs.
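If you want to run the two web requests from above on a regular basis, they can be sketched in Python as follows. This is only a minimal example using the standard library: the function names are made up, and the checks merely look for the fields that appear in the responses shown above ("couchdb": "Welcome" and "db_name": "witness").

```python
import json
import urllib.request


def parse_welcome(body: str) -> bool:
    """True if the body looks like the CouchDB welcome document."""
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    return isinstance(doc, dict) and doc.get("couchdb") == "Welcome"


def parse_witness(body: str) -> bool:
    """True if the body describes the database named 'witness'."""
    try:
        doc = json.loads(body)
    except ValueError:
        return False
    return isinstance(doc, dict) and doc.get("db_name") == "witness"


def check_witness(base_url: str, timeout: float = 5.0) -> dict:
    """Query the appliance and report which of the two checks succeeded."""
    checks = (("couchdb", "", parse_welcome), ("witness", "/witness", parse_witness))
    results = {}
    for name, path, parser in checks:
        try:
            with urllib.request.urlopen(base_url + path, timeout=timeout) as resp:
                results[name] = parser(resp.read().decode("utf-8"))
        except OSError:  # connection refused, timeout, HTTP error, ...
            results[name] = False
    return results


# Usage against the appliance from the examples above:
# print(check_witness("http://10.0.0.99:8080"))
```

A result of False for both checks would match the situation described above, where the CouchDB process had died.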

Replacing an expired lookup service SSL certificate on a vSphere PSC

This posting is ~7 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

A few days ago, I ran into a very nasty problem. Fortunately, it was in my lab. Some months ago, I replaced the certificates of my vCenter Server Appliance (VCSA), and I chose to use the VMware Certificate Authority (VMCA) as a subordinate of my AD-based enterprise CA. The VMCA was used as an intermediate CA. The certificates were replaced using the vSphere 6.0 Certificate Manager (/usr/lib/vmware-vmca/bin/certificate-manager), and I followed the instructions of KB2112016 (Configuring VMware vSphere 6.0 VMware Certificate Authority as a subordinate Certificate Authority).

The VCSA was migrated from vSphere 5.5, and with vSphere 5.5 I was also using custom certificates. These certificates were also issued by my AD-based enterprise CA, and they were migrated during the vSphere 5.5 > 6.0 migration. So in the end, I replaced custom certificates with VMCA (as an intermediate CA) certificates.

Everything was fine, until a power outage. After powering-on my VMs, I noticed several errors. After logging into the vSphere Web Client, I got an error message at the top of the page:

Error occurred while processing request. Check vSphere WebClient logs for details.

While searching for the cause, I checked the URL of the Platform Services Controller (https://vcsa1.lab.local/psc/login) and got this:

psc_error_1

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

HTTP Status 400 - An error occurred while sending an authentication request to the PSC Single Sign-On server - null

type Status report
message An error occurred while sending an authentication request to the PSC Single Sign-On server - null
description The request sent by the client was syntactically incorrect.

This error led me to KB2144086 (Updating certificates using certificate manager on vCenter Server or PSC 6.0 Update 1b fails), but I was able to prove that I had used different subject names for the different solution user certificates.

While digging in the PSC logs, I found this error in the /var/log/vmware/psc-client/psc-client.log:

Caused by: com.vmware.vim.vmomi.client.exception.VlsiCertificateException: Server certificate chain is not trusted and thumbprint doesn't match
        at com.vmware.vim.vmomi.client.http.impl.ThumbprintTrustManager.checkServerTrusted(ThumbprintTrustManager.java:217)
        at sun.security.ssl.AbstractTrustManagerWrapper.checkServerTrusted(Unknown Source)

        ... 71 more

Finally, I found Aaron Smith’s blog post “Troubleshooting Expired PSC Certificates with vSphere 6”. He had the same problem. I checked the certificate of the Lookup Service and there it was:

psc_error_2

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

This was the original custom certificate, issued by my AD-based enterprise CA, and installed on my vSphere 5.5 VCSA.

Aaron also offered the solution by referencing KB2118939 (Replacing the Lookup Service SSL certificate on a Platform Services Controller 6.0). I followed the instructions in KB2118939 and replaced the certificate of the Lookup Service with a certificate of the VMCA.

Take care of your certificates

With vSphere 6.0, the Lookup Service should be accessed through the HTTP Reverse Proxy. This proxy uses the machine certificate. Therefore, an expired Lookup Service certificate is not obvious. If you connect directly to the Lookup Service using port 7444, you will see the expired certificate. The Lookup Service certificate is not replaced when you replace the different solution user certificates.

If you have a vSphere 6.0 VCSA which was migrated from vSphere 5.5, and you have replaced the certificates on that vSphere 5.5 VCSA with custom certificates, you should check your Lookup Service certificate immediately! Follow KB2118939 for further instructions.
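A small helper can turn such a check into a number. The following Python sketch assumes that you already have the expiry date of the Lookup Service certificate in the form that "openssl x509 -noout -enddate" prints (for example, by piping the output of "openssl s_client -connect <psc>:7444" into it). The function name and the sample date are made up for this example.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(enddate_line: str, now: datetime) -> int:
    """Days until the certificate expires (negative if it has already expired).

    enddate_line is expected in the form 'notAfter=Jan  5 09:34:43 2018 GMT'.
    """
    not_after = enddate_line.strip().split("=", 1)[-1]
    expires = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_after), tz=timezone.utc
    )
    return (expires - now).days


# Made-up expiry date, checked against a fixed point in time.
sample = "notAfter=Jan  1 00:00:00 2016 GMT"
print(days_until_expiry(sample, datetime(2016, 1, 11, tzinfo=timezone.utc)))
```

In a real check you would pass datetime.now(timezone.utc) as the second argument and raise an alarm while the result is still comfortably positive.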

Credit to Aaron Smith for this blog post. Thank you!

Solving problems: A structured approach

What is a problem? A problem is an obstacle, that has to be surmounted. Solving a problem is connected with obstacles. Or more general: Problem solving is a process to get from an unsatisfactory to a satisfactory situation.

Most of us get paid for solving problems. It’s irrelevant if you are paid for solving a technical problem (e.g. My computer doesn’t work), or if you are paid to create solutions for customers (e.g. design infrastructure for a Citrix XenApp farm). In the end, you solve a problem.

Every problem has characteristics, that can be used to describe it.

  • Solubility

Not every problem is solvable. Think about “Squaring the circle”. But often a problem seems to be unsolvable because it’s not well defined. If the initial situation, the obstacle and the target situation are not clearly formulated, you won’t be able to solve the problem.

  • Decomposability

If you can decompose a problem into multiple subproblems, it is a hierarchical problem. Otherwise, it’s an elemental problem.

  • Effort

The effort to solve a problem varies. A problem may be theoretically solvable, but it may require such a high effort that it is practically unsolvable.

  • Subjectivity

Even if a problem is well defined, it appears different in regard to complexity for different people.

How to start?

First of all

  • Understand and define the problem

This is the most important part. Before you try to solve a problem, make sure that you have really understood the problem. Then you should define the problem. Only a clearly defined problem can be solved. And it’s much easier to solve a clearly defined problem than a vague one. If it’s a complex problem, then you should try to

  • Simplify or decompose the problem

A simplified problem can help you stay focused. If you can’t simplify a problem, you can try to divide it into subproblems. With a clearly defined, simplified/ structured problem, you can start to

  • Find the root cause

Collecting information is the key. Collect information about what happened before, during, and after the problem occurred. Identifying the root cause of a problem can be a time-consuming task. But let me say this clearly: Information is the key. Information that helps to find the root cause does not only come from observations (e.g. logs, error messages etc.). You can also use the results of systematic tests. Collect as much data as you can.

Sometimes it can be useful to create a hypothesis.

Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories.

If you see that System A is affected, but system B should be affected too, but it’s not, it might be time to create a hypothesis. With a hypothesis in mind, you can try to prove it. Test the hypothesis by performing tests and collecting data. This strategy is called “hypothesis testing”.

At some point, you should have identified the root cause. With the now known root cause, you can

  • Create solutions and select the best one

Sometimes it’s easy. But sometimes it’s not that easy. A trade-off analysis can help to identify the best of multiple solutions.

  • Create an action plan

Even if you only have to disable a specific feature, it’s a good idea to formulate an action plan. Even if it consists only of three lines… You should state clearly

  • WHAT you do
  • WHY you have to do it, and
  • HOW you plan to check it

With these steps, you should be well prepared. It doesn’t matter what kind of problem you are trying to solve: The process is basically the same.

Other problem solving methods

Over the years many problem solving methods have been developed. Kepner-Tregoe is one of them. Other well known methods are:

  • A3 Problem Solving
  • PDCA
  • Eight Disciplines (8D) Problem Solving
  • Failure mode and effects analysis (FMEA)

A3 Problem Solving was developed at Toyota for the Toyota Production System (TPS). It’s an often used method in Lean Manufacturing. A3 helps to solve problems by prescribing a structure (WHAT IS and WHAT IS NOT the problem, describe the problem, root cause, solution etc). This structure is placed on an A3 sheet of paper (that’s why it’s called A3). The process is based on the principles of Deming’s PDCA cycle.

PDCA, or Plan-Do-Check-Act (sometimes called the Shewhart cycle), was made popular by Dr. W. Edwards Deming. Plan-Do-Check-Act refers to the four phases of this cycle.

  • Plan: Plan the change
  • Do: implement the change
  • Check: Check the success of the implemented change
  • Act: Take action based on the results of “Check”

Eight Disciplines (8D) Problem Solving was developed by the Ford Motor Company. The D0 phase is the starting point for the 8D process, but it’s not counted.

  • D0:  Plan for solving the problem and determine the prerequisites
  • D1: Establish a team of people with the required skills and knowledge
  • D2: Describe the problem
  • D3: Define and implement containment actions
  • D4: Determine and verify the root causes
  • D5: Plan permanent corrective actions for the observed problem
  • D6: Implement the best permanent corrective actions
  • D7: Modify management systems to prevent a recurrence
  • D8: Congratulate your team!

The Failure mode and effects analysis (FMEA) is a highly structured, systematic approach for failure analysis. There are different types of FMEA analyses:

  • Functional
  • Design
  • Process

FMEA is based on inductive reasoning (forward logic). It follows a highly structured process, which can be represented as follows.

  • Structural analysis: A system is divided into its components
  • Functional analysis: Identify the function of each component
  • Failure analysis: Identify the possible failures for each component
  • Calculate the risk: Risk Priority Number = occurrence ranking x detection ranking x highest severity ranking
  • Optimize: Optimize the component to mitigate the risk
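The risk calculation from the list above can be sketched in a few lines of Python. The 1 to 10 ranking scale is a common convention and an assumption here, and the failure modes and their rankings are made up purely for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    description: str
    severity: int    # ranking, assumed scale 1 (harmless) to 10 (critical)
    occurrence: int  # ranking, assumed scale 1 (rare) to 10 (frequent)
    detection: int   # ranking, assumed scale 1 (always detected) to 10 (never)

    def rpn(self) -> int:
        """Risk Priority Number = severity x occurrence x detection."""
        for value in (self.severity, self.occurrence, self.detection):
            if not 1 <= value <= 10:
                raise ValueError("rankings must be between 1 and 10")
        return self.severity * self.occurrence * self.detection


# Hypothetical failure modes of a small server setup.
modes = [
    FailureMode("power supply", "single PSU fails", 8, 3, 2),
    FailureMode("disk", "RAID member fails", 5, 6, 4),
    FailureMode("backup", "backup job fails silently", 9, 2, 7),
]

# Highest risk first: these are the candidates for the "Optimize" step.
ranked = sorted(modes, key=FailureMode.rpn, reverse=True)
for mode in ranked:
    print(mode.rpn(), mode.component, "-", mode.description)
```

Sorting by the Risk Priority Number shows which component should be optimized first. In this made-up example, the silently failing backup ranks highest because it is so hard to detect.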

No matter what, stay organized

The key to successfully solving problems is to stay organized. Solving problems isn’t magic. It is a very structured process that gets better with increasing experience. Try to create your own, structured method. Or use one of the mentioned problem solving methods. But in general:

  • Always try to describe a problem
  • Try to simplify or break it into smaller problems
  • Search for and verify the root cause
  • Develop a solution

Windows receives wrong DNS server from DHCP after DHCPINFORM

Last week, I was unexpectedly booked by a customer who observed a problem in his network. Unfortunately, colleagues had worked on this network some days before (moving servers, routers etc. to a new pair of HP 7509 core switches).

It was quickly clear that some of the clients had received the wrong DNS servers from the DHCP server. The environment is a bit unusual. The customer is running two Active Directory domains (root and sub domain) in a single layer 2 broadcast domain. This is nothing unusual, but he is also running two DHCP servers in the same layer 2 broadcast domain. To get this working, the customer uses exclusion ranges and reservations. This guarantees that each client receives the correct DHCP information.

Observations

This is the (anonymized) output of a SUBDOM client with the correct IP settings.

   IPv4 Address  . . . . . . . . . . : 10.1.1.146(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . : 10.1.1.1
   DHCP Server . . . . . . . . . . . : 10.1.1.2
   DNS Servers . . . . . . . . . . . : 10.1.1.2
                                       10.10.1.2

And this is the output of the same client, after a reboot:

   IPv4 Address  . . . . . . . . . . : 10.1.1.146(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . : 10.1.1.1
   DHCP Server . . . . . . . . . . . : 10.1.1.2
   DNS Servers . . . . . . . . . . . : 172.16.1.2
                                       172.16.1.3

As you can see, the client got its DHCP information from the same DHCP server, but with the wrong DNS settings. The DNS servers are the servers from the ROOTDOM.

  • only clients from SUBDOM were affected
  • only Windows XP and Windows 7 clients were affected
  • Windows 8.1 and Windows 10 clients were not affected
  • Igel Thin Clients were not affected
  • after an “ipconfig /renew”, the correct DNS servers were registered
  • after an “ipconfig /release” and “ipconfig /renew”, the wrong DNS servers were registered
  • the same happened after a reboot
  • a Wireshark packet trace showed that the correct DNS information was included in the DHCPOFFER

The sum of these observations told me that this had nothing (or little) to do with the network changes. Interestingly, the correct DNS information was included in the DHCPOFFER, and the behaviour was only observed on Windows XP and Windows 7. In addition, only clients of SUBDOM were affected.

The smoking gun

The packet trace with Wireshark showed, that the correct DNS information was included in the DHCPOFFER. But I also saw, that the client has sent a DHCPINFORM, which was answered by all available DHCP servers.

dhcp_inform

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

DHCPINFORM is used by a client to discover more information, e.g. router, proxy, static routes… or DNS. The DHCPINFORM request was only sent after a reboot, or after a DHCPRELEASE and a subsequent DHCPDISCOVER. I saw that all available DHCP servers answered the DHCPINFORM request with a DHCPACK. This DHCPACK included the requested information, including DNS. I quickly developed the hypothesis, that the DHCPACK from the ROOTDOM DHCP servers was used by the client, to add the wrong DNS information to the configuration.
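The hypothesis can be illustrated with a toy model in Python. To be clear: this does not show how the Windows DHCP client really works. It only shows what would happen if a client let every DHCPACK that answers a DHCPINFORM overwrite its DNS list, compared to a client that only accepts the answer of the server that granted its lease. The DNS addresses are taken from the outputs above; the address of the ROOTDOM DHCP server is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DhcpAck:
    server: str         # DHCP server that sent the DHCPACK
    dns_servers: tuple  # DNS servers offered in this DHCPACK


def dns_last_ack_wins(acks):
    """Hypothesised behaviour: every DHCPACK overwrites the DNS list."""
    dns = ()
    for ack in acks:
        dns = ack.dns_servers
    return dns


def dns_from_lease_server(acks, lease_server):
    """Alternative: only trust the server that granted the lease."""
    for ack in acks:
        if ack.server == lease_server:
            return ack.dns_servers
    return ()


# Answers to one DHCPINFORM, in order of arrival.
acks = [
    DhcpAck("10.1.1.2", ("10.1.1.2", "10.10.1.2")),     # SUBDOM DHCP server
    DhcpAck("172.16.1.2", ("172.16.1.2", "172.16.1.3")),  # ROOTDOM (address assumed)
]

print(dns_last_ack_wins(acks))
print(dns_from_lease_server(acks, "10.1.1.2"))
```

In this model, the order in which the DHCPACK messages arrive decides which DNS servers end up in the client configuration.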

With this information, I quickly found references (Meraki, Laurent Gaffié) to a registry key that can be used to disable DHCPINFORM. This registry key is valid for Windows 2000, 2003, Windows XP, Windows Vista and Windows 7. The blog post from Laurent Gaffié is especially interesting:

A vulnerability in Windows DHCP (http://www.ietf.org/rfc/rfc2131.txt) was found on Windows OS versions ranging from Windows 2000 through to Windows server 2003.  This vulnerability allows an attacker to remotely overwrite DNS, Gateway, IP Addresses, routing, WINS server, WPAD, and server configuration with no user interaction.

It’s useful to disable DHCPINFORM, even if you don’t have a problem!

Disable DHCPINFORM

To disable DHCPINFORM, you must add a registry value for each network interface that shouldn’t send DHCPINFORM messages.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{Interface GUID} 
Value Type: DWORD
Value name: UseInform
DWORD Value: 0

Unfortunately, the GUID of the interface differs between clients. I built this Visual Basic script (like Dr. Frankenstein: Different sources, plugged together, but it works) to add the registry key. You can run this script as part of a startup script with a Group Policy.

You should test this script very carefully! I provide this script “AS IS” with no warranties.
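As an illustration of the same idea (this is not the script mentioned above), here is a minimal Python sketch that generates the content of a .reg file, which sets UseInform to 0 for a given list of interface GUIDs. The function name and the GUID are placeholders; on a real client, the GUIDs are the names of the subkeys below the Interfaces key.

```python
# Base key taken from the registry path shown above.
BASE_KEY = (
    r"HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services"
    r"\Tcpip\Parameters\Interfaces"
)


def use_inform_reg(interface_guids) -> str:
    """Build the text of a .reg file that sets UseInform=0 per interface."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for guid in interface_guids:
        guid = "{" + guid.strip("{}") + "}"  # normalise the curly braces
        lines.append(f"[{BASE_KEY}\\{guid}]")
        lines.append('"UseInform"=dword:00000000')
        lines.append("")
    return "\r\n".join(lines)


# Placeholder GUID, only for illustration.
reg_text = use_inform_reg(["01234567-89AB-CDEF-0123-456789ABCDEF"])
print(reg_text)
```

The sketch only produces text. Review the result and import it on a test client first, before you roll it out.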

I don’t know why this has happened. I assume that the customer had this problem for some time. But due to some strange effects, he never noticed it. One hypothesis is, that the sequence of the DHCPACK messages after a DHCPINFORM has an influence. The DHCP server of ROOTDOM was moved to the new core switches, and maybe this changed the sequence of the answer packets. But it’s only a hypothesis, not a theory.

VMware Update Manager reports “error code 99” during scan operation

After updating my lab to VMware vSphere 6.0 U2, one of my hosts continuously threw an error during an update scan.

Host returns ESX error code 99, unhandled exception has occurred

The first thing I’ve checked was the esxupdate.log on the affected ESXi host. This is the output, that was logged during a scan operation.

2016-04-04T13:42:13Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/esxcfg-advcfg', '-q', '-g', '/UserVars/EsximageNetTimeout']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/esxcfg-advcfg', '-q', '-g', '/UserVars/EsximageNetRetries']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/esxcfg-advcfg', '-q', '-g', '/UserVars/EsximageNetRateLimit']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: esxupdate: INFO: ---
Command: scan
Args: ['scan']
Options: {'nosigcheck': None, 'retry': 5, 'loglevel': None, 'cleancache': None, 'viburls': None, 'meta': ['http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip', 'http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip'], 'proxyurl': None, 'timeout': 30.0, 'cachesize': None, 'hamode': True, 'maintenancemode': None}
2016-04-04T13:42:14Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/bootOption', '-rp']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/bootOption', '-ro']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip to /tmp/tmpWW6WJC...
2016-04-04T13:42:17Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:17Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip to /tmp/tmpKfdI64...
2016-04-04T13:42:20Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:21Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:22Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/usr/sbin/vsish', '-e', '-p', 'cat', '/hardware/bios/dmiInfo']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:22Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/smbiosDump']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:22Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:22Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:23Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:23Z esxupdate: Transaction: DEBUG: Populating VIB list from all VIBs in metadata http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip; depots:
2016-04-04T13:42:23Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip to /tmp/tmpFQrdX3...
2016-04-04T13:42:23Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:23Z esxupdate: Transaction: DEBUG: Populating VIB list from all VIBs in metadata http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip; depots:
2016-04-04T13:42:23Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip to /tmp/tmplZxgm6...
2016-04-04T13:42:23Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR: An unexpected exception was caught:
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR: Traceback (most recent call last):
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/usr/sbin/esxupdate", line 238, in main
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:     cmd.Run()
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/Cmdline.py", line 113, in Run
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 244, in Scan
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 106, in _generateOperationData
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 88, in _getInstallProfile
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR: AttributeError: 'NoneType' object has no attribute 'Copy'
2016-04-04T13:42:24Z esxupdate: esxupdate: DEBUG: <<<

You might notice the “Unrecognized file vendor-index.xml in Metadata file” message. It is logged at INFO level, and I found the same message on the other hosts, so I excluded it from further research: it was unlikely to be related to the observed problem. I started searching for differences between the hosts and found that the output of “esxcli software vib list” was different on the faulty host.

This is the output on the faulty host:

[[email protected]:~] esxcli software vib list
Name         Version             Vendor  Acceptance Level  Install Date
-----------  ------------------  ------  ----------------  ------------
tools-light  6.0.0-2.34.3620759  VMware  VMwareCertified   2016-04-04
[[email protected]:~]

This is the output on a working host. You see the difference?

[[email protected]:~] esxcli software vib list
Name                           Version                                Vendor           Acceptance Level  Install Date
-----------------------------  -------------------------------------  ---------------  ----------------  ------------
net-tg3                        3.137l.v60.1-1OEM.600.0.0.2494585      BRCM             VMwareCertified   2016-03-03
elxnet                         10.5.121.7-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
ima-be2iscsi                   10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
lpfc                           10.5.70.0-1OEM.600.0.0.2159203         EMU              VMwareCertified   2016-03-03
scsi-be2iscsi                  10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
scsi-lpfc820                   8.2.4.157.70-1OEM.500.0.0.472560       Emulex           VMwareCertified   2016-03-03
hpe-build                      600.9.4.5.11-2494585                   HPE              PartnerSupported  2016-03-03
char-hpcru                     6.0.6.14-1OEM.600.0.0.2159203          Hewlett-Packard  PartnerSupported  2016-03-03
...
...
...
tools-light                    6.0.0-2.34.3620759                     VMware           VMwareCertified   2016-04-04
scsi-qla2xxx                   934.5.20.0-1OEM.500.0.0.472560         qlogic           VMwareCertified   2016-03-03
[[email protected]:~]
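If more than a handful of hosts have to be compared, the comparison can be scripted. The following is a minimal sketch and not part of the original troubleshooting: it assumes the text output of “esxcli software vib list” has been saved from both hosts, and the function names vib_names and missing_vibs are made up for illustration. It is meant to run on a workstation with Python, not on the ESXi host itself.

```python
def vib_names(esxcli_output):
    """Return the set of VIB names found in 'esxcli software vib list' text output."""
    names = set()
    for line in esxcli_output.splitlines():
        fields = line.split()
        # Skip blank lines, shell prompts and the header row.
        if not fields or line.startswith("[") or fields[0] == "Name":
            continue
        # Skip the dashed separator and "..." elision lines.
        if set(fields[0]) <= {"-", "."}:
            continue
        names.add(fields[0])
    return names


def missing_vibs(faulty_output, working_output):
    """Return the VIB names listed on the working host but not on the faulty one."""
    return sorted(vib_names(working_output) - vib_names(faulty_output))
```

A long result here would only confirm what the two listings already show: the faulty host reports almost nothing.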

Doesn’t look right… I investigated further, still searching for differences. And then I found two empty directories under /var/db/esximg.

[[email protected]:~] ls -l /var/db/esximg/*
/var/db/esximg/profiles:
total 0

/var/db/esximg/vibs:
total 0
[[email protected]:~]

The same directories were populated on the other, working hosts.

[[email protected]:~] ls -l /var/db/esximg/*
/var/db/esximg/profiles:
total 20
-r--r--r--    1 root     root         20238 Apr  4 11:23 %28Updated%29%20HPE-ESXi-6.0.0-Update1-iso-600.9.41594098800

/var/db/esximg/vibs:
total 732
-r--r--r--    1 root     root          1704 Apr  4 11:23 ata-pata-amd--1600059064.xml
-r--r--r--    1 root     root          1728 Apr  4 11:23 ata-pata-atiixp--1227646244.xml
-r--r--r--    1 root     root          1719 Apr  4 11:23 ata-pata-cmd64x-782653683.xml
-r--r--r--    1 root     root          1748 Apr  4 11:23 ata-pata-hpt3x2n-852032191.xml
-r--r--r--    1 root     root          1730 Apr  4 11:23 ata-pata-pdc2027x-236283737.xml
...
...
...
-r--r--r--    1 root     root         16416 Apr  4 11:23 vsanhealth--1252089272.xml
-r--r--r--    1 root     root          1726 Apr  4 11:23 xhci-xhci-1668869473.xml
[[email protected]:~]
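The check for empty directories can be automated as well. This is only a sketch under assumptions: the helper name empty_subdirs is hypothetical, and the only facts taken from the post are the base path /var/db/esximg and its two subdirectories, profiles and vibs. Whether a usable Python interpreter is available on a given host is not something this sketch assumes; the function works on any directory tree you point it at.

```python
import os


def empty_subdirs(base, names=("profiles", "vibs")):
    """Return the subdirectories of `base` that are missing or contain no entries."""
    empty = []
    for name in names:
        path = os.path.join(base, name)
        if not os.path.isdir(path) or not os.listdir(path):
            empty.append(path)
    return empty

# Example call for the path from this post:
#   empty_subdirs("/var/db/esximg")
```

On a healthy host the returned list should be empty. On the faulty host from this post it would contain both subdirectories.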

One possible solution was therefore to copy the missing files to the faulty host. I used SCP for this. To get SCP working, you have to enable the SSH Client in the ESXi firewall.

[[email protected]:/var/db] esxcli network firewall ruleset set --enabled true --ruleset-id=sshClient

After that, I copied the files from a working host to the faulty host. Please make sure that both hosts run the same build! In my case, they did. Don’t try to copy files from an older or newer build to the host!

[[email protected]:/var/db] scp -r esximg/ [email protected]:/var/db
The authenticity of host 'esx1 (192.168.200.33)' can't be established.
RSA key fingerprint is SHA256:OSzz9Kk4QDRtmj7ed2J+1qcniIhBVJuJVEKf/4+Gry4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'esx1,192.168.200.33' (RSA) to the list of known hosts.
Password:
sata-sata-sil-1748273158.xml                                                                                                                                                                               100% 1717     1.7KB/s   00:00
lsi-mr3-989864457.xml                                                                                                                                                                                      100% 1751     1.7KB/s   00:00
scsi-ips--1979861494.xml                                                                                                                                                                                   100% 1619     1.6KB/s   00:00
char-hpcru--1874046437.xml                                                                                                                                                                                 100% 1638     1.6KB/s   00:00
scsi-lpfc820--634308064.xml                                                                                                                                                                                100% 1663     1.6KB/s   00:00
net-tg3--917722591.xml                                                                                                                                                                                     100% 1707     1.7KB/s   00:00
ipmi-ipmi-devintf-1862766627.xml                                                                                                                                                                           100% 1719     1.7KB/s   00:00
...
...
...
scsi-aic79xx-757558775.xml                                                                                                                                                                                 100% 1643     1.6KB/s   00:00
hp-ams-1212738556.xml                                                                                                                                                                                      100% 2035     2.0KB/s   00:00
%28Updated%29%20HPE-ESXi-6.0.0-Update1-iso-600.9.41594098800                                                                                                                                               100%   20KB  19.8KB/s   00:00
[[email protected]:/var/db]
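Because a copy like this is only safe between hosts of the same build, it can be worth checking the build before copying anything. The sketch below assumes that the version string of each host has been collected with “vmware -v” and looks like “VMware ESXi 6.0.0 build-3620759”. That output format and the helper names parse_build and same_build are assumptions, not something shown in this post.

```python
import re

# Assumed format of the version string: "VMware ESXi <version> build-<number>"
_VERSION_RE = re.compile(r"VMware ESXi (\d+(?:\.\d+)*) build-(\d+)")


def parse_build(version_string):
    """Return (version, build number) parsed from a version string."""
    match = _VERSION_RE.search(version_string)
    if match is None:
        raise ValueError("unrecognized version string: %r" % version_string)
    return match.group(1), int(match.group(2))


def same_build(version_a, version_b):
    """True only if both version and build number are identical."""
    return parse_build(version_a) == parse_build(version_b)
```

If same_build returns False, the files under /var/db/esximg on the two hosts describe different images and should not be mixed.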

And because we are pros, we disable the SSH Client after using it.

[[email protected]:/var/db] esxcli network firewall ruleset set --enabled false --ruleset-id=sshClient
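If the copy is ever scripted, the "disable it afterwards" step is easy to forget when something fails halfway. One way to guarantee it is a context manager with a finally block. This is a sketch under assumptions: the name ssh_client_allowed and the injectable run parameter are made up for illustration, and only the esxcli command line itself is taken from this post.

```python
import subprocess
from contextlib import contextmanager


def _ruleset_command(enabled):
    # Same command line as used in the post; only the true/false value changes.
    return ["esxcli", "network", "firewall", "ruleset", "set",
            "--enabled", "true" if enabled else "false",
            "--ruleset-id=sshClient"]


@contextmanager
def ssh_client_allowed(run=subprocess.check_call):
    """Enable the sshClient ruleset and always disable it again on exit."""
    run(_ruleset_command(True))
    try:
        yield
    finally:
        # Runs even if the code inside the with-block raises an exception.
        run(_ruleset_command(False))

# Hypothetical usage (the scp arguments are placeholders):
#   with ssh_client_allowed():
#       subprocess.check_call(["scp", "-r", "esximg/", "<user>@<host>:/var/db"])
```

The run parameter exists only so that the behaviour can be tried out with a fake runner instead of a real esxcli.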

As expected, “esxcli software vib list” was working again.

[[email protected]:~] esxcli software vib list
Name                           Version                                Vendor           Acceptance Level  Install Date
-----------------------------  -------------------------------------  ---------------  ----------------  ------------
net-tg3                        3.137l.v60.1-1OEM.600.0.0.2494585      BRCM             VMwareCertified   2016-03-03
elxnet                         10.5.121.7-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
ima-be2iscsi                   10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
lpfc                           10.5.70.0-1OEM.600.0.0.2159203         EMU              VMwareCertified   2016-03-03
scsi-be2iscsi                  10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
scsi-lpfc820                   8.2.4.157.70-1OEM.500.0.0.472560       Emulex           VMwareCertified   2016-03-03
hpe-build                      600.9.4.5.11-2494585                   HPE              PartnerSupported  2016-03-03
char-hpcru                     6.0.6.14-1OEM.600.0.0.2159203          Hewlett-Packard  PartnerSupported  2016-03-03
...
...
...
tools-light                    6.0.0-2.34.3620759                     VMware           VMwareCertified   2016-04-04
scsi-qla2xxx                   934.5.20.0-1OEM.500.0.0.472560         qlogic           VMwareCertified   2016-03-03
[[email protected]:~]

A rescan operation in the vSphere Client was also successful. It seems that the root cause of the problem was the missing files under /var/db/esximg.

Please don’t ask why this happened. I really have no idea. But VMware KB2043170 (Initializing the VMware vCenter Update Manager database without reinstalling it) isn’t always the solution for “error code 99”, as is sometimes claimed on the internet. Always try to analyze the problem first, and separate the likely solutions from the unlikely ones.

Data Protector: Exchange backup fails because of database lock

This posting is ~7 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Today I had a customer call about an Exchange 2010 backup that repeatedly failed. HPE Data Protector was unable to create a differential or incremental backup. For each database, the following error was logged:

[Minor] From: OB2BAR_E2010_BAR@exchangeserver.domain.tld "MS Exchange 2010+ Server"  Time: 21.03.2016 20:00:27
[170:313] 	One or more copies of database DATABASE are already being backed up in a different session.

Interestingly, no other backup session was running. But the night before, the backup jobs had failed because of a network failure.

The solution is easy. This error is caused by incorrect information in the Data Protector database. To remove it, open an administrative command prompt on the Data Protector Cell Manager and run this omnidbutil command:

C:\Users\Administrator>omnidbutil -free_cell_resources
DONE!

C:\Users\Administrator>

This command frees up the locked resources in the Data Protector database. Then run the job again.
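To see which databases are affected before freeing the resources, the session messages can be searched for error 170:313. The sketch below assumes the session report is available as plain text. How that text is exported is left open here, and the function name locked_databases is made up for illustration. The message text in the pattern is the one quoted above.

```python
import re

# Matches the message shown in this post and captures the database name.
_LOCK_RE = re.compile(
    r"\[170:313\]\s+One or more copies of database (.+?) "
    r"are already being backed up"
)


def locked_databases(session_report):
    """Return the database names reported with error 170:313, without duplicates."""
    seen = []
    for name in _LOCK_RE.findall(session_report):
        if name not in seen:
            seen.append(name)
    return seen
```

If the list is not empty although no other session is running, that matches the situation described in this post.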

ALE OmniSwitch stack does not form due to incompatible licenses

This posting is ~7 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Today I saw an interesting behaviour of two Alcatel-Lucent Enterprise OmniSwitch 6450 switches. Both switches had been configured as a stack, but one of them showed a flashing ID after startup, and the stack did not form. While checking the logs and the status of the stack, I noticed that the slot number was incorrect. Furthermore, the status showed “INC-LIC”.

-> show stack topology
                                         Link A  Link A          Link B  Link B
NI      Role      State   Saved  Link A  Remote  Remote  Link B  Remote  Remote
                          Slot   State   NI      Port    State   NI      Port
----+-----------+--------+------+-------+-------+-------+-------+-------+-------
   1 PRIMARY     RUNNING    1    UP       1001   StackB  DOWN        0        0
1001 PASS-THRU   INC-LIC    2    DOWN        0        0  UP          1   StackA

-> show log swlog
<snip>
THU MAR 03 13:07:29 2016  STACK-MANAGER    info == SM == Stack Port A Status Changed: DOWN
THU MAR 03 13:07:29 2016  STACK-MANAGER    info == SM == NI 0 down notification sent to LAG
THU MAR 03 13:08:41 2016  STACK-MANAGER    info == SM == Stack Port A Status Changed: UP
THU MAR 03 13:08:41 2016  STACK-MANAGER    info == SM == Stack Port A MAC Frames TX/RX Enabled
THU MAR 03 13:08:42 2016  STACK-MANAGER    info  Retaining Module Id for slot 2 unit 0 as 1
THU MAR 03 13:08:46 2016  STACK-MANAGER    info == SM == An element enters passthru mode (incompatible license)
<snip>

According to the stack status and the switch logs, there seemed to be a problem with the licenses. So I checked the installed licenses on both switches. One switch showed a Metro license:

-> show license info
 NI          Application           License Type                Time Left
-----------+---------------------+---------------------------+--------------
 1              METRO                 Permanent                   0
 1              10G                   Permanent                   0
 1001           10G                   Permanent                   0

The other switch did not:

 -> show license info
 NI          Application           License Type                Time Left
-----------+---------------------+---------------------------+--------------
 1              10G                   Permanent                   0

Don’t be confused by the slot numbering: I had pulled the stacking cable.
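With more than two stack members, comparing the license tables by eye gets tedious. The following is a minimal sketch that parses the text of “show license info” as shown above and reports the applications that are licensed on one switch but not on the other. The function names licenses_by_ni and license_mismatch are hypothetical, and the parser assumes the column layout stays as in the listings in this post.

```python
def licenses_by_ni(show_license_output):
    """Map NI number -> set of licensed applications from 'show license info' text."""
    result = {}
    for line in show_license_output.splitlines():
        fields = line.split()
        # Data rows start with a numeric NI; header, separator and prompt lines do not.
        if len(fields) >= 2 and fields[0].isdigit():
            result.setdefault(int(fields[0]), set()).add(fields[1])
    return result


def license_mismatch(output_a, output_b, ni_a=1, ni_b=1):
    """Return the applications licensed on exactly one of the two switches."""
    a = licenses_by_ni(output_a).get(ni_a, set())
    b = licenses_by_ni(output_b).get(ni_b, set())
    return sorted(a ^ b)
```

Both switches report themselves as NI 1 in the listings above, which is why the NI arguments default to 1.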

The solution was easy: I removed the Metro license from the switch that had it, and after a reboot of that switch, the stack formed properly.

-> show stack topology
                                         Link A  Link A          Link B  Link B
NI      Role      State   Saved  Link A  Remote  Remote  Link B  Remote  Remote
                          Slot   State   NI      Port    State   NI      Port
----+-----------+--------+------+-------+-------+-------+-------+-------+-------
   1 SECONDARY   RUNNING    1    UP          2   StackB  DOWN        0        0
   2 PRIMARY     RUNNING    2    DOWN        0        0  UP          1   StackA