Tag Archives: troubleshooting

Checking the 3PAR Quorum Witness appliance

This posting is ~5 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Two 3PAR StoreServs running in a Peer Persistence setup lost the connection to the Quorum Witness appliance. The appliance is an important part of a 3PAR Peer Persistence setup, because it acts as a tie-breaker in a split-brain scenario.

While analyzing this issue, I saw this message in the 3PAR Management Console:

3PAR Quorum Witness Status

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

In addition to that, the customer got e-mails that the 3PAR StoreServ arrays lost the connection to the Quorum Witness appliance. In my case, the CouchDB process died. A restart of the appliance brought it back online.

How to check the Quorum Witness appliance?

You can check the status of the appliance with a simple web request. The documentation shows a simple test based on curl. You can run this direct from the BASH of the appliance.

[[email protected] ~]# curl http://10.0.0.99:8080
{"couchdb":"Welcome","version":"1.0.4"}
[[email protected] ~]#

But you can also use the PowerShell cmdlet Invoke-WebRequest.

PS C:\Users\patrick> Invoke-WebRequest -Uri http://10.0.0.99:8080


StatusCode        : 200
StatusDescription : OK
Content           : {"couchdb":"Welcome","version":"1.0.4"}

RawContent        : HTTP/1.1 200 OK
                    Content-Length: 40
                    Cache-Control: must-revalidate
                    Content-Type: text/plain;charset=utf-8
                    Date: Mon, 30 Jan 2017 08:31:37 GMT
                    Server: CouchDB/1.0.4 (Erlang OTP/R14B04)

                    {"couchdb...
Forms             : {}
Headers           : {[Content-Length, 40], [Cache-Control, must-revalidate], [Content-Type, text/plain;charset=utf-8],
                    [Date, Mon, 30 Jan 2017 08:31:37 GMT]...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 40

If you add /witness to the URL, you can test the access to the database, which is used for Peer Persistence.

PS C:\Users\patrick> Invoke-WebRequest -Uri http://10.0.0.99:8080/witness


StatusCode        : 200
StatusDescription : OK
Content           : {"db_name":"witness","doc_count":5,"doc_del_count":4,"update_seq":149557915,"purge_seq":0,"compact_
                    running":false,"disk_size":48988254,"instance_start_time":"1485763322826940","disk_format_version":
                    5,...
RawContent        : HTTP/1.1 200 OK
                    Content-Length: 234
                    Cache-Control: must-revalidate
                    Content-Type: text/plain;charset=utf-8
                    Date: Mon, 30 Jan 2017 08:36:38 GMT
                    Server: CouchDB/1.0.4 (Erlang OTP/R14B04)

                    {"db_nam...
Forms             : {}
Headers           : {[Content-Length, 234], [Cache-Control, must-revalidate], [Content-Type,
                    text/plain;charset=utf-8], [Date, Mon, 30 Jan 2017 08:36:38 GMT]...}
Images            : {}
InputFields       : {}
Links             : {}
ParsedHtml        : mshtml.HTMLDocumentClass
RawContentLength  : 234

If you get a connection error, check if the beam process is running.

[[email protected] ~]# netstat -tulpen |grep 8080
tcp        0      0 0.0.0.0:8080                0.0.0.0:*                   LISTEN      495        10726      1643/beam
[[email protected] ~]#

If not, reboot the appliance. This can be done without downtime. The appliance comes only into play, if a failover occurs.

Replacing an expired lookup service SSL certificate on a vSphere PSC

This posting is ~5 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

A few days ago, I ran into a very nasty problem. Fortunately, it was in my lab. Some months ago, I replaced the certificates of my vCenter Server Appliance (VCSA), and I’ve chosen to use the VMware Certificate Authority (VMCA) as a subordinate of my AD-based enterprise CA. The VMCA was used as intermediate CA. The certificates were replaced using the  vSphere 6.0 Certificate Manager (/usr/lib/vmware-vmca/bin/certificate-manager), and I followed the instructions of KB2112016 (Configuring VMware vSphere 6.0 VMware Certificate Authority as a subordinate Certificate Authority).

The VCSA was migrated from vSphere 5.5, and with vSphere 5.5 I was also using custom certificates. These certificates were also issued by my AD-based enterprise CA, and these certificates were migration during the vSphere 5.5 > 6.0 migration. So at the end, I replaced custom certificates with VMCA (as an intermediate CA) certificates.

Everything was fine, until a power outage. After powering-on my VMs, I noticed several errors. After logging into the vSphere Web Client, I got an error message at the top of the page:

Error occurred while processing request. Check vSphere WebClient logs for details.

While searching for the cause, I checked the URL of the Platform Services Controller (https://vcsa1.lab.local/psc/login) and got this:

psc_error_1

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

HTTP Status 400 - An error occurred while sending an authentication request to the PSC Single Sign-On server - null

type Status report
message An error occurred while sending an authentication request to the PSC Single Sign-On server - null
description The request sent by the client was syntactically incorrect.

This error led me to KB2144086 (Updating certificates using certificate manager on vCenter Server or PSC 6.0 Update 1b fails), but was able to proof, that I have used different subject names for the different solution user certificates.

While digging in the PSC logs, I found this error in the /var/log/vmware/psc-client/psc-client.log:

Caused by: com.vmware.vim.vmomi.client.exception.VlsiCertificateException: Server certificate chain is not trusted and thumbprint doesn't match
        at com.vmware.vim.vmomi.client.http.impl.ThumbprintTrustManager.checkServerTrusted(ThumbprintTrustManager.java:217)
        at sun.security.ssl.AbstractTrustManagerWrapper.checkServerTrusted(Unknown Source)

        ... 71 more

Finally, I found Aaron Smiths blog post “Troubleshooting Expired PSC Certificates with vSphere 6“, who had the same problem. I checked the certificate of the Lookup Service and there it was:

psc_error_2

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

This was the original custom certificate, issued by my AD-based enterprise CA, and installed on my vSphere 5.5 VCSA.

Aaron also offered the solution by referencing KB2118939 (Replacing the Lookup Service SSL certificate on a Platform Services Controller 6.0). I followed the instructions in KB2118939 and replaced the certificate of the Lookup Service with a certificate of the VMCA.

Take care of your certificates

With vSphere 6.0, the Lookup Service should be accessed through the HTTP Reverse Proxy. This proxy uses the machine certificate. Therefore, an expired Lookup Certificate is not obvious. If you connect directly to the Lookup Service using port 7444, you will see the expired certificate. The Lookup Service certificate is not replaced with a custom certificate, if you replace the different solution user certificates.

If you have a vSphere 6.0 VCSA, which was migrated from vSphere 5.5, and you have replaced the certificates on that vSphere 5.5 VCSA with custom certificates, you should check your Lookup Service certificate immidiately! Follow KB2118939 for further instructions.

Credit to Aaron Smith for this blog post. Thank you!

Solving problems: A structured approach

This posting is ~5 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

What is a problem? A problem is an obstacle, that has to be surmounted. Solving a problem is connected with obstacles. Or more general: Problem solving is a process to get from an unsatisfactory to a satisfactory situation.

Most of us get paid for solving problems. It’s irrelevant if you are paid for solving technical problem (e.g. My computer doesn’t work), or if you are paid to create solutions for customers (e.g. design infrastructure for a Citrix XenApp farm). At the end you solve a problem.

Every problem has characteristics, that can be used to describe it.

  • Solubility

Not every problem is solvable. Think about “Squaring the circle“. But often a problem seems to be unsolvable because it’s not well defined. If Initial situation, obstacle and target situation are not clearly formulated, you won’t be able to solve the problem.

  • Decomposability

If you can decompose a problem into multiple subproblems, it is a hierarchical problem. Otherwise, it’s an elemental problem.

  • Effort

The effort to solve a problem is always different.. A problem is theoretically solvable, but it may require such a high effort, that it is practically unsolvable.

  • Subjectivity

Even if a problem is well defined, it appears different in regard to complexity for different people.

How to start?

First of all

  • Understand and define the problem

This is most important part. Before you try to solve a problem, make sure that you have really understood the problem. Then you should define the problem. Only a clearly defined problem can be solved. And it’s much easier to solve a clearly defined, than a vague problem. If it’s a complex problem, then you should try to

  • Simplify or decompose the problem

A simplified problem can help you stay focused. If you can’t simplify a problem, you can try to divide it into subproblems. With a clearly defined, simplified/ structured problem, you can start to

  • Find the root cause

Collecting information is the key. Collect information about what happened before, during, and after the problem has occurred. Identifying the root cause for a problem can be a time consuming task. But let me say this clearly: Information is the key. Information that help to find the root cause are not only observations (e.g. logs, error messages etc.). You can can use the results of systematic tests. Collect as much data as you can.

Sometimes it can be useful to create a hypothesis.

Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories.

If you see that System A is affected, but system B should be affected too, but it’s not, it might be time to create a hypothesis. With a hypothesis in mind, you can try to prove it. Test the hypothesis by performing tests and collecting data. This strategy is called “hypothesis testing”.

At some point, you should have identified the root cause. With the now known root cause, you can

  • Create solutions and select the best one

Sometimes it’s easy. But sometimes it’ not that easy. A trade-off analysis can help to identify the best of multiple solutions.

  • Create an action plan

Even if you only have to disable a specific feature, it’s a good idea to formulate an action plan. Even if consists only of three lines… You should state clearly

  • WHAT you do
  • WHY do you have to do it, and
  • HOW to you plan to check it

With these steps, you should be well prepared. It doesn’t matter what kind of problem you are trying to solve: The process is basically the same.

Other problem solving methods

Over the years many problem solving methods have been developed. Kepner-Tregoe is one of them. Other well known methods are:

  • A3 Problem Solving
  • PDCA
  • Eight Disciplines (8D) Problem Solving
  • Failure mode and effects analysis (FMEA)

A3 Problem Solving has been developed at Toyota for their Toyota Production System (TPS). It’s an often used method in Lean Manufacturing. A3 helps to solve problems by pretending a structure (WHAT IS and WHAT IS NOT the problem, describe the problem, root cause, solution etc). This strucure is placed on an A3 sheet paper (that why it’s called A3). The process is based on the principles of Deming’s PDCA cycle.

PDCA, or Plan-Do-Check-Act (sometimes Shewhart-Cycle) was made popular by Dr. Edwards Deming. Plan-Do-Check-Act refers to the four phases of this cycle.

  • Plan: Plan the change
  • Do: implement the change
  • Check: Check the sucess of the implemented change
  • Act: Take action based on the results of “Check”

Eight Disciplines (8D) Problem Solving was developed by the Ford Motor Company. The D0 phase is the starting point for the D8 process, but it’s not counted.

  • D0:  Plan for solving the problem and determine the prerequisites
  • D1: Establish a team of people with the required skills and knowledge
  • D2: Describe the problem
  • D3: Define and implement containment actions
  • D4: Determine and verify the root causes
  • D5: Plan permanent corrective actions for the observed problem
  • D6: Implement the best permanent corrective actions
  • D7: Modify management systems to prevent a recurrence
  • D8: Congratulate your team!

The Failure mode and effects analysis (FMEA) is a highly structured, systematic approach for failure analysis. There are different FMEA alalyses:

  • Functional
  • Design
  • Process

FMEA is based on inductive reasoning (forward logic). FMEA is based on a highly structured process, which can be represented as followed.

  • Structural analysis: A system is divided into its components
  • Functional analysis: Identify the function of each component
  • Failure analysis: Identify the possible failures for each component
  • Calculate the risk: Risk Priority Number = occurrence ranking x detection ranking x highest severity ranking
  • Optimize: Optimize the component to mitigate the risk

No matter what, stay organized

The key to successfully solve problems is to stay organized. Solving problems isn’t magic. It is a very structured process that gets better with increasing experience. Try to create your own, structured method. Or use one of the mentioned problem solving methods. But in general:

  • Always try to describe a problem
  • Try to simplify or break it into smaller problems
  • Search and verify for the root cause
  • Develop a solution

Windows recieves wrong DNS server from DHCP after DHCPINFORM

This posting is ~5 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Last week, I was surprisingly booked by a customer who observed a problem in his network. Unfortunately, colleagues worked on this network some day before (moving servers, routers etc. to a new pair of HP 7509 new core switches).

It was quickly clear, that some of the clients have received the wrong DNS servers from the DHCP server. The environment is a bit unusual. The customer is running two Active Directory domains (root and sub domain) in a single layer 2 broadcast domain. This nothing unusual, but he is also running two DHCP servers in the same layer 2 broadcast domain. To get this working, the customer uses exclusion ranges and reservations. This guarantees, that the client receives the correct DHCP information.

Observations

It was quicky clear, that some of the clients have received the wrong DNS servers from the DHCP server. That is the (defaced) output of a SUBDOM client with the correct IP settings.

   IPv4 Address  . . . . . . . . . . : 10.1.1.146(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . : 10.1.1.1
   DHCP Server . . . . . . . . . . . : 10.1.1.2
   DNS Servers . . . . . . . . . . . : 10.1.1.2
                                       10.10.1.2

And this is the output of the same client, after a reboot:

   IPv4 Address  . . . . . . . . . . : 10.1.1.146(Preferred)
   Subnet Mask . . . . . . . . . . . : 255.255.0.0
   Default Gateway . . . . . . . . . : 10.1.1.1
   DHCP Server . . . . . . . . . . . : 10.1.1.2
   DNS Servers . . . . . . . . . . . : 172.16.1.2
                                       172.16.1.3

As you can see, the client got its DHCP information from the same DHCP server, but with the wrong DNS settings. The DNS servers are the servers from the ROOTDOM.

  • only clients from SUBDOM were affected
  • only Windows XP and Windows 7 clients were affected
  • Windows 8.1 and Windows 10 clients were not affected
  • Igel Thin Clients were not affected
  • after an “ipconfig /renew”, the correct DNS servers were registered
  • after an “ipconfig /release” and “ipconfig /renew”, the wrong DNS servers were registered
  • the same happened after a reboot
  • Wireshard packet trace showed, that the correct DNS information were included in the DHCPOFFER

The sum of observation told me, that this has nothing (or less) to do with the network changes. Interestingly, the correct DNS information were included in the DHCPOFFER and the behaviour was only observed on Windows XP and Windows 7. In addition, only clients of SUBDOM were affected.

The smoking gun

The packet trace with Wireshark showed, that the correct DNS information was included in the DHCPOFFER. But I also saw, that the client has sent a DHCPINFORM, which was answered by all available DHCP servers.

dhcp_inform

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

DHCPINFORM is used by a client to discover more information, e.g. router, proxy, static routes… or DNS. The DHCPINFORM request was only sent after a reboot, or after a DHCPRELEASE and a subsequent DHCPDISCOVER. I saw that all available DHCP servers answered the DHCPINFORM request with a DHCPACK. This DHCPACK included the requested information, including DNS. I quickly developed the hypothesis, that the DHCPACK from the ROOTDOM DHCP servers was used by the client, to add the wrong DNS information to the configuration.

With this information, I quickly found a references (MerakiLaurent Gaffié) to a registry key, that can be used to disable DHCPINFORM. This registry key is valid for Windows 2000, 2003, Windows XP, Windows Vista and Windows 7. Especially the blog post from Laurent Gaffié gave is interesting:

A vulnerability in Windows DHCP (http://www.ietf.org/rfc/rfc2131.txt) was found on Windows OS versions ranging from Windows 2000 through to Windows server 2003.  This vulnerability allows an attacker to remotely overwrite DNS, Gateway, IP Addresses, routing, WINS server, WPAD, and server configuration with no user interaction.

It’s useful to disable DHCPINFORM, even if you don’t have a problem!

Disable DHCPINFORM

To disable DHCPINFORM, you must add a registry key for the network interface, that shouldn’t sent DHCPINFORM messages.

HKEY_LOCAL_MACHINE\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces\{Interface GUID} 
Value Type: DWORD
Value name: UseInform
DWORD Value: 0

Unfortunately the GUID of the interface differs between clients. I build this Visual Basic script (like Dr. Frankenstein: Different sources, plugged together, but it works) to add the registry key. You can run this script as part of a startup script with a Group Policy.

You should test this script very carefully! I provide this script “AS IS” with no warranties.

I don’t know why this has happened. I assume that the customer had this problem for some time. But due to some strange effects, he never noticed it. One hypothesis is, that the sequence of the DHCPACK messages after a DHCPINFORM has an influence. The DHCP server of ROOTDOM was moved to the new core switches, and maybe this changed the sequence of the answer packets. But it’s only a hypothesis, not a theory.

VMware Update Manager reports “error code 99” during scan operation

This posting is ~6 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

After updating my lab to VMware vSphere 6.0 U2, one of my hosts continuously thrown an error during an update scan.

Host returns ESX error code 99, unhandled exception has occurred

The first thing I’ve checked was the esxupdate.log on the affected ESXi host. This is the output, that was logged during a scan operation.

2016-04-04T13:42:13Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/esxcfg-advcfg', '-q', '-g', '/UserVars/EsximageNetTimeout']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/esxcfg-advcfg', '-q', '-g', '/UserVars/EsximageNetRetries']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/esxcfg-advcfg', '-q', '-g', '/UserVars/EsximageNetRateLimit']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: esxupdate: INFO: ---
Command: scan
Args: ['scan']
Options: {'nosigcheck': None, 'retry': 5, 'loglevel': None, 'cleancache': None, 'viburls': None, 'meta': ['http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip', 'http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip'], 'proxyurl': None, 'timeout': 30.0, 'cachesize': None, 'hamode': True, 'maintenancemode': None}
2016-04-04T13:42:14Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/bootOption', '-rp']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/bootOption', '-ro']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:14Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip to /tmp/tmpWW6WJC...
2016-04-04T13:42:17Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:17Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip to /tmp/tmpKfdI64...
2016-04-04T13:42:20Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:21Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:22Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/usr/sbin/vsish', '-e', '-p', 'cat', '/hardware/bios/dmiInfo']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:22Z esxupdate: vmware.runcommand: INFO: runcommand called with: args = '['/sbin/smbiosDump']', outfile = 'None', returnoutput = 'True', timeout = '0.0'.
2016-04-04T13:42:22Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:22Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:23Z esxupdate: BootBankInstaller.pyc: DEBUG: Creating an empty ImageProfile for bootbank /bootbank
2016-04-04T13:42:23Z esxupdate: Transaction: DEBUG: Populating VIB list from all VIBs in metadata http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip; depots:
2016-04-04T13:42:23Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/10960002/metadata_1456989617.zip to /tmp/tmpFQrdX3...
2016-04-04T13:42:23Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:23Z esxupdate: Transaction: DEBUG: Populating VIB list from all VIBs in metadata http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip; depots:
2016-04-04T13:42:23Z esxupdate: downloader: DEBUG: Downloading http://vum.lab.local:9084/vum/repository/hostupdate/vmw/vmw-ESXi-6.0.0-metadata.zip to /tmp/tmplZxgm6...
2016-04-04T13:42:23Z esxupdate: Metadata.pyc: INFO: Unrecognized file vendor-index.xml in Metadata file
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR: An unexpected exception was caught:
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR: Traceback (most recent call last):
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/usr/sbin/esxupdate", line 238, in main
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:     cmd.Run()
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/Cmdline.py", line 113, in Run
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 244, in Scan
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 106, in _generateOperationData
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR:   File "/build/mts/release/bora-3620759/bora/build/esx/release/vmvisor/sys-boot/lib/python2.7/site-packages/vmware/esx5update/MetadataScanner.py", line 88, in _getInstallProfile
2016-04-04T13:42:24Z esxupdate: esxupdate: ERROR: AttributeError: 'NoneType' object has no attribute 'Copy'
2016-04-04T13:42:24Z esxupdate: esxupdate: DEBUG: <<<

You might notice the “Unrecognized file vendor-index.xml in Metadata file” error. I also found this error message on the other hosts, so I excluded it from further research. It was unlikely, that this error was related to the observed problem. I started searching differences between the hosts and found out, that the output of “esxcli software vib list” was different on the faulty host.

This is the output on the faulty host:

[[email protected]:~] esxcli software vib list
Name         Version             Vendor  Acceptance Level  Install Date
-----------  ------------------  ------  ----------------  ------------
tools-light  6.0.0-2.34.3620759  VMware  VMwareCertified   2016-04-04
[[email protected]:~]

This is the output on a working host. You see the difference?

[[email protected]:~] esxcli software vib list
Name                           Version                                Vendor           Acceptance Level  Install Date
-----------------------------  -------------------------------------  ---------------  ----------------  ------------
net-tg3                        3.137l.v60.1-1OEM.600.0.0.2494585      BRCM             VMwareCertified   2016-03-03
elxnet                         10.5.121.7-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
ima-be2iscsi                   10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
lpfc                           10.5.70.0-1OEM.600.0.0.2159203         EMU              VMwareCertified   2016-03-03
scsi-be2iscsi                  10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
scsi-lpfc820                   8.2.4.157.70-1OEM.500.0.0.472560       Emulex           VMwareCertified   2016-03-03
hpe-build                      600.9.4.5.11-2494585                   HPE              PartnerSupported  2016-03-03
char-hpcru                     6.0.6.14-1OEM.600.0.0.2159203          Hewlett-Packard  PartnerSupported  2016-03-03
...
...
...
tools-light                    6.0.0-2.34.3620759                     VMware           VMwareCertified   2016-04-04
scsi-qla2xxx                   934.5.20.0-1OEM.500.0.0.472560         qlogic           VMwareCertified   2016-03-03
[[email protected]:~]

Doesn’t look right… I investigated further, still searching for differences. And then I found two empty directories under /var/db/esximg.

[[email protected]:~] ls -l /var/db/esximg/*
/var/db/esximg/profiles:
total 0

/var/db/esximg/vibs:
total 0
[[email protected]:~]

The same directory was populated on other, working hosts.

[[email protected]:~] ls -l /var/db/esximg/*
/var/db/esximg/profiles:
total 20
-r--r--r--    1 root     root         20238 Apr  4 11:23 %28Updated%29%20HPE-ESXi-6.0.0-Update1-iso-600.9.41594098800

/var/db/esximg/vibs:
total 732
-r--r--r--    1 root     root          1704 Apr  4 11:23 ata-pata-amd--1600059064.xml
-r--r--r--    1 root     root          1728 Apr  4 11:23 ata-pata-atiixp--1227646244.xml
-r--r--r--    1 root     root          1719 Apr  4 11:23 ata-pata-cmd64x-782653683.xml
-r--r--r--    1 root     root          1748 Apr  4 11:23 ata-pata-hpt3x2n-852032191.xml
-r--r--r--    1 root     root          1730 Apr  4 11:23 ata-pata-pdc2027x-236283737.xml
...
...
...
-r--r--r--    1 root     root         16416 Apr  4 11:23 vsanhealth--1252089272.xml
-r--r--r--    1 root     root          1726 Apr  4 11:23 xhci-xhci-1668869473.xml
[[email protected]:~]

One possible solution was therefore to copy the missing files to the faulty host. I used SCP for this. To get SCP working, you have to enable the SSH Client in the ESXi firewall.

[[email protected]:/var/db] esxcli network firewall ruleset set --enabled true --ruleset-id=sshClient

After that, I’ve copied the files from a working host to the faulty host. Please make sure that the hosts have the same build! In my case, both hosts had the same build. Don’t try to copy files from an older or newer build to the host!!

[[email protected]:/var/db] scp -r esximg/ [email protected]:/var/db
The authenticity of host 'esx1 (192.168.200.33)' can't be established.
RSA key fingerprint is SHA256:OSzz9Kk4QDRtmj7ed2J+1qcniIhBVJuJVEKf/4+Gry4.
Are you sure you want to continue connecting (yes/no)? yes
Warning: Permanently added 'esx1,192.168.200.33' (RSA) to the list of known hosts.
Password:
sata-sata-sil-1748273158.xml                                                                                                                                                                               100% 1717     1.7KB/s   00:00
lsi-mr3-989864457.xml                                                                                                                                                                                      100% 1751     1.7KB/s   00:00
scsi-ips--1979861494.xml                                                                                                                                                                                   100% 1619     1.6KB/s   00:00
char-hpcru--1874046437.xml                                                                                                                                                                                 100% 1638     1.6KB/s   00:00
scsi-lpfc820--634308064.xml                                                                                                                                                                                100% 1663     1.6KB/s   00:00
net-tg3--917722591.xml                                                                                                                                                                                     100% 1707     1.7KB/s   00:00
ipmi-ipmi-devintf-1862766627.xml                                                                                                                                                                           100% 1719     1.7KB/s   00:00
...
...
...
scsi-aic79xx-757558775.xml                                                                                                                                                                                 100% 1643     1.6KB/s   00:00
hp-ams-1212738556.xml                                                                                                                                                                                      100% 2035     2.0KB/s   00:00
%28Updated%29%20HPE-ESXi-6.0.0-Update1-iso-600.9.41594098800                                                                                                                                               100%   20KB  19.8KB/s   00:00
[[email protected]:/var/db]

And because we are pros, we disable the SSH Client after using it.

[[email protected]:/var/db] esxcli network firewall ruleset set --enabled false --ruleset-id=sshClient

As expected, “esxcli software vib list” was working again.

[[email protected]:~] esxcli software vib list
Name                           Version                                Vendor           Acceptance Level  Install Date
-----------------------------  -------------------------------------  ---------------  ----------------  ------------
net-tg3                        3.137l.v60.1-1OEM.600.0.0.2494585      BRCM             VMwareCertified   2016-03-03
elxnet                         10.5.121.7-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
ima-be2iscsi                   10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
lpfc                           10.5.70.0-1OEM.600.0.0.2159203         EMU              VMwareCertified   2016-03-03
scsi-be2iscsi                  10.5.101.0-1OEM.600.0.0.2159203        EMU              VMwareCertified   2016-03-03
scsi-lpfc820                   8.2.4.157.70-1OEM.500.0.0.472560       Emulex           VMwareCertified   2016-03-03
hpe-build                      600.9.4.5.11-2494585                   HPE              PartnerSupported  2016-03-03
char-hpcru                     6.0.6.14-1OEM.600.0.0.2159203          Hewlett-Packard  PartnerSupported  2016-03-03
...
...
...
tools-light                    6.0.0-2.34.3620759                     VMware           VMwareCertified   2016-04-04
scsi-qla2xxx                   934.5.20.0-1OEM.500.0.0.472560         qlogic           VMwareCertified   2016-03-03
[[email protected]:~]

A rescan operation in the vSphere Client was also successful. It seems that the root cause for the problem were missing files under /var/db/esximg.

Please don’t ask why this has happened. I really have no idea. But VMware KB2043170 (Initializing the VMware vCenter Update Manager database without reinstalling it) isn’t always the solution for “error code 99”, as sometimes written somewhere in the internet. Always try to analyze the problem and try to filter out unlikely and likely solutions.

Data Protector: Exchange backup failes because of database lock

This posting is ~6 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Today I had a customer call, where a Exchange 2010 backup repeatedly failed. HPE Data Protector was unable to create a differential or incremental backup. For each database, the following error was logged:

[Minor] From: OB2BAR_E2010_BAR@exchangeserver.domain.tld "MS Exchange 2010+ Server"  Time: 21.03.2016 20:00:27
[170:313] 	One or more copies of database DATABASE are already being backed up in a different session.

Interestingly, there was no other backup session running. But the night before, the backup jobs failed because of a network failure.

The solution is easy. This error is caused by a wrong information in the Data Protector database. To remove this, open an administrative CMD on the Data Protector Cell Manager and run this omnidbutil command:

C:\Users\Administrator>omnidbutil -free_cell_resources
DONE!

C:\Users\Administrator>

This command  will free up the locked resources in the Data Protector database.Then, run the job again.

ALE OmniSwitch stack does not form due to incompatible licenses

This posting is ~6 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Today I saw an interesting behaviour of two Alcatel-Lucent Enterprise OmniSwitch 6450. Both switches has been configured as a stack, but one of the switches showed a flashing ID after the startup, and the stack was not formed. While I checked the logs and the status of the stack, I noticed that the slot number was incorrect. Furthermore the status showed “INC-LIC”.

-> show stack topology
                                         Link A  Link A          Link B  Link B
NI      Role      State   Saved  Link A  Remote  Remote  Link B  Remote  Remote
                          Slot   State   NI      Port    State   NI      Port
----+-----------+--------+------+-------+-------+-------+-------+-------+-------
   1 PRIMARY     RUNNING    1    UP       1001   StackB  DOWN        0        0
1001 PASS-THRU   INC-LIC    2    DOWN        0        0  UP          1   StackA

-> show log swlog
<snip>
THU MAR 03 13:07:29 2016  STACK-MANAGER    info == SM == Stack Port A Status Changed: DOWN
THU MAR 03 13:07:29 2016  STACK-MANAGER    info == SM == NI 0 down notification sent to LAG
THU MAR 03 13:08:41 2016  STACK-MANAGER    info == SM == Stack Port A Status Changed: UP
THU MAR 03 13:08:41 2016  STACK-MANAGER    info == SM == Stack Port A MAC Frames TX/RX Enabled
THU MAR 03 13:08:42 2016  STACK-MANAGER    info  Retaining Module Id for slot 2 unit 0 as 1
THU MAR 03 13:08:46 2016  STACK-MANAGER    info == SM == An element enters passthru mode (incompatible license)
<snip>

According to the stack status and the switch logs, there seems to be a problem with the licenses. So I checked the installed licenses on both switches. On switch showed Metro license:

-> show license info
 NI          Application           License Type                Time Left
-----------+---------------------+---------------------------+--------------
 1              METRO                 Permanent                   0
 1              10G                   Permanent                   0
 1001           10G                   Permanent                   0

The other switch not:

 -> show license info
 NI          Application           License Type                Time Left
-----------+---------------------+---------------------------+--------------
 1              10G                   Permanent                   0

Don’t be confused because of the slot numbering. I pulled the stacking cable.

The solution was easy: I removed the metro license and after a reboot of the switch, from which I removed the license, the stack formed properly.

-> show stack topology
                                         Link A  Link A          Link B  Link B
NI      Role      State   Saved  Link A  Remote  Remote  Link B  Remote  Remote
                          Slot   State   NI      Port    State   NI      Port
----+-----------+--------+------+-------+-------+-------+-------+-------+-------
   1 SECONDARY   RUNNING    1    UP          2   StackB  DOWN        0        0
   2 PRIMARY     RUNNING    2    DOWN        0        0  UP          1   StackA

Using VCSA as remote syslog – Don’t forget the log rotation!

This posting is ~6 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.
Important note: It seems that vCenter Server Appliance updates revert the changes. Please check the settings after each update!

The VMware vCenter Server Appliance (VCSA) can act as a remote syslog destition for ESXi hosts. This is very handy for troubleshooting and I really recommend to use this feature.  But VMware ESXi hosts can be really chatty and therefore it’s a good idea to keep an eye on the free disk space of the VCSA.

Yesterday, a colleague had an interesting support case. A customer reported that his Veeam Backup & Replication jobs failed and that he was unable to login to the vCenter with the vSphere Client and vSphere Web Client. My colleague checked the VCSA VM and noticed that the VPXD failed to start (“Waiting for vpxd to initialize: ….failed”). Together we checked the appliance and the log files. The vpxd.log (/var/log/vmware/vpx) was updated weeks ago, but the last entry was interesting: No space left on device. But there was free disk space on /storage/log. I immediately checked the inode count with df -i and there it was: No free inodes. Why is this a problem? Each name entry in the file system consumes an inode. If there are no free inodes, no new directories and files can be created. The error message is the same as for missing disk space. Something had to have created a lot of files on /storage/log. Because /var/log/vmware is a symbolic linkt to /storage/log/vmware, it had to be something on the /storage/log partition. We checked the remote syslog location under /storage/log/remote and found gigabytes and an incredible number of logs. After removing the logs, the VPXD was able to start and the inode count was on a normal level.

But why were there so many logs? We checked the logrotate config and found a faulty config for the remote syslog files. Instead of rotating logs and remove old ones, this config rotated all logs every day and potentiated the number of logs. Please note that there is no logrotate config to rotate remote syslog files by default! This one was added manually.

This is the default config for the remote syslog-collector of the VCSA:

destination log_remote {
            file("/var/log/remote/$HOST_FROM/$YEAR-$MONTH/messages-$YEAR-$MONTH-$DAY"
            create_dirs(yes) frac-digits(3)
            template("$ISODATE $PROGRAM $MSGONLY\n")
            template_escape(no)
            );
};

As you can see, with these settings a folder for each host and each month is created. According to this VMTN posting, we changed the syslog-collector config a bit:

destination log_remote {
            file("/var/log/remote/$HOST_FROM/messages"
            create_dirs(yes) frac-digits(3)
            template("$ISODATE $PROGRAM $MSGONLY\n")
            template_escape(no)
            );
};

With this settings, only a single file per host is created. We made also a change to /etc/logrotate.d/syslog and added this at the end:

/var/log/remote/*/messages {
  daily
  compress
  delaycompress
  rotate 30
  postrotate
    /etc/init.d/syslog-collector reload > /dev/null
  endscript
}

With this configuration 30 log files will be preserved. The number of log files or how often log rotation should happen (weekly or daily) can easily be adjusted. But these settings should be sufficient for small environments.

It’s important to understand that the VCSA has different disks and that the disks are mountend to different mount points within the root filesystem. This is from a vSphere 5.5 VCSA:

vcsa1:~ # mount
/dev/sda3 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw)
udev on /dev type tmpfs (rw,mode=0755)
tmpfs on /dev/shm type tmpfs (rw,mode=1777)
devpts on /dev/pts type devpts (rw,mode=0620,gid=5)
/dev/sda1 on /boot type ext3 (rw,noexec,nosuid,nodev,noacl)
/dev/sdb1 on /storage/core type ext3 (rw,nosuid,nodev)
/dev/sdb2 on /storage/log type ext3 (rw,nosuid,nodev)
/dev/sdb3 on /storage/db type ext3 (rw,nosuid,nodev)

/var/log/vmware and /var/log/remote are links to /storage/log/vmware and /storage/log/remote. Make sure that there is always enough free diskspace on ALL disks! I also want to highlight VMware KB2092127 (After upgrading to vCenter Server Appliance 5.5 Update 2, pg_log file reports this error: WARNING: there is already a transaction in progress). This error hit me a couple of times…

Chicken-and-egg problem: 3PAR VSP 4.3 MU1 & 3PAR OS 3.2.1 MU3

This posting is ~6 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Since monday I’m helping a customer to put two HP 3PAR StoreServ 7200c into operation. Both StoreServs came factory-installed with 3PAR OS 3.2.1 MU3, which is available since July 2015. Usually, the first thing you do is to deploy the 3PAR Service Processor (SP). These days this is (in most cases) a Virtual Service Processor (VSP). The SP is used to initialize the storage system. Later, the SP reports to HP and it’s used for maintenance tasks like shutdown the StoreServ, install updates and patches. There are only a few cases in which you start the Out-of-the-Box (OOTB) procedure of the StoreServ without having a VSP. I deployed two (one VSP for each StoreServ) VSPs, started the Service Processor Setup Wizard, entered the StoreServ serial number and got this message:

3par_vsp_error

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

“No uninitialized storage system with the specified serial number could be found”. I double checked the network setup, VLANs, switch ports etc. The error occured with BOTH VSPs and BOTH StoreServs. I started the OOTB on both StoreServs using the serial console. My plan was to import the StoreServs later into the VSPs. To realize this, I tried was to setup the VSP using the console interface. I logged in as root (no password) and tried the third option: Setup SP with original SP ID.

3par_vsp_error_console

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

Not the worst idea, but unsuccessful. I entered the SP ID, SP networking details, a lot other stuff, the serial number of the StoreServ, the IP address, credentials finally got this message:

StoreServ HP 3PAR OS version validation failed - unable to retrieve StoreServ's HP 3PAR OS version.

Hmm… I knew that P003 was mandatory for the VSP 4.3 MU1 and 3PAR OS 3.2.1 MU3. But could cause the missing patch this behaviour? I called HP and explained my guess. After a short remote session this morning, the support case was escalated to the 2nd level. While waiting for the 2nd level support, I was thinking about a solution. I knew that earlier releases of the VSP doesn’t check the serial number of the StoreServ or the version of the 3PAR OS. So I grabbed a copy of the VSP 4.1 MU2 with P009 and deployed the VSP. This time, I was able to finish the “Moment of Birth” (MOB). This release also asked for the serial number, the IP address and login credentials, but it didn’t checked the version of the 3PAR OS (or it doesn’t care if it’s unknown). At this point I had a functional SP running software release 4.1 MU2. I upgraded the SP to 4.3 MU1 with the physical SP ISO image and installed P003 afterwards. Now I was able to import the StoreServ 7200c with 3PAR OS 3.2.1 MU3.

I don’t know how HP covers this during the installation service. AFAIK there is no VSP 4.3 MU1 with P003 available and I guess HP ships all new StoreServs with 3PAR OS 3.2.1 MU3. If you upgrade from an earlier 3PAR OS release, please make sure that you install P003 before you update the 3PAR OS. The StoreServ Refresh matrix clearly says that P003 is mandatory. The release notes for the HP 3PAR Service Processor (SP) Software SP-4.3.0 MU1 P003 also indicate this:

SP-4.3.0.GA-24 P003 is a mandatory patch for SP-4.3.0.GA-24 and 3.2.1.MU3.

I’m excited to hear from the HP 2nd level support. I will update this blog post if I have more information.

EDIT

Together with the StoreServ 8000 series, HP released a new version of the 3PAR Service Processor. The new version 4.4 is necessary for the new StoreServ models, but it also supports 3PAR OS < 3.2.2 (which is the GA release for the new StoreServ models). So if you get a new StoreServ 7000 with 3PAR OS 3.2.1 MU3, simply deploy a SP version 4.4.

Microsoft Exchange 2013 shows blank ECP & OWA after changes to SSL certificates

This posting is ~6 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.
EDIT
This issue is described in KB2971270 and is fixed in CU6.

I ran a couple of times in this error. After applying changes to SSL certificates (add, replace or delete a SSL certificate) and rebooting the server, the event log is flooded with events from source “HttpEvent” and event id 15021. The message says:

An error occurred while using SSL configuration for endpoint 0.0.0.0:444. The error status code is contained within the returned data.

If you try to access the Exchange Control Panel (ECP) or Outlook Web Access (OWA), you will get a blank website. To solve this issue, open up an elevated command prompt on your Exchange 2013 server.

C:\windows\system32>netsh http show sslcert

SSL Certificate bindings:
-------------------------

    IP:port                      : 0.0.0.0:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:444
    Certificate Hash             : a80c9de605a1525cd252c250495b459f06ed2ec1
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:8172
    Certificate Hash             : 09093ca95154929df92f1bee395b2670a1036a06
    Application ID               : {00000000-0000-0000-0000-000000000000}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 127.0.0.1:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

Check the certificate hash and appliaction ID for 0.0.0.0:443, 0.0.0.0:444 and 127.0.0.1:443. You will notice, that the application ID for this three entries is the same, but the certificate hash for 0.0.0.0:444 differs from the other two entries. And that’s the point. Remove the certificate for 0.0.0.0:444.

C:\windows\system32>netsh http delete sslcert ipport=0.0.0.0:444

SSL Certificate successfully deleted

Now add it again with the correct certificate hash and application ID.

C:\windows\system32>netsh http add sslcert ipport=0.0.0.0:444 certhash=1ec7413b4fb1782b4b40868d967161d29154fd7f appid="{4dc3e181-e14b-4a21-b022-59fc669b0914}"

SSL Certificate successfully added

That’s it. Reboot the Exchange 2013 server and everything should be up and running again.