Tag Archives: troubleshooting

Checking the 3PAR Quorum Witness appliance

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

Two 3PAR StoreServs running in a Peer Persistence setup lost the connection to the Quorum Witness appliance. The appliance is an important part of a 3PAR Peer Persistence setup, because it acts as a tie-breaker in a split-brain scenario.

While analyzing this issue, I saw this message in the 3PAR Management Console:

[Screenshot: 3PAR Quorum Witness status in the 3PAR Management Console]

In addition to that, the customer got e-mails that the 3PAR StoreServ arrays lost the connection to the Quorum Witness appliance. In my case, the CouchDB process died. A restart of the appliance brought it back online.

How to check the Quorum Witness appliance?

You can check the status of the appliance with a simple web request. The documentation shows a simple test based on curl, which you can run directly from the bash shell of the appliance.
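
A minimal sketch of such a check, assuming the appliance answers on its default TCP port 8080 (the exact call in the documentation may differ slightly):

# A healthy appliance should answer with the CouchDB welcome message
curl http://<ip-of-quorum-witness>:8080/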

But you can also use the PowerShell cmdlet Invoke-WebRequest.
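
A rough equivalent from a Windows box, again assuming port 8080:

# Same check with PowerShell instead of curl
Invoke-WebRequest -Uri http://<ip-of-quorum-witness>:8080/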

If you add /witness to the URL, you can test access to the database that is used for Peer Persistence.
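
Again sketched with the assumed port:

# Check access to the witness database used for Peer Persistence
curl http://<ip-of-quorum-witness>:8080/witness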

If you get a connection error, check if the beam process is running.
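
The beam process is the Erlang VM that runs CouchDB. A quick way to check it from the appliance shell:

# Should list at least one beam/beam.smp process
ps -ef | grep -i beam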

If it is not, reboot the appliance. This can be done without downtime, because the appliance only comes into play when a failover occurs.

Replacing an expired lookup service SSL certificate on a vSphere PSC

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

A few days ago, I ran into a very nasty problem. Fortunately, it was in my lab. Some months ago, I had replaced the certificates of my vCenter Server Appliance (VCSA), and I had chosen to use the VMware Certificate Authority (VMCA) as a subordinate of my AD-based enterprise CA, i.e. the VMCA was used as an intermediate CA. The certificates were replaced using the vSphere 6.0 Certificate Manager (/usr/lib/vmware-vmca/bin/certificate-manager), and I followed the instructions of KB2112016 (Configuring VMware vSphere 6.0 VMware Certificate Authority as a subordinate Certificate Authority).

The VCSA was migrated from vSphere 5.5, where I had also been using custom certificates. These certificates were also issued by my AD-based enterprise CA and were migrated during the vSphere 5.5 > 6.0 migration. So in the end, I replaced custom certificates with VMCA (as an intermediate CA) certificates.

Everything was fine, until a power outage. After powering on my VMs, I noticed several errors. After logging into the vSphere Web Client, I got an error message at the top of the page:

While searching for the cause, I checked the URL of the Platform Services Controller (https://vcsa1.lab.local/psc/login) and got this:

[Screenshot: error message shown on the PSC login page]

This error led me to KB2144086 (Updating certificates using certificate manager on vCenter Server or PSC 6.0 Update 1b fails), but I was able to prove that I had used different subject names for the different solution user certificates.

While digging in the PSC logs, I found this error in the /var/log/vmware/psc-client/psc-client.log:

Finally, I found Aaron Smith's blog post "Troubleshooting Expired PSC Certificates with vSphere 6", in which he described the same problem. I checked the certificate of the Lookup Service and there it was:

[Screenshot: the expired Lookup Service certificate]

This was the original custom certificate, issued by my AD-based enterprise CA, and installed on my vSphere 5.5 VCSA.

Aaron also offered the solution by referencing KB2118939 (Replacing the Lookup Service SSL certificate on a Platform Services Controller 6.0). I followed the instructions in KB2118939 and replaced the certificate of the Lookup Service with a certificate of the VMCA.

Take care of your certificates

With vSphere 6.0, the Lookup Service should be accessed through the HTTP reverse proxy, which uses the machine certificate. Therefore, an expired Lookup Service certificate is not immediately obvious. If you connect directly to the Lookup Service on port 7444, you will see the expired certificate. The Lookup Service certificate is also not replaced when you replace the different solution user certificates.
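
One way to inspect the certificate the Lookup Service presents on port 7444, sketched with openssl (vcsa1.lab.local is the PSC in my lab):

# Show subject, issuer and expiration date of the Lookup Service certificate
echo | openssl s_client -connect vcsa1.lab.local:7444 2>/dev/null | openssl x509 -noout -subject -issuer -enddate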

If you have a vSphere 6.0 VCSA that was migrated from vSphere 5.5, and you had replaced the certificates on that vSphere 5.5 VCSA with custom certificates, you should check your Lookup Service certificate immediately! Follow KB2118939 for further instructions.

Credit to Aaron Smith for this blog post. Thank you!

Solving problems: A structured approach

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

What is a problem? A problem is an obstacle that has to be surmounted. Solving a problem means overcoming obstacles. Or, more generally: problem solving is the process of getting from an unsatisfactory situation to a satisfactory one.

Most of us get paid for solving problems. It's irrelevant whether you are paid to solve technical problems (e.g. "My computer doesn't work") or to create solutions for customers (e.g. designing the infrastructure for a Citrix XenApp farm). In the end, you solve a problem.

Every problem has characteristics that can be used to describe it.

  • Solubility

Not every problem is solvable. Think about "squaring the circle". But often a problem seems to be unsolvable because it's not well defined. If the initial situation, the obstacle, and the target situation are not clearly formulated, you won't be able to solve the problem.

  • Decomposability

If you can decompose a problem into multiple subproblems, it is a hierarchical problem. Otherwise, it’s an elemental problem.

  • Effort

The effort to solve a problem is always different. A problem may be theoretically solvable, but require such a high effort that it is practically unsolvable.

  • Subjectivity

Even if a problem is well defined, its complexity appears different to different people.

How to start?

First of all

  • Understand and define the problem

This is the most important part. Before you try to solve a problem, make sure that you have really understood it. Then you should define the problem. Only a clearly defined problem can be solved, and it's much easier to solve a clearly defined problem than a vague one. If it's a complex problem, you should then try to

  • Simplify or decompose the problem

A simplified problem can help you stay focused. If you can't simplify a problem, you can try to divide it into subproblems. With a clearly defined, simplified/structured problem, you can start to

  • Find the root cause

Collecting information is the key. Collect information about what happened before, during, and after the problem occurred. Identifying the root cause of a problem can be a time-consuming task. But let me say this clearly: information is the key. Information that helps to find the root cause includes not only observations (e.g. logs, error messages, etc.), but also the results of systematic tests. Collect as much data as you can.

Sometimes it can be useful to create a hypothesis.

Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories.

If you see that system A is affected, while system B should be affected too but isn't, it might be time to create a hypothesis. With a hypothesis in mind, you can try to prove it. Test the hypothesis by performing tests and collecting data. This strategy is called "hypothesis testing".

At some point, you should have identified the root cause. With the now known root cause, you can

  • Create solutions and select the best one

Sometimes it's easy. But sometimes it's not that easy. A trade-off analysis can help to identify the best of multiple solutions.

  • Create an action plan

Even if you only have to disable a specific feature, it's a good idea to formulate an action plan, even if it consists of only three lines. You should state clearly:

  • WHAT you do,
  • WHY you have to do it, and
  • HOW you plan to check it

With these steps, you should be well prepared. It doesn’t matter what kind of problem you are trying to solve: The process is basically the same.

Other problem solving methods

Over the years, many problem solving methods have been developed. Kepner-Tregoe is one of them. Other well-known methods are:

  • A3 Problem Solving
  • PDCA
  • Eight Disciplines (8D) Problem Solving
  • Failure mode and effects analysis (FMEA)

A3 Problem Solving was developed at Toyota for the Toyota Production System (TPS) and is an often-used method in Lean Manufacturing. A3 helps to solve problems by prescribing a structure (WHAT IS and WHAT IS NOT the problem, describe the problem, root cause, solution, etc.). This structure is placed on an A3 sheet of paper (that's why it's called A3). The process is based on the principles of Deming's PDCA cycle.

PDCA, or Plan-Do-Check-Act (sometimes called the Shewhart cycle), was made popular by Dr. W. Edwards Deming. Plan-Do-Check-Act refers to the four phases of this cycle.

  • Plan: Plan the change
  • Do: Implement the change
  • Check: Check the success of the implemented change
  • Act: Take action based on the results of "Check"

Eight Disciplines (8D) Problem Solving was developed by the Ford Motor Company. The D0 phase is the starting point for the 8D process, but it's not counted.

  • D0:  Plan for solving the problem and determine the prerequisites
  • D1: Establish a team of people with the required skills and knowledge
  • D2: Describe the problem
  • D3: Define and implement containment actions
  • D4: Determine and verify the root causes
  • D5: Plan permanent corrective actions for the observed problem
  • D6: Implement the best permanent corrective actions
  • D7: Modify management systems to prevent a recurrence
  • D8: Congratulate your team!

The Failure Mode and Effects Analysis (FMEA) is a highly structured, systematic approach for failure analysis. There are different FMEA analyses:

  • Functional
  • Design
  • Process

FMEA is based on inductive reasoning (forward logic) and follows a highly structured process, which can be represented as follows.

  • Structural analysis: A system is divided into its components
  • Functional analysis: Identify the function of each component
  • Failure analysis: Identify the possible failures for each component
  • Calculate the risk: Risk Priority Number (RPN) = occurrence ranking x detection ranking x highest severity ranking (see the short example after this list)
  • Optimize: Optimize the component to mitigate the risk
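
A quick worked example for the Risk Priority Number, assuming the common 1-10 ranking scales for each factor: with an occurrence ranking of 4, a detection ranking of 6 and a (highest) severity ranking of 8, the RPN is 4 x 6 x 8 = 192, out of a possible range of 1 to 1000. The higher the RPN, the more urgent the optimization.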

No matter what, stay organized

The key to solving problems successfully is to stay organized. Solving problems isn't magic. It is a very structured process that gets better with increasing experience. Try to create your own structured method, or use one of the problem solving methods mentioned above. But in general:

  • Always try to describe a problem
  • Try to simplify or break it into smaller problems
  • Search for and verify the root cause
  • Develop a solution

Windows receives wrong DNS server from DHCP after DHCPINFORM

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

Last week, I was booked on short notice by a customer who had observed a problem in his network. Unfortunately, colleagues had worked on this network some days before (moving servers, routers, etc. to a new pair of HP 7509 core switches).

It quickly became clear that some of the clients had received the wrong DNS servers from the DHCP server. The environment is a bit unusual. The customer is running two Active Directory domains (root and sub domain) in a single layer 2 broadcast domain. This is nothing unusual, but he is also running two DHCP servers in the same layer 2 broadcast domain. To get this working, the customer uses exclusion ranges and reservations. This guarantees that the clients receive the correct DHCP information.

Observations

As mentioned, some of the clients had received the wrong DNS servers from the DHCP server. This is the (defaced) output of a SUBDOM client with the correct IP settings:

And this is the output of the same client, after a reboot:

As you can see, the client got its DHCP information from the same DHCP server, but with the wrong DNS settings. The DNS servers are the servers from the ROOTDOM.

  • only clients from SUBDOM were affected
  • only Windows XP and Windows 7 clients were affected
  • Windows 8.1 and Windows 10 clients were not affected
  • Igel Thin Clients were not affected
  • after an “ipconfig /renew”, the correct DNS servers were registered
  • after an “ipconfig /release” and “ipconfig /renew”, the wrong DNS servers were registered
  • the same happened after a reboot
  • a Wireshark packet trace showed that the correct DNS information was included in the DHCPOFFER

The sum of the observations told me that this had little (or nothing) to do with the network changes. Interestingly, the correct DNS information was included in the DHCPOFFER, and the behaviour was only observed on Windows XP and Windows 7. In addition, only clients of the SUBDOM were affected.

The smoking gun

The packet trace with Wireshark showed that the correct DNS information was included in the DHCPOFFER. But I also saw that the client had sent a DHCPINFORM, which was answered by all available DHCP servers.

[Screenshot: Wireshark trace showing the DHCPINFORM request and the DHCPACK replies]

DHCPINFORM is used by a client to discover more information, e.g. router, proxy, static routes… or DNS. The DHCPINFORM request was only sent after a reboot, or after a DHCPRELEASE and a subsequent DHCPDISCOVER. I saw that all available DHCP servers answered the DHCPINFORM request with a DHCPACK, and this DHCPACK included the requested information, including DNS. I quickly developed the hypothesis that the client used the DHCPACK from the ROOTDOM DHCP servers to add the wrong DNS information to its configuration.

With this information, I quickly found references (Meraki, Laurent Gaffié) to a registry key that can be used to disable DHCPINFORM. This registry key is valid for Windows 2000, 2003, Windows XP, Windows Vista and Windows 7. Especially the blog post from Laurent Gaffié is interesting:

A vulnerability in Windows DHCP (http://www.ietf.org/rfc/rfc2131.txt) was found on Windows OS versions ranging from Windows 2000 through to Windows server 2003.  This vulnerability allows an attacker to remotely overwrite DNS, Gateway, IP Addresses, routing, WINS server, WPAD, and server configuration with no user interaction.

It’s useful to disable DHCPINFORM, even if you don’t have a problem!

Disable DHCPINFORM

To disable DHCPINFORM, you must add a registry key for each network interface that shouldn't send DHCPINFORM messages.

Unfortunately, the GUID of the interface differs between clients. I built this Visual Basic script (like Dr. Frankenstein: different sources, plugged together, but it works) to add the registry key. You can run this script as part of a startup script with a Group Policy.
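
A rough PowerShell sketch of what such a script has to do - enumerate all interface GUIDs and set the value on every DHCP-enabled interface - could look like this (the author's script is a Visual Basic script; the value name itself is the one documented in the referenced articles):

# Rough sketch only - the value name must be taken from the referenced articles
$valueName = 'REPLACE_WITH_DOCUMENTED_VALUE_NAME'
$base = 'HKLM:\SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces'

Get-ChildItem -Path $base | ForEach-Object {
    $props = Get-ItemProperty -Path $_.PSPath
    # Only touch interfaces that actually use DHCP
    if ($props.EnableDHCP -eq 1) {
        New-ItemProperty -Path $_.PSPath -Name $valueName -Value 1 -PropertyType DWord -Force | Out-Null
    }
}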

You should test this script very carefully! I provide this script “AS IS” with no warranties.

I don't know why this happened. I assume that the customer had had this problem for some time, but due to some strange effects, he never noticed it. One hypothesis is that the sequence of the DHCPACK messages after a DHCPINFORM has an influence. The DHCP server of the ROOTDOM was moved to the new core switches, and maybe this changed the sequence of the answer packets. But it's only a hypothesis, not a theory.

VMware Update Manager reports “error code 99” during scan operation

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

After updating my lab to VMware vSphere 6.0 U2, one of my hosts continuously threw an error during an update scan.

The first thing I checked was the esxupdate.log on the affected ESXi host. This is the output that was logged during a scan operation:

You might notice the "Unrecognized file vendor-index.xml in Metadata file" error. I also found this error message on the other hosts, so I excluded it from further research; it was unlikely that this error was related to the observed problem. I started searching for differences between the hosts and found out that the output of "esxcli software vib list" was different on the faulty host.

This is the output on the faulty host:

This is the output on a working host. Do you see the difference?

Doesn’t look right… I investigated further, still searching for differences. And then I found two empty directories under /var/db/esximg.

These directories were populated on other, working hosts.

One possible solution was therefore to copy the missing files to the faulty host. I used SCP for this. To get SCP working, you have to enable the SSH Client in the ESXi firewall.
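
On ESXi 6.x this can be done with esxcli; the ruleset in question is called sshClient:

# Allow outgoing SSH/SCP connections from this host
esxcli network firewall ruleset set --enabled true --ruleset-id sshClient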

After that, I copied the files from a working host to the faulty host. Please make sure that the hosts have the same build! In my case, both hosts had the same build. Don't try to copy files from an older or newer build to the host!
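
Something along these lines, executed on the working host (the host name is a placeholder):

# Copy the image database metadata to the faulty host
scp -r /var/db/esximg/* root@<faulty-host>:/var/db/esximg/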

And because we are pros, we disable the SSH Client after using it.
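
For example:

# Disable outgoing SSH/SCP connections again
esxcli network firewall ruleset set --enabled false --ruleset-id sshClient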

As expected, “esxcli software vib list” was working again.

A rescan operation in the vSphere Client was also successful. It seems that the root cause of the problem was the missing files under /var/db/esximg.

Please don't ask why this happened. I really have no idea. But VMware KB2043170 (Initializing the VMware vCenter Update Manager database without reinstalling it) isn't always the solution for "error code 99", as is sometimes claimed on the internet. Always try to analyze the problem and separate likely from unlikely solutions.

Data Protector: Exchange backup fails because of database lock

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

Today I had a customer call where an Exchange 2010 backup repeatedly failed. HPE Data Protector was unable to create a differential or incremental backup. For each database, the following error was logged:

Interestingly, there was no other backup session running. But the night before, the backup jobs failed because of a network failure.

The solution is easy. This error is caused by wrong information in the Data Protector database. To remove it, open an administrative CMD on the Data Protector Cell Manager and run this omnidbutil command:
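
Presumably this one, which frees resources that are still marked as locked (a sketch):

omnidbutil -free_locked_devs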

This command will free up the locked resources in the Data Protector database. Then run the job again.

ALE OmniSwitch stack does not form due to incompatible licenses

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

Today I saw an interesting behaviour of two Alcatel-Lucent Enterprise OmniSwitch 6450 switches. Both switches had been configured as a stack, but one of the switches showed a flashing ID after startup, and the stack was not formed. While I checked the logs and the status of the stack, I noticed that the slot number was incorrect. Furthermore, the status showed "INC-LIC".
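
On AOS 6.x, the stack status can be checked like this (a sketch; the output lists the slot numbers and roles of the stack elements):

-> show stack topology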

According to the stack status and the switch logs, there seemed to be a problem with the licenses. So I checked the installed licenses on both switches. One switch showed a metro license:

The other switch did not:

Don't be confused by the slot numbering; I had pulled the stacking cable.

The solution was easy: I removed the metro license, and after a reboot of the switch from which I had removed the license, the stack formed properly.

Using VCSA as remote syslog – Don’t forget the log rotation!

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.
Important note: It seems that vCenter Server Appliance updates revert the changes. Please check the settings after each update!

The VMware vCenter Server Appliance (VCSA) can act as a remote syslog destination for ESXi hosts. This is very handy for troubleshooting, and I really recommend using this feature. But VMware ESXi hosts can be really chatty, and therefore it's a good idea to keep an eye on the free disk space of the VCSA.

Yesterday, a colleague had an interesting support case. A customer reported that his Veeam Backup & Replication jobs failed and that he was unable to log in to the vCenter with the vSphere Client and the vSphere Web Client. My colleague checked the VCSA VM and noticed that the VPXD failed to start ("Waiting for vpxd to initialize: ….failed").

Together we checked the appliance and the log files. The vpxd.log (/var/log/vmware/vpx) had last been updated weeks ago, but the last entry was interesting: no space left on device. But there was free disk space on /storage/log. I immediately checked the inode count with df -i and there it was: no free inodes. Why is this a problem? Each name entry in the file system consumes an inode. If there are no free inodes, no new directories and files can be created, and the error message is the same as for missing disk space.

Something had to have created a lot of files under /storage/log. Because /var/log/vmware is a symbolic link to /storage/log/vmware, it had to be something on the /storage/log partition. We checked the remote syslog location under /storage/log/remote and found gigabytes of logs in an incredible number of files. After removing the logs, the VPXD was able to start and the inode count returned to a normal level.
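
Two quick checks that make such an inode problem visible (a sketch):

# Show inode usage instead of block usage
df -i /storage/log

# Count the files below the remote syslog location
find /storage/log/remote -type f | wc -l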

But why were there so many logs? We checked the logrotate config and found a faulty config for the remote syslog files. Instead of rotating the logs and removing old ones, this config rotated all logs every day and thereby multiplied the number of files. Please note that by default there is no logrotate config for the remote syslog files! This one had been added manually.

This is the default config for the remote syslog-collector of the VCSA:

As you can see, with these settings a folder for each host and each month is created. According to this VMTN posting, we changed the syslog-collector config a bit:

With these settings, only a single file per host is created. We also made a change to /etc/logrotate.d/syslog and added this at the end:
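
A sketch of what such a stanza can look like (the path pattern has to match the syslog-collector settings above):

/storage/log/remote/*/*.log {
    missingok
    notifempty
    compress
    weekly
    rotate 30
}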

With this configuration 30 log files will be preserved. The number of log files or how often log rotation should happen (weekly or daily) can easily be adjusted. But these settings should be sufficient for small environments.

It's important to understand that the VCSA has several disks and that these disks are mounted to different mount points within the root filesystem. This is from a vSphere 5.5 VCSA:

/var/log/vmware and /var/log/remote are links to /storage/log/vmware and /storage/log/remote. Make sure that there is always enough free disk space on ALL disks! I also want to highlight VMware KB2092127 (After upgrading to vCenter Server Appliance 5.5 Update 2, pg_log file reports this error: WARNING: there is already a transaction in progress). This error has hit me a couple of times…

Chicken-and-egg problem: 3PAR VSP 4.3 MU1 & 3PAR OS 3.2.1 MU3

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.

Since Monday, I have been helping a customer put two HP 3PAR StoreServ 7200c systems into operation. Both StoreServs came factory-installed with 3PAR OS 3.2.1 MU3, which has been available since July 2015. Usually, the first thing you do is deploy the 3PAR Service Processor (SP). These days this is (in most cases) a Virtual Service Processor (VSP). The SP is used to initialize the storage system. Later, the SP reports to HP, and it's used for maintenance tasks like shutting down the StoreServ or installing updates and patches. There are only a few cases in which you start the Out-of-the-Box (OOTB) procedure of the StoreServ without having a VSP. I deployed two VSPs (one for each StoreServ), started the Service Processor Setup Wizard, entered the StoreServ serial number and got this message:

[Screenshot: error message in the Service Processor Setup Wizard]

"No uninitialized storage system with the specified serial number could be found." I double-checked the network setup, VLANs, switch ports, etc. The error occurred with BOTH VSPs and BOTH StoreServs. I started the OOTB procedure on both StoreServs using the serial console. My plan was to import the StoreServs into the VSPs later. To realize this, I tried to set up the VSP using the console interface. I logged in as root (no password) and tried the third option: Setup SP with original SP ID.

[Screenshot: VSP text console setup menu]

Not the worst idea, but unsuccessful. I entered the SP ID, the SP networking details, a lot of other stuff, the serial number of the StoreServ, the IP address and credentials, and finally got this message:

Hmm… I knew that P003 was mandatory for the VSP 4.3 MU1 and 3PAR OS 3.2.1 MU3. But could the missing patch cause this behaviour? I called HP and explained my guess. After a short remote session this morning, the support case was escalated to the 2nd level. While waiting for the 2nd level support, I was thinking about a solution. I knew that earlier releases of the VSP don't check the serial number of the StoreServ or the version of the 3PAR OS. So I grabbed a copy of the VSP 4.1 MU2 with P009 and deployed that VSP. This time, I was able to finish the "Moment of Birth" (MOB). This release also asked for the serial number, the IP address and login credentials, but it didn't check the version of the 3PAR OS (or it doesn't care if the version is unknown). At this point I had a functional SP running software release 4.1 MU2. I upgraded the SP to 4.3 MU1 with the physical SP ISO image and installed P003 afterwards. Now I was able to import the StoreServ 7200c with 3PAR OS 3.2.1 MU3.

I don’t know how HP covers this during the installation service. AFAIK there is no VSP 4.3 MU1 with P003 available and I guess HP ships all new StoreServs with 3PAR OS 3.2.1 MU3. If you upgrade from an earlier 3PAR OS release, please make sure that you install P003 before you update the 3PAR OS. The StoreServ Refresh matrix clearly says that P003 is mandatory. The release notes for the HP 3PAR Service Processor (SP) Software SP-4.3.0 MU1 P003 also indicate this:

SP-4.3.0.GA-24 P003 is a mandatory patch for SP-4.3.0.GA-24 and 3.2.1.MU3.

I’m excited to hear from the HP 2nd level support. I will update this blog post if I have more information.

EDIT

Together with the StoreServ 8000 series, HP released a new version of the 3PAR Service Processor. The new version 4.4 is necessary for the new StoreServ models, but it also supports 3PAR OS < 3.2.2 (which is the GA release for the new StoreServ models). So if you get a new StoreServ 7000 with 3PAR OS 3.2.1 MU3, simply deploy SP version 4.4.

Microsoft Exchange 2013 shows blank ECP & OWA after changes to SSL certificates

This posting is ~3 years old. You should keep this in mind. IT is a fast-moving business, so this information might be outdated.
EDIT
This issue is described in KB2971270 and is fixed in CU6.

I have run into this error a couple of times. After applying changes to SSL certificates (adding, replacing or deleting an SSL certificate) and rebooting the server, the event log is flooded with events from source "HttpEvent" and event ID 15021. The message says:

If you try to access the Exchange Control Panel (ECP) or Outlook Web Access (OWA), you will get a blank website. To solve this issue, open up an elevated command prompt on your Exchange 2013 server.
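
There you can list the current SSL bindings with netsh:

netsh http show sslcert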

Check the certificate hash and application ID for 0.0.0.0:443, 0.0.0.0:444 and 127.0.0.1:443. You will notice that the application ID for these three entries is the same, but the certificate hash for 0.0.0.0:444 differs from the other two entries. And that's the point. Remove the certificate binding for 0.0.0.0:444.
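
For example:

netsh http delete sslcert ipport=0.0.0.0:444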

Now add it again with the correct certificate hash and application ID.
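
Use the certificate hash and application ID that are shown for 0.0.0.0:443 (placeholders below):

netsh http add sslcert ipport=0.0.0.0:444 certhash=<hash-from-0.0.0.0:443> appid="{app-id-from-0.0.0.0:443}"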

That’s it. Reboot the Exchange 2013 server and everything should be up and running again.