Tag Archives: troubleshooting

Exchange receive connector rejects incoming connections

This posting is ~1 year old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

As part of a bigger Microsoft Exchange migration, one of my customers moved the inbound and outbound mail flow to a newly installed mail relay cluster. Because the customer also switched ISPs, we modified the MX records to move the mail flow to the new mail relay. While we changed the MX records for ~40 domains, and more and more mail arrived through the new mail relay cluster, we noticed events from MSExchangeTransport (event ID 1021):

192.168.xxx.xxx is the mail relay cluster, which is used for the in- and outbound mailflow.

This event indicates that the remote server has reached the maximum number of simultaneous incoming connections that a single source may open to the receive connector. This value is specified by the MaxInboundConnectionPerSource parameter, and the default value is 20. You can easily increase the value using the Set-ReceiveConnector cmdlet.

Microsoft has decreased this value over time. It was 100 in Exchange 2007, but 20 since Exchange 2010.

Roaming of AppData\Local breaks Windows 10 Start Menu

This posting is ~2 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

One of my customers has started a project to create a Windows 10 Enterprise (LTSB 2016) master for their VMware Horizon View environment. Besides the fact (okay, it is more of a personal feeling) that Windows 10 is a real PITA for VDI, I noticed an interesting issue during tests.

The issue

For convenience, I adopted some settings of the current Persona Management GPO for Windows 7 for the new Windows 10 environment. During the tests, the customer and I noticed a strange behaviour: After login, the Start Menu wouldn’t open. The only solution was to log off and delete the persona folder (most folders are redirected using native Folder Redirection, not the redirection feature of View Persona Management). While debugging this issue, I found this error in the event log.

If you google this, you will find many, many threads about it. Most solutions say that you have to delete the profile because of wrong permissions on profile folders and/or registry hives. I used Microsoft’s Procmon to verify this, but I was unable to confirm it. After further investigation, I found hints that the TileDataLayer database could be the problem. The database is located in AppData\Local\TileDataLayer\Database and stores the installed apps, programs, and tiles for the Start Menu. AFAIK it also includes the Start Menu layout.

Windows 10 Tile Data Layer Files

Patrick Terlisten/ www.vcloudnine.de/ Creative Commons CC0

The database is part of the local part of the profile. A quick check proved that it’s sufficient to delete the TileDataLayer folder. It will be recreated during the next logon.

The solution

It’s simple: Don’t roam AppData\Local. It should not be necessary to roam the local part of a user’s profile. View Persona Management offers an option to roam the local part of the profile. You can configure this behaviour with a GPO setting.

Horizon View Persona Management Roaming GPO Settings

You can find this setting under Computer Configuration > Administrative Templates > VMware View Agent Configuration > Persona Management > Roaming & Synchronization

I was able to reproduce the observed behavior in my lab with Windows 10 Enterprise (LTSB 2016) and Horizon View 7.0.3. Because of this, I tend to recommend not to roam AppData\Local.

Checking the 3PAR Quorum Witness appliance

This posting is ~2 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Two 3PAR StoreServs running in a Peer Persistence setup lost the connection to the Quorum Witness appliance. The appliance is an important part of a 3PAR Peer Persistence setup, because it acts as a tie-breaker in a split-brain scenario.

While analyzing this issue, I saw this message in the 3PAR Management Console:

3PAR Quorum Witness Status

In addition to that, the customer got e-mails that the 3PAR StoreServ arrays lost the connection to the Quorum Witness appliance. In my case, the CouchDB process died. A restart of the appliance brought it back online.

How to check the Quorum Witness appliance?

You can check the status of the appliance with a simple web request. The documentation shows a simple test based on curl. You can run this directly from the Bash shell of the appliance.

But you can also use the PowerShell cmdlet Invoke-WebRequest.

If you add /witness to the URL, you can test the access to the database, which is used for Peer Persistence.

If you get a connection error, check if the beam process is running.

If not, reboot the appliance. This can be done without downtime. The appliance only comes into play if a failover occurs.
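The two web requests described above can also be scripted. This is only a minimal Python sketch, not the check from the vendor documentation. It assumes that the appliance answers plain HTTP on port 8080 (adjust this to your setup). The names witness_urls, check_url and check_witness are made up for this example.

```python
import json
import urllib.request

# Assumption: the Quorum Witness answers plain HTTP on this port.
DEFAULT_PORT = 8080


def witness_urls(host, port=DEFAULT_PORT):
    """Return the base URL and the /witness database URL of the appliance."""
    base = f"http://{host}:{port}"
    return [base, base + "/witness"]


def check_url(url, fetch=urllib.request.urlopen, timeout=5):
    """Request one URL and return (ok, detail) instead of raising an error."""
    try:
        with fetch(url, timeout=timeout) as response:
            body = response.read().decode("utf-8", errors="replace")
    except OSError as exc:
        # urllib.error.URLError and HTTPError are subclasses of OSError.
        return False, f"request failed: {exc}"
    try:
        return True, json.loads(body)
    except ValueError:
        # The service answered, but not with JSON. Return the raw text.
        return True, body.strip()


def check_witness(host, port=DEFAULT_PORT, fetch=urllib.request.urlopen):
    """Check both URLs and return a dict of url -> (ok, detail)."""
    return {url: check_url(url, fetch=fetch) for url in witness_urls(host, port)}
```

Calling check_witness with the name or IP address of your appliance returns one entry per URL. If the first entry already fails, the service itself is down. If only the /witness entry fails, the service answers but the database used for Peer Persistence is not reachable.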

Replacing an expired lookup service SSL certificate on a vSphere PSC

This posting is ~3 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

A few days ago, I ran into a very nasty problem. Fortunately, it was in my lab. Some months ago, I replaced the certificates of my vCenter Server Appliance (VCSA), and I chose to use the VMware Certificate Authority (VMCA) as a subordinate of my AD-based enterprise CA, so the VMCA acted as an intermediate CA. The certificates were replaced using the vSphere 6.0 Certificate Manager (/usr/lib/vmware-vmca/bin/certificate-manager), and I followed the instructions of KB2112016 (Configuring VMware vSphere 6.0 VMware Certificate Authority as a subordinate Certificate Authority).

The VCSA was migrated from vSphere 5.5, and with vSphere 5.5 I was also using custom certificates. These certificates were also issued by my AD-based enterprise CA, and they were migrated during the vSphere 5.5 > 6.0 migration. So in the end, I replaced custom certificates with VMCA (as an intermediate CA) certificates.

Everything was fine, until a power outage. After powering on my VMs, I noticed several errors. After logging into the vSphere Web Client, I got an error message at the top of the page:

While searching for the cause, I checked the URL of the Platform Services Controller (https://vcsa1.lab.local/psc/login) and got this:

psc_error_1

This error led me to KB2144086 (Updating certificates using certificate manager on vCenter Server or PSC 6.0 Update 1b fails), but I was able to prove that I had used different subject names for the different solution user certificates.

While digging in the PSC logs, I found this error in the /var/log/vmware/psc-client/psc-client.log:

Finally, I found Aaron Smith’s blog post “Troubleshooting Expired PSC Certificates with vSphere 6”. He had the same problem. I checked the certificate of the Lookup Service and there it was:

psc_error_2

This was the original custom certificate, issued by my AD-based enterprise CA, and installed on my vSphere 5.5 VCSA.

Aaron also offered the solution by referencing KB2118939 (Replacing the Lookup Service SSL certificate on a Platform Services Controller 6.0). I followed the instructions in KB2118939 and replaced the certificate of the Lookup Service with a certificate of the VMCA.

Take care of your certificates

With vSphere 6.0, the Lookup Service should be accessed through the HTTP Reverse Proxy. This proxy uses the machine certificate. Therefore, an expired Lookup Service certificate is not obvious. If you connect directly to the Lookup Service using port 7444, you will see the expired certificate. The Lookup Service certificate is not replaced with a custom certificate when you replace the different solution user certificates.

If you have a vSphere 6.0 VCSA which was migrated from vSphere 5.5, and you replaced the certificates on that vSphere 5.5 VCSA with custom certificates, you should check your Lookup Service certificate immediately! Follow KB2118939 for further instructions.
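To keep an eye on the expiry date, a small helper is enough. The following Python sketch is only an illustration and not part of KB2118939. It assumes that you already have the notAfter date of the certificate as text, for example from openssl s_client -connect vcsa1.lab.local:7444 piped into openssl x509 -noout -enddate. The function names are made up for this example.

```python
import ssl
import time

SECONDS_PER_DAY = 86400


def days_until_expiry(not_after, now=None):
    """Return the number of days until not_after (negative if already expired).

    not_after uses the certificate date format "Mon DD HH:MM:SS YYYY GMT",
    for example "Jan  5 09:34:43 2018 GMT".
    """
    expires = ssl.cert_time_to_seconds(not_after)
    if now is None:
        now = time.time()
    return (expires - now) / SECONDS_PER_DAY


def is_expired(not_after, now=None):
    """Return True if the certificate date lies in the past."""
    return days_until_expiry(not_after, now) < 0
```

The optional now parameter takes a Unix timestamp, so you can also ask whether a certificate will still be valid at some point in the future, e.g. 30 days from today.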

Credit to Aaron Smith for this blog post. Thank you!

Solving problems: A structured approach

This posting is ~3 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

What is a problem? A problem is an obstacle that has to be surmounted. Solving a problem means overcoming obstacles. Or more generally: Problem solving is the process of getting from an unsatisfactory to a satisfactory situation.

Most of us get paid for solving problems. It’s irrelevant whether you are paid for solving technical problems (e.g. My computer doesn’t work), or whether you are paid to create solutions for customers (e.g. design infrastructure for a Citrix XenApp farm). In the end, you solve a problem.

Every problem has characteristics that can be used to describe it.

  • Solubility

Not every problem is solvable. Think about “Squaring the circle”. But often a problem seems to be unsolvable because it’s not well defined. If the initial situation, the obstacle and the target situation are not clearly formulated, you won’t be able to solve the problem.

  • Decomposability

If you can decompose a problem into multiple subproblems, it is a hierarchical problem. Otherwise, it’s an elemental problem.

  • Effort

The effort to solve a problem differs from problem to problem. A problem may be theoretically solvable, but it may require such a high effort that it is practically unsolvable.

  • Subjectivity

Even if a problem is well defined, different people perceive its complexity differently.

How to start?

First of all

  • Understand and define the problem

This is the most important part. Before you try to solve a problem, make sure that you have really understood it. Then define the problem. Only a clearly defined problem can be solved, and a clearly defined problem is much easier to solve than a vague one. If it’s a complex problem, then you should try to

  • Simplify or decompose the problem

A simplified problem can help you stay focused. If you can’t simplify a problem, you can try to divide it into subproblems. With a clearly defined, simplified or structured problem, you can start to

  • Find the root cause

Collecting information is the key. Collect information about what happened before, during, and after the problem occurred. Identifying the root cause of a problem can be a time-consuming task. But let me say this clearly: Information is the key. Information that helps to find the root cause is not limited to observations (e.g. logs, error messages etc.). You can also use the results of systematic tests. Collect as much data as you can.

Sometimes it can be useful to create a hypothesis.

Scientists generally base scientific hypotheses on previous observations that cannot satisfactorily be explained with the available scientific theories.

If you see that system A is affected, and system B should be affected too but is not, it might be time to create a hypothesis. With a hypothesis in mind, you can try to prove it. Test the hypothesis by performing tests and collecting data. This strategy is called “hypothesis testing”.

At some point, you should have identified the root cause. With the now known root cause, you can

  • Create solutions and select the best one

Sometimes it’s easy. But sometimes it’s not that easy. A trade-off analysis can help to identify the best of multiple solutions.

  • Create an action plan

Even if you only have to disable a specific feature, it’s a good idea to formulate an action plan, even if it consists of only three lines. You should state clearly

  • WHAT you do
  • WHY you have to do it, and
  • HOW you plan to check it

With these steps, you should be well prepared. It doesn’t matter what kind of problem you are trying to solve: The process is basically the same.

Other problem solving methods

Over the years many problem solving methods have been developed. Kepner-Tregoe is one of them. Other well known methods are:

  • A3 Problem Solving
  • PDCA
  • Eight Disciplines (8D) Problem Solving
  • Failure mode and effects analysis (FMEA)

A3 Problem Solving was developed at Toyota for the Toyota Production System (TPS). It’s an often-used method in Lean Manufacturing. A3 helps to solve problems by prescribing a structure (WHAT IS and WHAT IS NOT the problem, describe the problem, root cause, solution etc.). This structure is placed on a sheet of A3 paper (that’s why it’s called A3). The process is based on the principles of Deming’s PDCA cycle.

PDCA, or Plan-Do-Check-Act (sometimes called the Shewhart cycle), was made popular by Dr. W. Edwards Deming. Plan-Do-Check-Act refers to the four phases of this cycle.

  • Plan: Plan the change
  • Do: Implement the change
  • Check: Check the success of the implemented change
  • Act: Take action based on the results of “Check”

Eight Disciplines (8D) Problem Solving was developed by the Ford Motor Company. The D0 phase is the starting point for the 8D process, but it’s not counted.

  • D0: Plan for solving the problem and determine the prerequisites
  • D1: Establish a team of people with the required skills and knowledge
  • D2: Describe the problem
  • D3: Define and implement containment actions
  • D4: Determine and verify the root causes
  • D5: Plan permanent corrective actions for the observed problem
  • D6: Implement the best permanent corrective actions
  • D7: Modify management systems to prevent a recurrence
  • D8: Congratulate your team!

Failure mode and effects analysis (FMEA) is a highly structured, systematic approach to failure analysis. There are different types of FMEA:

  • Functional
  • Design
  • Process

FMEA is based on inductive reasoning (forward logic) and follows a highly structured process, which can be represented as follows.

  • Structural analysis: A system is divided into its components
  • Functional analysis: Identify the function of each component
  • Failure analysis: Identify the possible failures for each component
  • Calculate the risk: Risk Priority Number = occurrence ranking x detection ranking x highest severity ranking
  • Optimize: Optimize the component to mitigate the risk
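The risk calculation is easy to try out. The following Python sketch computes the Risk Priority Number for a few failure modes and sorts them by risk. It assumes the common 1 to 10 ranking scale. The components and rankings are made-up example values, and the names FailureMode and rank_by_risk are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    failure: str
    severity: int    # assumed scale: 1 (negligible) to 10 (catastrophic)
    occurrence: int  # assumed scale: 1 (rare) to 10 (almost certain)
    detection: int   # assumed scale: 1 (always detected) to 10 (undetectable)

    @property
    def rpn(self):
        """Risk Priority Number = occurrence x detection x severity."""
        return self.occurrence * self.detection * self.severity


def rank_by_risk(modes):
    """Return the failure modes sorted by RPN, highest risk first."""
    return sorted(modes, key=lambda mode: mode.rpn, reverse=True)


# Made-up example data for illustration only.
modes = [
    FailureMode("Power supply", "Fan failure", severity=5, occurrence=6, detection=2),
    FailureMode("RAID controller", "Cache battery dead", severity=8, occurrence=3, detection=4),
]

for mode in rank_by_risk(modes):
    print(mode.rpn, mode.component, mode.failure)
```

The failure mode with the highest RPN is the first candidate for the optimize step.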

No matter what, stay organized

The key to solving problems successfully is to stay organized. Solving problems isn’t magic. It is a very structured process that gets better with increasing experience. Try to create your own structured method. Or use one of the mentioned problem solving methods. But in general:

  • Always try to describe a problem
  • Try to simplify or break it into smaller problems
  • Search for the root cause and verify it
  • Develop a solution

Windows receives wrong DNS server from DHCP after DHCPINFORM

This posting is ~3 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Last week, I was unexpectedly booked by a customer who had observed a problem in his network. Unfortunately, colleagues had worked on this network some days before (moving servers, routers etc. to a new pair of HP 7509 core switches).

It was quickly clear that some of the clients had received the wrong DNS servers from the DHCP server. The environment is a bit unusual. The customer is running two Active Directory domains (root and sub domain) in a single layer 2 broadcast domain. This is nothing unusual, but he is also running two DHCP servers in the same layer 2 broadcast domain. To get this working, the customer uses exclusion ranges and reservations. This guarantees that each client receives the correct DHCP information.

Observations

This is the (anonymized) output of a SUBDOM client with the correct IP settings.

And this is the output of the same client, after a reboot:

As you can see, the client got its DHCP information from the same DHCP server, but with the wrong DNS settings. The DNS servers are the servers from the ROOTDOM.

  • only clients from SUBDOM were affected
  • only Windows XP and Windows 7 clients were affected
  • Windows 8.1 and Windows 10 clients were not affected
  • Igel Thin Clients were not affected
  • after an “ipconfig /renew”, the correct DNS servers were registered
  • after an “ipconfig /release” and “ipconfig /renew”, the wrong DNS servers were registered
  • the same happened after a reboot
  • a Wireshark packet trace showed that the correct DNS information was included in the DHCPOFFER

The sum of these observations told me that this had nothing (or little) to do with the network changes. Interestingly, the correct DNS information was included in the DHCPOFFER and the behaviour was only observed on Windows XP and Windows 7. In addition, only clients of SUBDOM were affected.

The smoking gun

The packet trace with Wireshark showed that the correct DNS information was included in the DHCPOFFER. But I also saw that the client had sent a DHCPINFORM, which was answered by all available DHCP servers.

dhcp_inform

DHCPINFORM is used by a client to discover more information, e.g. router, proxy, static routes… or DNS. The DHCPINFORM request was only sent after a reboot, or after a DHCPRELEASE and a subsequent DHCPDISCOVER. I saw that all available DHCP servers answered the DHCPINFORM request with a DHCPACK. This DHCPACK included the requested information, including DNS. I quickly developed the hypothesis that the client used the DHCPACK from the ROOTDOM DHCP servers and thereby added the wrong DNS information to its configuration.

With this information, I quickly found references (Meraki, Laurent Gaffié) to a registry key that can be used to disable DHCPINFORM. This registry key is valid for Windows 2000, 2003, Windows XP, Windows Vista and Windows 7. The blog post from Laurent Gaffié is especially interesting:

A vulnerability in Windows DHCP (http://www.ietf.org/rfc/rfc2131.txt) was found on Windows OS versions ranging from Windows 2000 through to Windows server 2003.  This vulnerability allows an attacker to remotely overwrite DNS, Gateway, IP Addresses, routing, WINS server, WPAD, and server configuration with no user interaction.

It’s useful to disable DHCPINFORM, even if you don’t have a problem!

Disable DHCPINFORM

To disable DHCPINFORM, you must add a registry key for the network interface that shouldn’t send DHCPINFORM messages.

Unfortunately, the GUID of the interface differs between clients. I built this Visual Basic script (like Dr. Frankenstein: Different sources, plugged together, but it works) to add the registry key. You can run this script as part of a startup script with a Group Policy.

You should test this script very carefully! I provide this script “AS IS” with no warranties.
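For illustration only, the idea behind such a script can also be sketched in Python. This is not the Visual Basic script mentioned above. The sketch assumes that the per-interface value is a DWORD named UseInform and that 0 disables DHCPINFORM. Please verify this against the references before you use it. It only builds the text of a .reg file for review and does not change the registry. The function names are made up.

```python
# Assumption: interface settings live under this key, one subkey per GUID.
INTERFACES_KEY = r"SYSTEM\CurrentControlSet\Services\Tcpip\Parameters\Interfaces"
# Assumption: this DWORD value controls DHCPINFORM (0 = do not send).
VALUE_NAME = "UseInform"


def list_interface_guids():
    """Return the interface GUID subkeys of the local machine (Windows only)."""
    import winreg  # only available on Windows

    guids = []
    with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, INTERFACES_KEY) as key:
        subkey_count = winreg.QueryInfoKey(key)[0]
        for index in range(subkey_count):
            guids.append(winreg.EnumKey(key, index))
    return guids


def build_reg_file(guids):
    """Return .reg file content that sets the value to 0 for each interface."""
    lines = ["Windows Registry Editor Version 5.00", ""]
    for guid in guids:
        lines.append(f"[HKEY_LOCAL_MACHINE\\{INTERFACES_KEY}\\{guid}]")
        lines.append(f'"{VALUE_NAME}"=dword:00000000')
        lines.append("")
    return "\r\n".join(lines)
```

On a Windows client, build_reg_file(list_interface_guids()) covers every interface of that machine, which avoids hard-coding a GUID. Review the generated content on a test client before you roll anything out.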

I don’t know why this has happened. I assume that the customer had this problem for some time. But due to some strange effects, he never noticed it. One hypothesis is, that the sequence of the DHCPACK messages after a DHCPINFORM has an influence. The DHCP server of ROOTDOM was moved to the new core switches, and maybe this changed the sequence of the answer packets. But it’s only a hypothesis, not a theory.

VMware Update Manager reports “error code 99” during scan operation

This posting is ~3 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

After updating my lab to VMware vSphere 6.0 U2, one of my hosts kept throwing an error during an update scan.

The first thing I checked was the esxupdate.log on the affected ESXi host. This is the output that was logged during a scan operation.

You might notice the “Unrecognized file vendor-index.xml in Metadata file” error. I also found this error message on the other hosts, so I excluded it from further research. It was unlikely that this error was related to the observed problem. I started searching for differences between the hosts and found out that the output of “esxcli software vib list” was different on the faulty host.

This is the output on the faulty host:

This is the output on a working host. You see the difference?

Doesn’t look right… I investigated further, still searching for differences. And then I found two empty directories under /var/db/esximg.

The same directory was populated on other, working hosts.

One possible solution was therefore to copy the missing files to the faulty host. I used SCP for this. To get SCP working, you have to enable the SSH Client in the ESXi firewall.

After that, I copied the files from a working host to the faulty host. Please make sure that the hosts have the same build! In my case, both hosts had the same build. Don’t try to copy files from an older or newer build to the host!

And because we are pros, we disable the SSH Client after using it.

As expected, “esxcli software vib list” was working again.

A rescan operation in the vSphere Client was also successful. It seems that the root cause of the problem was the missing files under /var/db/esximg.

Please don’t ask why this has happened. I really have no idea. But VMware KB2043170 (Initializing the VMware vCenter Update Manager database without reinstalling it) isn’t always the solution for “error code 99”, as is sometimes claimed on the internet. Always try to analyze the problem and separate likely from unlikely solutions.

Data Protector: Exchange backup fails because of database lock

This posting is ~3 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Today I had a customer call where an Exchange 2010 backup repeatedly failed. HPE Data Protector was unable to create a differential or incremental backup. For each database, the following error was logged:

Interestingly, there was no other backup session running. But the night before, the backup jobs failed because of a network failure.

The solution is easy. This error is caused by stale information in the Data Protector database. To remove it, open an administrative CMD on the Data Protector Cell Manager and run this omnidbutil command:

This command will free up the locked resources in the Data Protector database. Then, run the job again.

ALE OmniSwitch stack does not form due to incompatible licenses

This posting is ~3 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Today I saw an interesting behaviour of two Alcatel-Lucent Enterprise OmniSwitch 6450 switches. Both switches had been configured as a stack, but one of the switches showed a flashing ID after startup, and the stack was not formed. When I checked the logs and the status of the stack, I noticed that the slot number was incorrect. Furthermore, the status showed “INC-LIC”.

According to the stack status and the switch logs, there seemed to be a problem with the licenses. So I checked the installed licenses on both switches. One switch showed a Metro license:

The other switch did not:

Don’t be confused by the slot numbering. I pulled the stacking cable.

The solution was easy: I removed the Metro license, and after a reboot of the switch from which I had removed the license, the stack formed properly.

Using VCSA as remote syslog – Don’t forget the log rotation!

This posting is ~4 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.
Important note: It seems that vCenter Server Appliance updates revert the changes. Please check the settings after each update!

The VMware vCenter Server Appliance (VCSA) can act as a remote syslog destination for ESXi hosts. This is very handy for troubleshooting and I really recommend using this feature. But VMware ESXi hosts can be really chatty, and therefore it’s a good idea to keep an eye on the free disk space of the VCSA.

Yesterday, a colleague had an interesting support case. A customer reported that his Veeam Backup & Replication jobs failed and that he was unable to log in to the vCenter with the vSphere Client and vSphere Web Client. My colleague checked the VCSA VM and noticed that the VPXD failed to start (“Waiting for vpxd to initialize: ….failed”).

Together we checked the appliance and the log files. The vpxd.log (/var/log/vmware/vpx) had last been updated weeks ago, but the last entry was interesting: No space left on device. But there was free disk space on /storage/log. I immediately checked the inode count with df -i and there it was: No free inodes.

Why is this a problem? Each file and directory in the file system consumes an inode. If there are no free inodes, no new directories and files can be created. The error message is the same as for missing disk space.

Something had to have created a lot of files on /storage/log. Because /var/log/vmware is a symbolic link to /storage/log/vmware, it had to be something on the /storage/log partition. We checked the remote syslog location under /storage/log/remote and found gigabytes of data and an incredible number of logs. After removing the logs, the VPXD was able to start and the inode count was back at a normal level.

But why were there so many logs? We checked the logrotate config and found a faulty config for the remote syslog files. Instead of rotating logs and removing old ones, this config rotated all logs every day and multiplied the number of logs. Please note that there is no logrotate config to rotate remote syslog files by default! This one was added manually.
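The inode check with df -i can also be scripted, for example for a monitoring job. The following Python sketch is only an illustration. It reads the inode counters of a mount point with os.statvfs, which is available on Unix-like systems. The function names and the 90 percent threshold are made-up assumptions.

```python
import os


def inode_usage(path):
    """Return inode totals for the file system that contains path."""
    stats = os.statvfs(path)
    total, free = stats.f_files, stats.f_ffree
    # Some file systems report 0 total inodes. Avoid a division by zero.
    used_pct = None if total == 0 else round(100 * (total - free) / total, 1)
    return {"path": path, "total": total, "free": free, "used_pct": used_pct}


def low_on_inodes(usage, threshold_pct=90.0):
    """Return True if the inode usage is at or above the threshold."""
    return usage["used_pct"] is not None and usage["used_pct"] >= threshold_pct
```

On the VCSA, you would call inode_usage("/storage/log") and raise an alarm when low_on_inodes returns True. This catches the problem before the “No space left on device” error shows up.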

This is the default config for the remote syslog-collector of the VCSA:

As you can see, with these settings a folder for each host and each month is created. According to this VMTN posting, we changed the syslog-collector config a bit:

With these settings, only a single file per host is created. We also made a change to /etc/logrotate.d/syslog and added this at the end:

With this configuration, 30 log files will be preserved. The number of log files, or how often log rotation should happen (weekly or daily), can easily be adjusted. But these settings should be sufficient for small environments.

It’s important to understand that the VCSA has different disks and that the disks are mounted to different mount points within the root filesystem. This is from a vSphere 5.5 VCSA:

/var/log/vmware and /var/log/remote are links to /storage/log/vmware and /storage/log/remote. Make sure that there is always enough free disk space on ALL disks! I also want to highlight VMware KB2092127 (After upgrading to vCenter Server Appliance 5.5 Update 2, pg_log file reports this error: WARNING: there is already a transaction in progress). This error hit me a couple of times…