Category Archives: Server

Virtually reseated: Reset blade in a HPE C7000 enclosure

After a reboot, a VMware ESXi 6.7 U3 told me that he has no compatible NICs. Fun fact: Right before the reboot everything was fine.

The ILO also showed no NICs. Unfortunately, I wasn’t onsite to pull the blade server and put it back in. But there is a way to do this “virtually”.

You have to connect to the IP address of the Onboard Administrator via SSH. Then issue the reset server command with the bay of the server you want to reset and an argument.

The server will power up automatically. Please note, that the OBA is unable to display certain information right after this operation. It will take a couple of minutes until all information, like serial number or device bay name are visible again.

Update Manager fails with unknown error during host remediation

During an vSphere 6.5 > 6.7 update a was host failing continously at the remediation with an “unknown error”. The host was updated from ESXI 6.5 to 6.7 using an upgrade baseline. Other hosts were updated to 6.7 and with the latest patches without any issues. Something strange was going on…

The esxupdate.log and the vua.log on the host itself showed nothing special. So I checked the vmware-vum-server-log4cpp.log which was much more informative!

Well… ERROR: duplicate key value violates unique constraint “pk_vci_scanresults” is not what I expected, but it is an error, and it occured everytime I tried to remediate the host.

Google found nothing about this error, so I decided to reset the VUM database. Please don’t try this at your customer! Log a call at VMware.

To reset the VUM database:

  1. Connect to vCenter Server Appliance via SSH
  2. Switch to the BASH 
  3. Stop the VMware Update Manager Service with this command

    service-control –stop vmware-updatemgr
     
  4. To reset the VMware Update Manager Database (applies only to VCSA 6.7 and 7.0!)

    /usr/lib/vmware-updatemgr/bin/updatemgr-utility.py reset-db
  1. Delete the contents of the VMware Update Manager Patch Store

    rm -rf /storage/updatemgr/patch-store/*
     
  2. Start the VMware Update Manager Service again

    service-control –start vmware-updatemgr

You will lose all your baselines, so you have to configure them again. And you need to download all patches again.

For vSAN environments this procedure will also remove the vSAN default baselines, but they will recreated automatically when there is a configuration change to vSAN or an update to the HCL DB. Again: Don’t do this at home!

Fan health sensors report false alarms on HPE Gen10 Servers with ESXi 6.7

I’ve got several mails and comments about this topic. It looks like that the latest ESXi 6.7 updates are causing some trouble on HPE ProLiant Gen10 servers.

I’ve blogged about recurring host hardware sensor state alarm messages some weeks ago. A customer noticed them after an update. Last week, I got the first comments under this blog post abot fan failure messages after applying the latest ESXi 6.7 updates. Then more and more customers asked me about this, because they got these messages too in their environment after applying the latest updates.

Last Saturday I tweeted my blog post to give a hint to my followers who may be experiencing the same problem.

Fortunately one of my followers (Thanks Markus!) pointed me to a VMware KB article with a workaround: Fan health sensors report false alarms on HPE Gen10 Servers with ESXi 6.7 (78989).

This is NOT a solution, but a workaround. Keep that in Mind.

Thanks again to Markus. Make sure to visit his awesome blog (MY CLOUD-(R)EVOLUTION) , especially if you are interested in vSphere, Veeam and automation!

Notes for a 2-Tier Microsoft Windows PKI

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Implementing a public key infrastructure (PKI) is a recurring task for me. More and more customers tend to implement a PKI in their environment. Mostly not to increase security, rather then to get rid of browser warnings because of self-signed certificates, to secure intra-org email communication with S/MIME, or to sign Microsoft Office macros.

tumbledore / pixabay.com/ Pixybay License

What is a 2-tier PKI?

Why is a multi-tier PKI hierarchy a good idea? Such a hierarchy typically consits of a root Certificate Authority (CA), and an issuing CA. Sometimes you see a 3-tier hierarchy, in which a root CA, a sub CA and an issuing CA are tied together in a chain of trust.

A root CA issues, stores and signs the digital certificates for sub CA. A sub CA issues, stores and signs the digital certificates for issuing CA. Only an issuing CA issues, stores and signs the digital certificates for users and devices.

In a 2-tier hierarchy, a root CA issues the certificate for an issuing CA.

In case of security breach, in which the issuing CA might become compromised, only the CA certificate for the issuing CA needs to be revoked. But what of the root CA becomes compromised? Because of this, a root CA is typically installed on a secured, and powered-off (offline) VM or computer. It will only be powered-on to publish new Certificate Revocation Lists (CRL), or to sign/ renew a new sub or issuing CA certificate.

Lessons learned

Think about the processes! Creating a PKI is more than provisioning a couple of VMs. You need to think about processes to

  • request
  • sign, and
  • revoke

Be aware of what a digital certificate is. You, or your CA, confirms the identity of a party by handing out a digital certificate. Make sure that no one can issue certificates without a proof of his identity.

Think about lifetimes of certificates! Customers tend to create root CA certificates with lifetimes of 10, 20 or even 40 years. Think about the typical lifetime of a VM or server, which is necessary to run an offline root CA. Typically the server OS has a lifetime of 10 to 12 years. This should determine the lifetime of a root CA certificate. IMHO 10 years is a good compromise.

For a sub or issuing CA, a lifespan of 5 years is a good compromise. Using the same lifetime as for a root CA is not a good idea, because an issued certificate can’t be longer valid than the lifetime of the CA certificate of the issuing CA.

A lifespan of 1 to 3 years for thinks like computer or web server certificates is okay. If a certificate is used for S/MIME or code signing, you should go for a lifetime of 1 year.

But to be honest: At the end of the day, YOU decide how long your certificates will be valid.

Publish CRLs and make them accessable! You can’t know if a certificate is revoked by a CA. But you can use a CRL to check if a certificate is revoked. Because of this, the CA must publish CRLs regulary. Use split DNS to use the same URL for internal and external requests. Make sure that the CRL is available for external users.

This applies not only to certificates for users or computers, but also for sub and issuing CAs. So there must be a CRL from each of your CAs!

I recommend to publish CRLs to a webserver and make this webserver reachable over HTTP. An issued certificate includes the URL or path to the CRL of the CA, that has issued the certificate.

Make sure that the CRL has a meaningful validity period. Of an offline root CA, which issues only a few certificates of its lifetime, this can be 1 year or more. For an issuing CA, the validity period should only a few days.

Publish AIA (Authority Information Access) information and make them accessable! AIA is an certificate extension that is used to offer two types of information :

  • How to get the certificate of the issuing or upper CAs, and
  • who is the OCSP responder from where revocation of this certificate can be checked

I tend to use the same place for the AIA as for the CDP. Make sure that you configure the AIA extension before you issue the first certificates, especially configure the AIA and CDP extension before you issue intermediate and issuing CA certificates.

Use a secure hash algorithm and key length! Please stop using SHA1! I recommend at least SHA256 and 4096 bit key length. Depending on the used CPUs, SHA512 can be faster than SHA256.

Create a CApolicy.inf! The CApolicy.inf is located uder C:\Windows and will will be used during the creation of the CA certificate. I often use this CApolicy.inf files.

For the root CA:

For the issuing CA:

Final words

I do not claim that this is blog post covers all necessary aspects of such an complex thing like an PKI. But I hope that I have mentioned some of the important parts. And at least: I have a reference from which I can copy and paste the CApolicy.inf files. :D

Out-of-Office replies are dropped due to empty MAIL FROM

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Today I had an interesting support call. A customer noticed that Out-of-Office replies were not received by recipients, even though the OoO option were enabled for internal and external recipients. Internal recipients got the OoO reply, but none of the external recipients.

cattu/ pixabay.com/ Creative Commons CC0

The Message Tracking Log is a good point to start. I quickly discovered that the Exchange server was unable to send the OoO mails. You can use the eventid FAIL to get a list of all failed messages.

Very interesting was the RecipientStatus of a failed mail.

550 Requested action not taken: mailbox unavailable  is a pretty interesting error when sending mails over a mail relay of your ISP. Especially when other mails were successfully sent over the same mail relay.

Next stop: Protcol log of the send connector

I enabled the logging on the send connector using the EAC. This option is disabled by default. Depending on the amount of mails sent over the connector, you should make sure to disable the logging after your troubleshooting session. To enable the logging, follow these steps:

  • Open the EAC and navigate to
  • Mail flow > Send connectors
  • Select the connector you want to configure, and then click Edit
  • On the General tab in the Protocol logging level section, select the Verbose option
  • When you’re finished, click Save

The protocol log can be found under %ExchangeInstallPath%TransportRoles\Logs\Hub\ProtocolLog\SmtpSend.

After enabling the logging and another test mail, the log contained the necessary details to find the root cause. This is the interesting part of the SMTP communication:

The error occured right after the exchange server issued MAIL FROM:<> . But why is the MAIL FROM empty?

RFC 2298 is the key

An Out-of-Office reply is a Delivery Status Notification message. And RFC 2298 clearly states:

The envelope sender address (i.e., SMTP MAIL FROM) of the MDN MUST be
null (<>), specifying that no Delivery Status Notification messages
or other messages indicating successful or unsuccessful delivery are
to be sent in response to an MDN.

So the empty MAIL FROM is something that a mail relay should expect. In case of my customer that mail relay seems to act different. Maybe some kind of spam protection.

Database Availability Group (DAG) witness is in a failed state

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

As part of a maintenance job I had to update a 2-node Exchange Database Availability Group and a file-share witness server.

After the installation of Windows updates on the witness server and the obligatory reboot, the witness left in a failed state.

In my opinion, the re-creation of the witness server and the witness directory cannot be the correct way to solve this. There must be another way to solve this. In addition to this: The server was not dead. Only a reboot occured.

Check the basics

Both DAG nodes were online and working. A good starting point is a check of the cluster resources using the PowerShell.

In my case the cluster resource for the File Share Witness was in a failed state. A simple Start-ClusterResource  solved my issue immediately.

In this case, it seems that the the cluster has marked the file share witness as unreliable, thus the resource was not started after the file share witness was back online again. In this case, I managed it to manually bring it back online by running  Start-ClusterResource  on one of the DAG members.

Using Let’s Encrypt DNS-01 challenge validation with local BIND instance

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

I’m using Let’s Encrypt certificates for a while now. In the past, I used the standalone plugin (TLS-SNI-01) to get or renew my certificates. But now I switched to the DNS plugin. I run my own name servers with BIND, so it was a very low hanging fruit to get this plugin to work.

Clker-Free-Vector-Images/ pixabay.com/ Creative Commons CC0

To get or renew a certificate, you need to provide some kind of proof that you are requesting the certificate for a domain that is under your control. No certificate authority (CA) wants to be the CA, that hands you out a certificate for google.com or amazon.com…

The DNS-01 challenge uses TXT records in order to validate your ownership over a certain domain. During the challenge, the Automatic Certificate Management Environment (ACME) server of Let’s Encrypt will give you a value that uniquely identifies the challenge. This value has to be added with a TXT record to the zone of the domain for which you are requesting a certificate. The record will look like this:

This record is for a wildcard certificate. If you want to get a certificate for a host, you can add one or more TXT records like this:

There is a IETF draft about the ACME protocol. Pretty interesting read!

Configure BIND for DNS-01 challenges

I run my own name servers with BIND on FreeBSD. The plugin for certbot automates the whole DNS-01 challenge process by creating, and subsequently removing, the necessary TXT records from the zone file using RFC 2136 dynamic updates.

First of all, we need a new TSIG (Transaction SIGnature) key. This key is used to authorize the updates.

This key has to be added to the named.conf. The key is in the .key file.

The key is used to authroize the update of certain records. To allow the update of TXT records, which are needed for the challenge, add this to the zone part of you named.con.

The records start always with _acme-challenge.domainname.

Now you need to create a config file for the RFC2136 plugin. This file also includes the key, but also the IP of the name server. If the name server is running on the same server as the DNS-01 challenge, you can use 127.0.0.1 as name server address.

Now we have everything in place. This is a --dry-run  from on of my FreeBSD machines.

This is a snippet from the name server log file at the time of the challenge.

You might need to modify the permissons for the directory which contains the zone files. Usually the name server is not running as root. In my case, I had to grant write permissions for the “bind” group. Otherwise you might get “permission denied”.

 

Powering on a VM with shared VMDK fails after extending a EagerZeroedThick VMDK

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

I hope that you are not reading this blog post while searching for a solution for a failed cluster. If so, feel free to leave a comment if this blog post saved your evening or weekend. :)

Last friday, a change at one of my customers went horribly wrong. I was not onsite, but they contacted me during the night from friday to saturday, because their most important Windows Server Failover Cluster was unable to start after extending a shared VMDK.

cripi/ pixabay.com/ Creative Commons CC0

They tried something pretty simple: Extending an virtual disk of a VM. That is something most of us doing pretty often. The customer did this also pretty often. It was a well known task… Except the fact, that the VM was part of a Windows Server Failover Cluster. With shared VMDKs. And the disks were EagerZeroedThick, because this is a requirement for shared VMDKs.

They extended the disk using the vSphere Web Client. And at this point, the change was doomed to fail. They tried to power-on the VMs, but all they got was this error:

VMware ESX cannot open the virtual disk, “/vmfs/volumes/4c549ecd-66066010-e610-002354a2261b/VMNAME/VMDKNAME.vmdk” for clustering. Please verify that the virtual disk was created using the ‘thick’ option.

A shared VMDK is a VMDK in multiwriter mode. This VMDK has to be created as Thick Provision Eager Zeroed. And if you wish to extend this VMDK, you must use  vmkfstools  with the option -d eagerzeroedthick. If you extend the VMDK using the Web Client, the extended portion of the disk will become LazyZeroed!

VMware has described this behaviour in the KB1033570 (Powering on the virtual machine fails with the error: Thin/TBZ disks cannot be opened in multiwriter mode). There is also a blog post by Cormac Hogan at VMware, who has described this behaviour.

That’s a screenshot from the failed cluster. Check out the type of the disk (Thick-Provision Lazy-Zeroed).

Patrick Terlisten/ vcloudnine.de/ Creative Commons CC0

You must use vmkfstools  to extend a shared VMDK – but vmkfstools is also the solution, if you have trapped into this pitfall. Clone the VMDK with option -d eagerzeroedthick.

Another solution, which was new to me, is to use Storage vMotion. You can migrate the “broken” VMDK to another datastore and change the the disk format during Storage vMotion. This solution is described in the “Notes” section of KB1033570.

Both ways will fix the problem. The result will be a Thick Provision Eager Zeroed VMDK, which will allow the VMs to be successfully powered on.

CloudFlare API v4 and Fail2ban: Fixing the unban action

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

In January 2017, I wrote an article about how to protect your WordPress blog using the WP Fail2Ban plugin, fail2ban on your Linux/ FreeBSD host, and CloudFlare. Back then, the fail2ban was using the CloudFlare API V1, which was already deprecated since November 2016.

Free-Photos/ pixabay.com/ Creative Commons CC0

Although the actions were updated later to use CloudFlare API V4, I still had problems with the unbaning of IP addresses. IP addresses were banned, but the unban action failed. 

This is the unban action, which is included in fail2ban (taken from fail2ban-0.10.3.1 which is shipped with FreeBSD 11.1-RELEASE-p10):

And this is the unban action, which finally solved this issue:

I found the solution at serverfault.com. The only difference is an additional tr -d '\n'  in the last line of the statement. Kudos to Jake for fixing this!

To prevent the action file to being overwritten, you should copy the original cloudflare.conf  located in the  action.d  directory, e.g. to mycloudflare.conf , and use the copied action file in your fail definition.

Windows Network Policy Server (NPS) server won’t log failed login attempts

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

This is just a short, but interesting blog post. When you have to troubleshoot authentication failures in a network that uses Windows Network Policy Server (NPS), the Windows event log is absolutely indispensable. The event log offers everything you need. The success and failure event log entries include all necessary information to get you back on track. If failure events would be logged…

geralt/ pixabay.com/ Creative Commons CC0

Today, I was playing with Alcatel-Lucent Enterprise OmniSwitches and Access Guardian in my lab. Access Guardian refers to the some OmniSwitch security functions that work together to provide a dynamic, proactive network security solution:

  • Universal Network Profile (UNP)
  • Authentication, Authorization, and Accounting (AAA)
  • Bring Your Own Device (BYOD)
  • Captive Portal
  • Quarantine Manager and Remediation (QMR)

I have planned to publish some blog posts about Access Guardian in the future, because it is a pretty interesting topic. So stay tuned. :)

802.1x was no big deal, mac-based authentication failed. Okay, let’s take a look into the event log of the NPS… okay, there are the success events for my 802.1x authentication… but where are the failed login attempts? Not a single one was logged. A short Google search showed me the right direction.

Failed logon/ logoff events were not logged

In this case, the NPS role was installed on a Windows Server 2016 domain controller. And it was a german installation, so the output of the commands is also in german. If you have an OS installed in english, you must replace “Netzwerkrichtlinienserver” with “Network Policy Server”.

Right-click the PowerShell Icon and open it as Administrator. Check the current settings:

As you can see, only successful logon and logoff events were logged.

The option /success:enable /failure:enable activeates the logging of successful and failed logon and logoff attempts.