Tag Archives: microsoft

Load balancing ADFS and ADFS Proxy using Citrix ADC

This posting is ~4 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Last week I had to set up a small Active Directory Federation Services (ADFS) farm that will be used to allow Single Sign-On (SSO) with Office 365.

Active Directory Federation Services (ADFS) is a solution developed by Microsoft to provide users with authenticated access to applications that are not capable of using Integrated Windows Authentication (IWA).

The customer required a two-node ADFS farm located on the internal network, and a two-node ADFS Proxy farm located in the DMZ.

An ADFS Proxy server acts as a reverse proxy and is typically located in your organization’s perimeter network (DMZ).

This picture shows a typical ADFS/ ADFS Proxy setup:

ADFS/ WAP Design/ Citrix/ citrix.com

My customer has decided to use Citrix ADC (formerly NetScaler) to load balance the requests for the ADFS farm and the ADFS Proxy farm. In addition to load balancing, this offers high availability in case of a failed ADFS server or ADFS Proxy server. Please note that Citrix ADC can act as an ADFS Proxy, but this requires the Advanced Edition license. My customer “only” had a Standard license, so we had to set up dedicated ADFS Proxy servers on the DMZ network.

Citrix ADC setup

The ADFS service name is typically something like adfs.customer.tld. This farm name has to be the same for internal and external access. For internal access, the ADFS service name must resolve to the VIP of the Citrix ADC. The same applies to external access. So you have to set up split DNS.

ADFS uses HTTP and HTTPS, so my first attempt was to use this Citrix ADC Content Switch based setup:

add server srv_adfs1 x.x.x.x
add server srv_adfs2 x.x.x.y

add cs vserver cs_vsrv_adfs SSL x.x.x.x 443 -cltTimeout 180 -caseSensitive OFF
add lb vserver lb_vsrv_adfs SSL 0.0.0.0 0 -persistenceType SSLSESSION -cltTimeout 180

add cs action cs_action_adfs -targetLBVserver lb_vsrv_adfs
add cs policy cs_pol_adfs -rule "HTTP.REQ.URL.SET_TEXT_MODE(IGNORECASE).CONTAINS(\"adfs.customer.tld\")" -action cs_action_adfs
bind cs vserver cs_vsrv_adfs -policyName cs_pol_adfs -priority 100

add serviceGroup svcgrp_adfs SSL -maxClient 0 -maxReq 0 -cip ENABLED X-MS-Forwarded-Client-IP -usip NO -useproxyport YES -cltTimeout 180 -svrTimeout 360 -CKA NO -TCPB NO -CMP YES -appflowLog DISABLED

add lb monitor mon_adfs HTTP-ECV -send "GET /federationmetadata/2007-06/federationmetadata.xml" -recv "adfs.customer.tld/adfs/services/trust" -LRTM ENABLED -secure YES

bind serviceGroup svcgrp_adfs srv_adfs1 443 -CustomServerID "\"None\""
bind serviceGroup svcgrp_adfs srv_adfs2 443 -CustomServerID "\"None\""
bind serviceGroup svcgrp_adfs -monitorName mon_adfs

bind lb vserver lb_vsrv_adfs svcgrp_adfs

bind ssl vserver lb_vsrv_adfs -certkeyName cert-key-pair
bind ssl vserver cs_vsrv_adfs -certkeyName cert-key-pair

set ssl vserver lb_vsrv_adfs -ssl3 DISABLED
set ssl vserver cs_vsrv_adfs -ssl3 DISABLED

This is a pretty common setup for HTTP/HTTPS-based services. But it doesn’t work… Mainly because the monitor was not getting the required response. So the ADC considered the monitored service to be down, and therefore the service group, the load balancing virtual server and the content switch didn’t come up.

The reason for this is Server Name Indication (SNI), an extension to Transport Layer Security (TLS). SNI has been enabled and required since ADFS 3.0. The monitor tries to access the URL https://x.x.x.x/federationmetadata/2007-06/federationmetadata.xml, but the ADFS service won’t answer those requests, because they contain the IP address and not the ADFS service name.
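The effect of SNI can be reproduced outside the ADC. The following is a minimal Python sketch, not part of the original setup: the function name `handshake_succeeds` is made up for illustration, and certificate validation is deliberately switched off because only the handshake behaviour is of interest. How an individual ADFS node reacts to a handshake without SNI depends on its certificate bindings, so treat the result as a hint, not as proof.

```python
import socket
import ssl
from typing import Optional


def handshake_succeeds(address: str, port: int, sni_name: Optional[str],
                       timeout: float = 5.0) -> bool:
    """Return True if a TLS handshake with address:port completes.

    sni_name is sent as the SNI host name; None sends no SNI extension,
    which is roughly what a monitor does that only knows the IP address.
    """
    context = ssl.create_default_context()
    # Diagnostic only: skip certificate validation so that the result
    # reflects the handshake itself and not the trust chain.
    context.check_hostname = False
    context.verify_mode = ssl.CERT_NONE
    try:
        with socket.create_connection((address, port), timeout=timeout) as raw:
            with context.wrap_socket(raw, server_hostname=sni_name):
                return True
    except OSError:
        # Covers refused connections, timeouts and ssl.SSLError.
        return False
```

Calling it once with `sni_name=None` and once with the ADFS service name (for example adfs.customer.tld) against the IP address of a farm node shows whether the node only answers when the service name is part of the handshake.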

But there is a workaround for everything on the Internet! You can change the binding on the ADFS server nodes using netsh.

netsh http add sslcert ipport=<IPAddress:port> certhash=<certhash> appid=<appid> certstorename=MY

I will not add the necessary options to this command, because: DON’T DO THIS!

Yes, the service group, the load balancing virtual server and the content switch will come up after this change. But you will not be able to enable a trust between your ADFS Proxy servers and the ADFS farm.

Microsoft’s requirements on Load Balancing ADFS

Microsoft offers a nice overview about the requirements when deploying ADFS. There is a section about the Network requirements. Below this, Microsoft clearly documents the requirements when load balancing ADFS servers and ADFS Proxy servers.

The load balancer MUST NOT terminate SSL. AD FS supports multiple use cases with certificate authentication which will break when terminating SSL. Terminating SSL at the load balancer is not supported for any use case.

Requirements for deploying AD FS/ microsoft.com

Okay, with this in mind, you can’t use an ADC Content Switch as described above, because it will terminate SSL. You have to switch to a load balancing virtual server and a service group of type SSL bridge. Citrix describes SSL bridge as follows:

A SSL bridge configured on the NetScaler appliance enables the appliance to bridge all secure traffic between the SSL client and the SSL server. The appliance does not offload or accelerate the bridged traffic, nor does it perform encryption or decryption. Only load balancing is done by the appliance. The SSL server must handle all SSL-related processing. Features such as content switching, SureConnect, and cache redirection do not work, because the traffic passing through the appliance is encrypted.

But there is a second, very interesting statement:

It is recommended to use the HTTP (not HTTPS) health probe endpoints to perform load balancer health checks for routing traffic. This avoids any issues relating to SNI. The response to these probe endpoints is an HTTP 200 OK and is served locally with no dependence on back-end services. The HTTP probe can be accessed over HTTP using the path ‘/adfs/probe’
http://<Web Application Proxy name>/adfs/probe
http://<ADFS server name>/adfs/probe
http://<Web Application Proxy IP address>/adfs/probe
http://<ADFS IP address>/adfs/probe

Requirements for deploying AD FS/ microsoft.com

This is pretty interesting, because it addresses the issue with the monitor described above. The solution is an HTTP monitor on port 80 that sends a GET to “/adfs/probe” and checks for an HTTP 200 response.
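Before touching the ADC, the probe endpoint can be checked by hand. This is a small Python sketch of what such a health check does, assuming the node answers plain HTTP on port 80 as Microsoft describes; the function name `adfs_probe_ok` is hypothetical and only used here.

```python
import http.client


def adfs_probe_ok(host: str, port: int = 80, timeout: float = 5.0) -> bool:
    """Return True if GET /adfs/probe on host:port answers with HTTP 200."""
    connection = http.client.HTTPConnection(host, port, timeout=timeout)
    try:
        connection.request("GET", "/adfs/probe")
        return connection.getresponse().status == 200
    except (OSError, http.client.HTTPException):
        # Unreachable host, timeout or malformed answer: treat the node as down.
        return False
    finally:
        connection.close()
```

Because the probe uses HTTP and no TLS handshake is involved, it works with the IP address of a node as well as with its name.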

A working Citrix ADC setup

This setup is divided into two parts: One for the ADFS farm, and a second one for the ADFS Proxy farm. It uses SSL bridge and HTTP for the service monitor.

Load balancing the ADFS farm

add server srv_adfs1 x.x.x.x
add server srv_adfs2 x.x.x.y

add serviceGroup svcgrp_adfs SSL_BRIDGE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -cltTimeout 180 -svrTimeout 360 -CKA NO -TCPB NO -CMP NO
add lb vserver lb_vsrv_adfs SSL_BRIDGE x.x.x.z 443 -persistenceType SSLSESSION -cltTimeout 180
add lb monitor mon_adfs_http HTTP -respCode 200 -httpRequest "GET /adfs/probe" -LRTM ENABLED -destPort 80

bind serviceGroup svcgrp_adfs srv_adfs1 443
bind serviceGroup svcgrp_adfs srv_adfs2 443
bind serviceGroup svcgrp_adfs -monitorName mon_adfs_http
bind lb vserver lb_vsrv_adfs svcgrp_adfs
set ssl vserver lb_vsrv_adfs -ssl3 DISABLED

Load balancing the ADFS Proxy farm

add server srv_adfsproxy1 y.y.y.y
add server srv_adfsproxy2 y.y.y.x

add serviceGroup svcgrp_adfsproxy SSL_BRIDGE -maxClient 0 -maxReq 0 -cip DISABLED -usip NO -useproxyport YES -cltTimeout 180 -svrTimeout 360 -CKA NO -TCPB NO -CMP NO
add lb vserver lb_vsrv_adfsproxy SSL_BRIDGE y.y.y.z 443 -persistenceType SSLSESSION -cltTimeout 180
add lb monitor mon_adfs_proxy_http HTTP -respCode 200 -httpRequest "GET /adfs/probe" -LRTM ENABLED -destPort 80

bind serviceGroup svcgrp_adfsproxy srv_adfsproxy1 443
bind serviceGroup svcgrp_adfsproxy srv_adfsproxy2 443
bind serviceGroup svcgrp_adfsproxy -monitorName mon_adfs_proxy_http
bind lb vserver lb_vsrv_adfsproxy svcgrp_adfsproxy
set ssl vserver lb_vsrv_adfsproxy -ssl3 DISABLED

I have implemented it on a NetScaler 12.1 with a Standard license. If you have feedback or questions, please leave a comment. :)

Supported Active Directory environments for Microsoft Exchange

It is time for some words of wisdom regarding Exchange and the supported Active Directory environments. It is the same as with the supported .NET Framework releases: Latest release does not automatically mean “supported”.

To be honest: I nearly nuked a customer environment with ~ 300 users yesterday by preparing the domain for the first Windows Server 2019 Domain Controller.

First things first: Everything is fine! I did not prepare the forest schema for Windows Server 2019.

The support for Windows Server 2008 R2 comes to an end and some customers are still running it. Like my customer yesterday. Some application servers are still on 2008 R2… and the Domain Controllers. The customer is also running Exchange 2013 on Windows Server 2012 R2.

The customer has decided to go to Windows Server 2019 wherever possible. This includes file servers, application servers, and the Domain Controllers. One of the first steps was the deployment of Active Directory-Based Activation. The AD schema needs to be prepared for this, and I decided to prepare the schema for Windows Server 2019. I had already copied the adprep folder from the Server 2019 ISO and opened a PowerShell. And then I paused. Something felt odd. I wanted to take a look at the Exchange Server supportability matrix.

Exchange 2013 does NOT support Windows Server 2019 Domain Controllers! Uhh… that was unexpected.

Lessons learned

Always check the Exchange Server supportability matrix. Always! Regardless of whether it’s because of .NET Framework, Active Directory, Outlook clients etc. Just check it every time you plan to change something in your environment.

Especially in regard to Microsoft Exchange “newer” does not automatically mean “supported”. Most times the opposite is true.

Microsoft Exchange 2013/ 2016/ 2019 shows blank ECP & OWA after changes to SSL certificates

EDIT
This issue is described in KB2971270 and is fixed in Exchange 2013 CU6.

I published this blog post in July 2015 and it is still relevant. The feedback for this blog post was incredible, and I’m not joking when I say: I saved many admins’ weekends. ;) It has turned out that this error still occurs with Exchange 2016 and even 2019. Maybe not because of the same bug that was fixed with Exchange 2013 CU6, but for other reasons. The solution below still applies. Because of this, I have decided to re-publish this blog post with a modified title and this little preamble.

Feel free to leave a comment if this blog post worked for you. :)

I ran into this error a couple of times. After applying changes to SSL certificates (adding, replacing or deleting an SSL certificate) and rebooting the server, the event log is flooded with events from source “HttpEvent” with event ID 15021. The message says:

An error occurred while using SSL configuration for endpoint 0.0.0.0:444. The error status code is contained within the returned data.

If you try to access the Exchange Control Panel (ECP) or Outlook Web Access (OWA), you will get a blank website. To solve this issue, open up an elevated command prompt on your Exchange 2013 server.

C:\windows\system32>netsh http show sslcert

SSL Certificate bindings:
-------------------------

    IP:port                      : 0.0.0.0:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:444
    Certificate Hash             : a80c9de605a1525cd252c250495b459f06ed2ec1
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:8172
    Certificate Hash             : 09093ca95154929df92f1bee395b2670a1036a06
    Application ID               : {00000000-0000-0000-0000-000000000000}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 127.0.0.1:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

Check the certificate hash and application ID for 0.0.0.0:443, 0.0.0.0:444 and 127.0.0.1:443. You will notice that the application ID for these three entries is the same, but the certificate hash for 0.0.0.0:444 differs from the other two entries. And that’s the point. Remove the certificate binding for 0.0.0.0:444.

C:\windows\system32>netsh http delete sslcert ipport=0.0.0.0:444

SSL Certificate successfully deleted

Now add it again with the correct certificate hash and application ID.

C:\windows\system32>netsh http add sslcert ipport=0.0.0.0:444 certhash=1ec7413b4fb1782b4b40868d967161d29154fd7f appid="{4dc3e181-e14b-4a21-b022-59fc669b0914}"

SSL Certificate successfully added

That’s it. Reboot the Exchange server and everything should be up and running again.
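The comparison of the bindings can also be scripted. The following Python sketch parses the text printed by `netsh http show sslcert` and reports bindings whose certificate hash differs from the hash used by most other bindings of the same application ID. The function names are made up for this example, and the "majority wins" rule is an assumption: with only two bindings per application ID it cannot tell which one is wrong, so check the result by hand.

```python
from collections import Counter, defaultdict


def parse_sslcert_bindings(netsh_output: str) -> list:
    """Parse the output of 'netsh http show sslcert' into a list of dicts."""
    bindings = []
    for line in netsh_output.splitlines():
        key, separator, value = line.partition(" : ")
        if not separator:
            continue  # headings, rulers and blank lines
        key, value = key.strip(), value.strip()
        if key == "IP:port":
            bindings.append({})  # a new binding block starts here
        if bindings:
            bindings[-1][key] = value
    return bindings


def find_mismatched_bindings(bindings: list) -> list:
    """Return the IP:port of every binding whose certificate hash differs
    from the most common hash of its application ID (assumed heuristic)."""
    by_app = defaultdict(list)
    for binding in bindings:
        by_app[binding["Application ID"]].append(binding)
    mismatched = []
    for group in by_app.values():
        counts = Counter(b["Certificate Hash"] for b in group)
        majority_hash, _ = counts.most_common(1)[0]
        mismatched += [b["IP:port"] for b in group
                       if b["Certificate Hash"] != majority_hash]
    return mismatched
```

Fed with the output shown above, `find_mismatched_bindings` reports only 0.0.0.0:444, which is the binding that gets deleted and re-added.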

Windows NPS – Authentication failed with error code 16

This posting is ~5 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

Today, a customer called me and reported what was, at first sight, a pretty weird error: Only Windows clients were unable to log in to a WPA2-Enterprise wireless network. The setup itself was pretty simple: Cisco Meraki WiFi access points, a Windows Network Policy Server (NPS) on a Windows Server 2016 Domain Controller, and a Sophos SG 125 acting as DHCP server for different WiFi networks.

Pixybay / pixabay.com/ Pixabay License

Windows clients failed to authenticate, but Apple iOS, Android, and even Windows 10 Tablets had no problem.

The following error was logged in the Windows Security event log.

Authentication Details:
Connection Request Policy Name: Use Windows authentication for all users
Network Policy Name: Wireless Users
Authentication Provider: Windows
Authentication Server: domaincontroller.domain.tld
Authentication Type: PEAP
EAP Type: -
Account Session Identifier: -
Logging Results: Accounting information was written to the local log file.
Reason Code: 16
Reason: Authentication failed due to a user credentials mismatch. Either the user name provided does not map to an existing user account or the password was incorrect.

The credentials were definitely correct; the customer and I tried different user and password combinations.

I also checked the NPS network policy. When choosing PEAP as authentication type, the NPS needs a valid server certificate. This is necessary, because the EAP session is protected by a TLS tunnel. A valid certificate was in place, in this case a wildcard certificate. A second certificate was also in place: a certificate for the domain controller issued by the internal enterprise CA.

It was an educated guess, but I disabled the server certificate check for the WPA2-Enterprise connection, and the client was able to log in to the WiFi. This clearly showed that the certificate was the problem. But it was valid, all necessary CA certificates were in place, and there was no obvious reason why the certificate should be the cause.

The customer told me that they had installed updates on Friday (today is Monday), and that a reboot of the domain controller had been issued. This also restarted the NPS service, and after this restart, the wildcard certificate was used for client connections.

I switched to the domain controller certificate, restarted the NPS, and all Windows clients were again able to connect to the WiFi.

Lessons learned

Try to avoid wildcard certificates, or at least check the certificate that is used by the NPS if you get an authentication error with reason code 16.

Help Vembu and win a gift card!

Vembu Technologies was founded in 2002, and with 60,000 customers and more than 4,000 partners, Vembu is a leading provider with a comprehensive portfolio of software products and cloud services for small and medium businesses.

Backup is important. There is no reason to have no backup. According to an infographic published by Clutch Research for World Backup Day 2017, 60% of all SMBs that lost all their data will shut down within 6 months after the data loss. Pretty bad, isn’t it?

When I talk to SMB customers, most of them complain about the costs of backups. You need software, you need the hardware, and depending on the type of used hardware, you need media. And you should have a second copy of your data. In my opinion, tape is dead for SMB customers. HPE for example, offers pretty smart disk-based backup solutions, like the HPE StoreOnce.

Vembu is giving away Amazon gift cards through a lucky draw for readers who take part in a short survey.

Vembu Technologies/ Vembu BDR/ Copyright by Vembu Technologies

Vembu BDR Suite provides a 30-day free trial with no restrictions. This gives you the chance to intensively test Vembu BDR Suite prior to purchase.

The free edition lets you choose between unlimited VMs that are covered with limited functionality, or unlimited functionality for up to 3 VMs. Check out this comparison of free, standard and enterprise edition.

Client-specific message size limits – or the reason why iOS won’t send emails

Last week, a customer complained that he could not send emails with pictures using the native iOS email app. He attached three, four or five pictures to an email, pushed the send button, and an error was displayed instantly.

We checked the different connectors as well as the organizational limit for messages. The test mails were between 10 and 20 MB, and the message size limit was much higher.

geralt / pixabay.com/ Creative Commons CC0

The cross-check with Outlook Web Access indicated, that the issue was not a configured limit on one of the Exchange connectors. Instead, a quick search directed us towards the client-specific message size limits. Especially this statement caught our attention:

For any message size limit, you need to set a value that’s larger than the actual size you want enforced. This accounts for the Base64 encoding of attachments and other binary data. Base64 encoding increases the size of the message by approximately 33%, so the value you specify should be approximately 33% larger than the actual message size you want enforced. For example, if you specify a maximum message size value of 64 MB, you can expect a realistic maximum message size of approximately 48 MB.

The message size limit for ActiveSync is 10 MB (Source). This is a server limit which can’t be configured using the Exchange Admin Center. Taking the 33% Base64 overhead into account, the effective message size limit is roughly 7.5 MB. My customer and I were able to prove this assumption: a 10 MB mail got stuck in the outbox, a 6 MB mail was sent.

How to change client-specific message size limits?

In this case, my customer and I only changed the ActiveSync limit. You can use the commands below to change the limit. This will raise the limit to ~67 MB. After subtracting the Base64 overhead, these values allow message sizes of up to 50 MB. You have to run these commands from an administrative CMD.

%windir%\system32\inetsrv\appcmd.exe set config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering /requestLimits.maxAllowedContentLength:69730304
%windir%\system32\inetsrv\appcmd.exe set config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.web/httpRuntime /maxRequestLength:68096
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering /requestLimits.maxAllowedContentLength:69730304
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:system.web/httpRuntime /maxRequestLength:68096
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:appSettings /[key='MaxDocumentDataSize'].value:69730304

Make sure that you restart the IIS after the changes. Run iisreset from an administrative CMD.
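The numbers used in the commands can be derived with a bit of arithmetic. Here is a short Python sketch with made-up helper names. It only models the 3-to-4 expansion of Base64 and ignores MIME headers and line breaks, so real messages will hit the limit slightly earlier. Note that maxAllowedContentLength is given in bytes, while maxRequestLength is given in KB.

```python
MIB = 1024 * 1024


def kb_to_bytes(kilobytes: int) -> int:
    """Convert a maxRequestLength value (KB) into bytes."""
    return kilobytes * 1024


def usable_payload_bytes(configured_limit_bytes: int) -> int:
    """Largest raw payload that still fits into the configured limit
    once Base64 has turned every 3 bytes into 4 characters."""
    return configured_limit_bytes * 3 // 4


def required_limit_bytes(payload_bytes: int) -> int:
    """Smallest limit that accepts a Base64-encoded payload of this size.
    Base64 pads the output to whole groups of 4 characters."""
    groups = -(-payload_bytes // 3)  # ceiling division
    return groups * 4


if __name__ == "__main__":
    limit = kb_to_bytes(68096)  # the maxRequestLength used above
    print(limit, "bytes =", limit / MIB, "MiB configured")
    print(usable_payload_bytes(limit) / MIB, "MiB usable payload")
```

The 68096 KB from the commands are exactly the 69730304 bytes used for maxAllowedContentLength, and three quarters of that is a little under 50 MB, which matches Microsoft’s 64 MB to 48 MB example.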

Please note that you have to run these commands again after installing an Exchange Server Cumulative Update (CU), because the files in which the changes are made will be overwritten by the CU. This statement is from Microsoft:

Any customized Exchange or Internet Information Server (IIS) settings that you made in Exchange XML application configuration files on the Exchange server (for example, web.config files or the EdgeTransport.exe.config file) will be overwritten when you install an Exchange CU. Be sure save this information so you can easily re-apply the settings after the install. After you install the Exchange CU, you need to re-configure these settings.

The maximum size for a message sent by Exchange Web Services clients is 64 MB, which is much more than the 10 MB for ActiveSync. This might explain why customers who use the Outlook for iOS app might not notice this issue.

EDIT: Today I found a blog post written by Frank Zöchling in June 2018, which addresses this topic.

Veeam Backup & Replication: Backup of Microsoft Active Directory Domain Controller VMs

To back up a virtual machine, Veeam Backup & Replication needs two permissions:

  • permission to access and backup the VM, as well as the
  • permission to do specific tasks inside the VM

to guarantee a consistent backup. The former permission is granted by the user account that is used to access the VMware vCenter server (sorry for the VMware focus at this point). Usually, this account has the Administrator role granted at the vCenter Server level. The latter permission is granted by a user account that has permissions inside the guest operating system.

geralt / pixabay.com/ Creative Commons CC0

Something I often see in customer environments is the usage of the Domain Administrator account. But why? Because everything works when this account is used!

There are two reasons for this:

  • This account is part of the local Administrator group on every server and client
  • customers tend to grant the Administrator role to the Domain Admins group on vCenter Server level

In simple words: Many customers use the same account to connect to the vCenter and for the application-aware processing of Veeam Backup & Replication. At least for Windows server backups.

Houston, we have a problem!

Everything is fine until customers have to secure their environments. One of the very first things customers do, is to protect the Administrator account. And at this point, things might go wrong.

Using a service account to connect to the vCenter server is easy. This can be any account from the Active Directory, or from the embedded VMware SSO domain. I tend to create a dedicated AD-based service account. For the necessary permissions in the vCenter, you can grant this account Administrator permissions, or you can create a new user role in the vCenter. Veeam offers a PDF document which documents the necessary permissions for the different Veeam tasks.

The next challenge is the application-aware processing. For Microsoft SQL Server, the user account must have sysadmin privileges on the Microsoft SQL Server. For Microsoft Exchange, the user must be a member of the local Administrator group. But in the case of an Active Directory Domain Controller, things get complicated.

A Domain Controller does not have a local user database (SAM). So what user account or group membership is needed to backup a domain controller using application-aware processing?

This statement is from a great Veeam blog post:

Permissions: Administrative rights for target Active Directory. Account of an enterprise administrator or domain administrator.

So the service account used to back up a domain controller is one of the most powerful accounts in the Active Directory.

There is no other way. You need a Domain or Enterprise Administrator account. I tend to create a dedicated account for this task.

I recommend creating one service account that is used to connect to the vCenter and that is added to the local Administrator group on the servers to back up, and a dedicated Domain/Enterprise Administrator account to back up the virtual Domain Controllers.

The advantage is that I can apply different fine-grained password policies to these accounts. Sure, you can add more security by creating more accounts for different servers and applications, adding a dedicated role to the vCenter for Veeam etc. But this approach is easy enough to implement, and adds a significant amount of user account security to every environment that is still using DOMAIN\Administrator to back up their VMs.

Out-of-Office replies are dropped due to empty MAIL FROM

Today I had an interesting support call. A customer noticed that Out-of-Office replies were not received by recipients, even though the OoO option was enabled for internal and external recipients. Internal recipients got the OoO reply, but none of the external recipients did.

cattu/ pixabay.com/ Creative Commons CC0

The Message Tracking Log is a good point to start. I quickly discovered that the Exchange server was unable to send the OoO mails. You can use the eventid FAIL to get a list of all failed messages.

Very interesting was the RecipientStatus of a failed mail.

RecipientStatus         : {[{LED=550 Requested action not taken: mailbox unavailable};{MSG=};{FQDN=mailrelay-out.xxxx.de};{IP=213.xxx.xxx.xxx};{LRT=20.12.2018 10:22:39}]}

550 Requested action not taken: mailbox unavailable  is a pretty interesting error when sending mails over a mail relay of your ISP. Especially when other mails were successfully sent over the same mail relay.

Next stop: Protocol log of the send connector

I enabled the logging on the send connector using the EAC. This option is disabled by default. Depending on the amount of mails sent over the connector, you should make sure to disable the logging after your troubleshooting session. To enable the logging, follow these steps:

  • Open the EAC and navigate to Mail flow > Send connectors
  • Select the connector you want to configure, and then click Edit
  • On the General tab in the Protocol logging level section, select the Verbose option
  • When you’re finished, click Save

The protocol log can be found under %ExchangeInstallPath%TransportRoles\Logs\Hub\ProtocolLog\SmtpSend.

After enabling the logging and another test mail, the log contained the necessary details to find the root cause. This is the interesting part of the SMTP communication:

2018-12-20T10:22:39.313Z,Relay,08D640AAC0AD8811,3,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,220 mailrelay-out.xxx.de ESMTP Postfix (Debian/GNU),
2018-12-20T10:22:39.313Z,Relay,08D640AAC0AD8811,4,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,EHLO mail.domain.local,
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,5,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,250  mailrelay-out2.xxx.de SIZE 52428800 8BITMIME OK,
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,6,192.168.0.212:49986,213.xxx.xxx.xxx:25,*,,sending message with RecordId 22471268892695 and InternetMessageId <b9613be791c141e3b76828228bd6cdb3@exchange.domain.local>
2018-12-20T10:22:39.330Z,Relay,08D640AAC0AD8811,7,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,MAIL FROM:<> SIZE=4758,
2018-12-20T10:22:39.331Z,Relay,08D640AAC0AD8811,8,192.168.0.212:49986,213.xxx.xxx.xxx:25,<,550 Requested action not taken: mailbox unavailable,
2018-12-20T10:22:39.332Z,Relay,08D640AAC0AD8811,9,192.168.0.212:49986,213.xxx.xxx.xxx:25,>,QUIT,

The error occurred right after the Exchange server issued MAIL FROM:<>. But why is the MAIL FROM empty?
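On a busy connector it is tedious to find such sessions by eye. The following Python sketch scans SmtpSend protocol log lines for sessions in which the remote server answered an empty MAIL FROM with a 5xx reply. It assumes the column order seen in the excerpt above (timestamp, connector, session ID, sequence number, local endpoint, remote endpoint, event, data, context); the function name is hypothetical.

```python
import csv


def rejected_null_senders(log_lines):
    """Return (session_id, server_reply) for every session in which the
    remote server answered 'MAIL FROM:<>' with a 5xx reply."""
    pending = set()  # sessions whose last sent command was MAIL FROM:<>
    rejected = []
    for row in csv.reader(log_lines):
        if len(row) < 8 or row[0].startswith("#"):
            continue  # skip header comments and incomplete lines
        session_id, event, data = row[2], row[6], row[7]
        if event == ">":  # command sent by the Exchange server
            if data.upper().startswith("MAIL FROM:<>"):
                pending.add(session_id)
            else:
                pending.discard(session_id)
        elif event == "<" and session_id in pending:  # reply from the relay
            pending.discard(session_id)
            if data.startswith("5"):
                rejected.append((session_id, data))
    return rejected
```

The function accepts any iterable of lines, for example an opened log file from the SmtpSend directory mentioned above.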

RFC 2298 is the key

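An Out-of-Office reply is an automatically generated notification, and Exchange sends it with an empty envelope sender, just like a Message Disposition Notification (MDN). For MDNs, RFC 2298 clearly states: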

The envelope sender address (i.e., SMTP MAIL FROM) of the MDN MUST be
null (<>), specifying that no Delivery Status Notification messages
or other messages indicating successful or unsuccessful delivery are
to be sent in response to an MDN.

So the empty MAIL FROM is something that a mail relay should expect. In the case of my customer, the mail relay seems to act differently. Maybe some kind of spam protection.

Database Availability Group (DAG) witness is in a failed state

As part of a maintenance job I had to update a 2-node Exchange Database Availability Group and a file-share witness server.

After the installation of Windows updates on the witness server and the obligatory reboot, the witness was left in a failed state.

[PS] C:\Windows\system32>Get-DatabaseAvailabilityGroup -Identity DAG1 -Status | fl *wit*
WARNING: Database availability group ‘DAG1’ witness is in a failed state. The database
availability group requires the witness server to maintain quorum. Please use the
Set-DatabaseAvailabilityGroup cmdlet to re-create the witness server and the directory.

WitnessServer : fsw.domain.local
WitnessDirectory : C:\DAGFileShareWitnesses\DAG1.domain.local
AlternateWitnessServer :
AlternateWitnessDirectory :
WitnessShareInUse : InvalidConfiguration
DxStoreWitnessServers :

In my opinion, re-creating the witness server and the witness directory cannot be the correct way to solve this. There must be another way. In addition, the server was not dead; it had only been rebooted.

Check the basics

Both DAG nodes were online and working. A good starting point is to check the cluster resources using PowerShell.

In my case, the cluster resource for the File Share Witness was in a failed state. A simple Start-ClusterResource solved my issue immediately.

[PS] C:\Windows\system32>Get-ClusterResource

Name                                              State                                             OwnerGroup                                        ResourceType
----                                              -----                                             ----------                                        ------------
File Share Witness (\\fsw.domain.local            Failed                                            Cluster Group                                     File Share Witness


[PS] C:\Windows\system32>Get-ClusterResource | Start-ClusterResource

Name                                              State                                             OwnerGroup                                        ResourceType
----                                              -----                                             ----------                                        ------------
File Share Witness (\\fsw.domain.local            Online                                            Cluster Group                                     File Share Witness

It seems that the cluster had marked the file share witness as unreliable, so the resource was not started again after the file share witness came back online. I was able to bring it back online manually by running Start-ClusterResource on one of the DAG members.

Powering on a VM with shared VMDK fails after extending an EagerZeroedThick VMDK

This posting is ~5 years old. You should keep this in mind. IT is a short-lived business. This information might be outdated.

I hope that you are not reading this blog post while searching for a solution for a failed cluster. If so, feel free to leave a comment if this blog post saved your evening or weekend. :)

Last Friday, a change at one of my customers went horribly wrong. I was not onsite, but they contacted me during the night from Friday to Saturday, because their most important Windows Server Failover Cluster was unable to start after extending a shared VMDK.

cripi/ pixabay.com/ Creative Commons CC0

They tried something pretty simple: extending a virtual disk of a VM. That is something most of us do pretty often, and so did the customer. It was a well-known task… except for the fact that the VM was part of a Windows Server Failover Cluster. With shared VMDKs. And the disks were EagerZeroedThick, because this is a requirement for shared VMDKs.

They extended the disk using the vSphere Web Client. At this point, the change was doomed to fail. They tried to power on the VMs, but all they got was this error:

VMware ESX cannot open the virtual disk, “/vmfs/volumes/4c549ecd-66066010-e610-002354a2261b/VMNAME/VMDKNAME.vmdk” for clustering. Please verify that the virtual disk was created using the ‘thick’ option.

A shared VMDK is a VMDK in multiwriter mode. This VMDK has to be created as Thick Provision Eager Zeroed. If you wish to extend this VMDK, you must use vmkfstools with the option -d eagerzeroedthick. If you extend the VMDK using the Web Client, the extended portion of the disk will become LazyZeroed!
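The extend operation itself could look like the following minimal sketch for the ESXi shell. The wrapper function extend_shared_vmdk and the DRY_RUN variable are made up for illustration, and the path in the usage example is a placeholder. The sketch relies on vmkfstools -X, which expects the new total size of the disk (not the increment), and it assumes that all VMs sharing the disk are powered off.

```shell
# Hypothetical wrapper (not a VMware tool): extend a shared VMDK and keep
# the added space eager zeroed. By default it only prints the command;
# set DRY_RUN=0 to actually run vmkfstools on the ESXi host.
extend_shared_vmdk() {
  size="$1"   # new TOTAL size of the disk, e.g. 40G
  vmdk="$2"   # path to the .vmdk descriptor file
  if [ "${DRY_RUN:-1}" = "1" ]; then
    echo "vmkfstools -X $size -d eagerzeroedthick $vmdk"
  else
    vmkfstools -X "$size" -d eagerzeroedthick "$vmdk"
  fi
}
```

For example, extend_shared_vmdk 40G /vmfs/volumes/datastore1/VMNAME/VMDKNAME.vmdk prints the command for review. I would check the printed command and the backup state of the VM before running it with DRY_RUN=0.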

VMware has described this behaviour in KB1033570 (Powering on the virtual machine fails with the error: Thin/TBZ disks cannot be opened in multiwriter mode). There is also a blog post by Cormac Hogan at VMware that describes this behaviour.

That’s a screenshot from the failed cluster. Check out the type of the disk (Thick-Provision Lazy-Zeroed).

Patrick Terlisten/ vcloudnine.de/ Creative Commons CC0

You must use vmkfstools to extend a shared VMDK, but vmkfstools is also the solution if you have already fallen into this trap. Clone the VMDK with the option -d eagerzeroedthick.

vmkfstools -i old.vmdk new.vmdk -d eagerzeroedthick

Another solution, which was new to me, is to use Storage vMotion. You can migrate the “broken” VMDK to another datastore and change the disk format during the Storage vMotion. This solution is described in the “Notes” section of KB1033570.

Both ways will fix the problem. The result will be a Thick Provision Eager Zeroed VMDK, which will allow the VMs to be successfully powered on.