Tag Archives: troubleshooting

Why you should change your KRBTGT password prior disabling RC4

While chilling on my couch, I stumbled over this pretty interesting Reddit thread: Story Time – How I blew up my company’s AD for 24 hours and fixed it : sysadmin (reddit.com)

Long story short: A poor guy applied some STIG hardening and his Active Directory blew up. Root cause was disabling RC4, which caused Kerberos failures, primarily documented by errors like “The encryption type requested is not supported by the KDC.” The guy fixed it by shutdown all domain controllers, changing the KRBTGT account password on one domain controller, and finally, everything came back

So why blew everything up after disabling RC4? Let’s travel back in time. Microsoft released Active Directory with the release of Windows 2000. At this time, Active Directory supported DES and RC4 to encrypt Kerberos tickets. With RFC 6649 (Deprecate DES, RC4-HMAC-EXP, and Other Weak Cryptographic Algorithms in Kerberos), DES was retired in July 2012, but Microsoft disabled DES with the release of Windows Server 2008 R2 and Windows 7. Disabling DES was no big deal, because Active Directory was designed to select the highest supported cipher for encrypting the Kerberos tickets. In addition to this, support for AES was added with Windows Server 2008.

If you deploy a new domain controllers, with a higher Windows OS version, and you remove all older domain controllers, you will be able to raise the Domain, and the Forest Functional Level (DFL/ FFL). The functional levels determine the available domain or forest capabilities. With the DFL of Windows 2008, Microsoft added AES support. But only if you raise the DFL vom Windows 2003 to 2008, or any higher DFL, the KRBTGT password will be changed to get it stored AES encrypted.

Let me make this clear: This only happens when raising the DFL vom 2003 to 2008 or any higher version. Not if you go from 2008 to 2012 R2, from 2012 R2 to 2016 etc. Only from 2003 to 2008 or higher. This should be done automatically and you can verify this by checking the PasswordLastSet of the user account:

PS C:\windows\system32> Get-ADUser "krbtgt" -Property Created, PasswordLastSet


Created           : 6/22/2005 2:48:12 PM
DistinguishedName : CN=krbtgt,CN=Users,DC=mlnetwork,DC=local
Enabled           : False
GivenName         :
Name              : krbtgt
ObjectClass       : user
ObjectGUID        : dfc8490d-374f-4570-944e-d5fa41d601ab
PasswordLastSet   : 3/3/2015 8:29:08 AM
SamAccountName    : krbtgt
SID               : S-1-5-21-3103332001-754687911-2831376874-502
Surname           :
UserPrincipalName :

As you can see, the user account was created on the 22. June 2005, which is creation date of this Active Directory domain. The PasswordLastSet dates back to the 3rd March 2015, possibly the date where the DFL was raised to 2008 or above. If you don’t see a suitable date in the PasswordLastSet attribute, you should change the KRBTGT password! It looks like that there are some domains out there where the KRBTGT password wasn’t changed during the DFL raise.

Domain Controller with Windows 2008 and later will always use AES for the Ticket Granting Ticket (TGT). Once your domain functional level (DFL) is 2008 or higher, the KRBTGT account will always default to AES. For any other accounts (user and computer) the selected encryption type is determined by the msDS-supportedEncryptionTypes attribute of the account.

And let me get this pretty clear: As long as you are running Windows Server 2000, 2003, or Windows XP, you can’t disable RC4, because these operating systems simply doesn’t support AES (Source)!

So prior disable RC4, or do any hardening regarding Kerberos, make sure that you change the KRBTGT password. And change it twice. Or use the KRBTGT Reset script. You should also confirm, that your TGTs are encrypted with AES. You can check this with klist tgt.  If the TGTs are still being issued with RC4, you should check the pwdLastSet attribute on the KRBTGT account.

What should I do?

To be honest: You should change the KRBTGT password regularly, e.g. every 180 days. This is possible sind Server 2008 (and DFL/ FFL Windows 2008). This blog post of Quest gives you a pretty good summary of why and how. You might also want to take a look at this Microsoft website, if you want to know more about the KRBTGT account. You don’t need to document the password. Simply change it. If you want a more controlled way, you can use this script: Microsoft KRBTGT Reset script.

Upgrade to ESXi 7.0: Missing dependencies VIBs Error

This error gets me from time to time, regardless which server vendor, mostly on hosts that were upgraded a couple of times. In this case it was a ESXi host currently running a pretty old build of ESXi 6.7 U3 and my job was the upgrade to 7.0 Update 3c.

If you add a upgrade baseline to the cluster or host, and you try to remediate the host, the task fails with a dependency error. When taking a closer look into the taks details, you were getting told that the task has failed because of a failed dependency, but not which VIB it caused.

You can find the name if the causing VIB on the update manager tab of the host which you tried to update. The status of the baseline is “incompatible”, and not “non-compliant”.

To resolve this issue you have to remove the causing VIB. This is no big deal and can be done with esxcli. Enable SSH and open a SSH connection to the host. Then remove the VIB.

[[email protected]:~] esxcli software vib list | grep -i ssacli
ssacli                         4.17.6.0-6.7.0.7535516.hpe          HPE        PartnerSupported  2020-06-18
[[email protected]:~] esxcli software vib remove -n ssacli
Removal Result
   Message: The update completed successfully, but the system needs to be rebooted for the changes to be effective.
   Reboot Required: true
   VIBs Installed:
   VIBs Removed: HPE_bootbank_ssacli_4.17.6.0-6.7.0.7535516.hpe
   VIBs Skipped:
[[email protected]:~]

You need to reboot the host after the removal of the VIB. Then you can proceed with the update. The status of the upgrade baseline should be now “not-compliant”.

Outlook Web Access fails with “440 Login Timeout”

Today I faced an interesting problem. A customer told me that their Exchange 2010, which is currently part of a Exchange cross-forest migration project, has an issue with Outlook Web Access and the Exchange Control Panel. Both web sites fail with a white screen and a single message:

440 Login Timeout

I checked some basics, like certificate, configuration of the virtual directories and I found nothing suspicious. Most hints on the internet pointed towards problems with the IUSR_servername user, which is not used with IIS 7 and later. But authentication configuration and filesystem permissions were okay. Also the IIS end event logs were pretty unhelpful.

More interesting was the change date of the web.config! This file is part of the OWA web app and it’s typically stored under C:\Program Files\Microsoft\Exchange Server\V14\ClientAccess\Owa.

Long story short: I found this entry in the file and removed it.

<add name=”kerbauth” />

Looks like someone wanted to setup Kerberos auth for OWA, or did not reverse a change.

Escaping special characters in proxy auth passwords in vCenter

EDIT: It seems that his was fixed in vCenter 7.0 U3.

While debugging a vCener Lifecycle Manager, which was unable to download updates, I’ve stumbled over a weird behaviour, which is (IMHO) by design.

Some of you might use a proxy server. And some of you might use a proxy server which requires credentials. In my case, my customer uses a Sophos SG appliance as a web proxy server with authentication. The customer creaded a user with a complex password. But I was unable to get a working internet connection.

Image by Ed Webster from Pixabay 

I played a bit with curl on the bash of the vCenter. The proxy settings are stored under /etc/sysconfig/proxy. These settings are used to populate the http_proxy and https_proxy environment variable. It’s important to know, that the credentials stored in the /etc/sysconfig/proxy are encoded with the percent-encoding, also known as URL encoding. So someone with root access can grab credentials from these file.

But then I noticed something weird. I set the http_proxy variable manually with

http://username:password@proxy.domain.tld:8080

and I got this error:

-bash: !": event not found

Okay… there was a ! in the password and the BASH tried to execute the part behind the !. But it was part of the password, so I had to tell the BASH that it has to take this literally.

I escaped the ! in the password with a \. And to my surprise: The vCenter was able to download updates. I decoded the percent-encoded string in the /etc/sysconfig/poxy and found the escaped ! (\!). For example. Instead of Passw0rd! I had to enter Passw0rd\! in the password field.

Long story short: Use a password without special characters, otherwise escape them, because the password is stored in BASH variables.

Exchange Control Panel /ecp broken after certificate replacement

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

As part of an ongoing Exchange 2010 to 2016 migration, I had to replace the self-signed certificate with a certificate from the customers PKI. Everything went fine, the customer had a suitable template, we’ve added the necessary hostnames and bound IIS and SMTP to the certificate. The mess started with an iisreset /noforce

Bild von Oskars Zvejs auf Pixabay 

The iisreset took longer than expected. After that, I tried to login into the ECP, entered username and password and got an error.

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
  <Provider Name="MSExchange Front End HTTP Proxy" />
  <EventID Qualifiers="49152">1003</EventID>
  <Level>2</Level>
  <Task>1</Task>
  <Keywords>0x80000000000000</Keywords>
  <TimeCreated SystemTime="2020-10-22T12:16:38.934123400Z" />
  <EventRecordID>368718</EventRecordID>
  <Channel>Application</Channel>
  <Computer>server.domain.tld</Computer>
  <Security />
</System>
<EventData>
  <Data>Owa</Data>
  <Data>System.NullReferenceException: Object reference not set to an instance of an object. at 
Microsoft.Exchange.HttpProxy.FbaModule.ParseCadataCookies(HttpApplication httpApplication) at 
Microsoft.Exchange.HttpProxy.FbaModule.OnBeginRequestInternal(HttpApplication httpApplication) at 
Microsoft.Exchange.HttpProxy.ProxyModule.<>c__DisplayClass16_0.<OnBeginRequest>b__0() at 
Microsoft.Exchange.Common.IL.ILUtil.DoTryFilterCatch(Action tryDelegate, Func`2 filterDelegate, Action`1 catchDelegate)
</Data>
</EventData>
</Event>

Pretty strange. We switched back to the self-singned certificate, did an iisreset and everyting was fine again.So it was pretty obvious that the error was related to the certificate, or to be more clear, to the certificate template.

A short research confirmed this. The template was a modified v3 web server template from an Enterprise CA running Windows Server 2008 R2.

With Windows Server 2008, Microsoft introduced a new cryptographic API called Cryptography Next Generation (CNG), which separates cryptographic providers (algorithm implementation) from key storage providers (create, delete, export, import, open and store keys). The older CryptoAPI does not differ between this and implements cryptographic algorithms and key storage.

The modified template used CNG instead of CryptoAPI. We noticed this when we checked the certificate with certutil -store my <thumbprint>.

If the listed provider for the certificate is Microsoft Software Key Storage Provider, then you will have to re-import the certificate. If Microsoft RSA SChannel Cryptographic Provider is used, everything is fine.

You have to remove the certificate, then re-import it using

certutil -csp "Microsoft RSA SChannel Cryptographic Provider" -importpfx <CertificateFilename>

You need a PKCS#12 file (PFX) and the password. Re-import it and then you can use the certificate for Exchange. Bind services to it and restart the IIS.

Virtually reseated: Reset blade in a HPE C7000 enclosure

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

After a reboot, a VMware ESXi 6.7 U3 told me that he has no compatible NICs. Fun fact: Right before the reboot everything was fine.

The ILO also showed no NICs. Unfortunately, I wasn’t onsite to pull the blade server and put it back in. But there is a way to do this “virtually”.

You have to connect to the IP address of the Onboard Administrator via SSH. Then issue the reset server command with the bay of the server you want to reset and an argument.

OA1-C7000> reset server 13

WARNING: Resetting the server trips its E-Fuse. This causes all power to be momentarily removed from the server. This command should only be used when physical access to the server is unavailable, and the server must be removed and
reinserted.

Any disk operations on direct attached storage devices will be affected. I/O
will be interrupted on any direct attached I/O devices.

Entering anything other than 'YES' will result in the command not executing.

Do you want to continue ? yes

Successfully reset the E-Fuse for device bay 13.

The server will power up automatically. Please note, that the OBA is unable to display certain information right after this operation. It will take a couple of minutes until all information, like serial number or device bay name are visible again.

Update Manager fails with unknown error during host remediation

This posting is ~2 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

During an vSphere 6.5 > 6.7 update a was host failing continously at the remediation with an “unknown error”. The host was updated from ESXI 6.5 to 6.7 using an upgrade baseline. Other hosts were updated to 6.7 and with the latest patches without any issues. Something strange was going on…

The esxupdate.log and the vua.log on the host itself showed nothing special. So I checked the vmware-vum-server-log4cpp.log which was much more informative!

[2020-07-19 13:03:25:217 'SingleHostScanTask.SingleHostScanTask{262}' 139762329831168 ERROR] [singleHostScanTask, 693] caught an odbc error: "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_vci_scanresults"; Error while executing the query" is returned when executing SQL statement "INSERT INTO VCI_SCANRESULTS(endtime, id, scan_status, scan_type, starttime, target_component, target_uid) VALUES (?, ?, ?, ?, ?, ?, ?)"
[2020-07-19 13:03:25:219 'SingleHostScanTask.SingleHostScanTask{262}' 139762329831168 ERROR] [singleHostScanTask, 404] SingleHostScan caught exception: caught an odbc error: "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_vci_scanresults"; Error while executing the query" is returned when executing SQL statement "INSERT INTO VCI_SCANRESULTS(endtime, id, scan_status, scan_type, starttime, target_component, target_uid) VALUES (?, ?, ?, ?, ?, ?, ?)" with code: -1
[2020-07-19 13:03:25:223 'SingleHostScanTask.SingleHostScanTask{262}' 139762329831168 ERROR] [vciTaskBase, 568] Task execution has failed: caught an odbc error: "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_vci_scanresults"; Error while executing the query" is returned when executing SQL statement "INSERT INTO VCI_SCANRESULTS(endtime, id, scan_status, scan_type, starttime, target_component, target_uid) VALUES (?

Well… ERROR: duplicate key value violates unique constraint “pk_vci_scanresults” is not what I expected, but it is an error, and it occured everytime I tried to remediate the host.

Google found nothing about this error, so I decided to reset the VUM database. Please don’t try this at your customer! Log a call at VMware.

To reset the VUM database:

  1. Connect to vCenter Server Appliance via SSH
  2. Switch to the BASH 
  3. Stop the VMware Update Manager Service with this command

    service-control –stop vmware-updatemgr
     
  4. To reset the VMware Update Manager Database (applies only to VCSA 6.7 and 7.0!)

    /usr/lib/vmware-updatemgr/bin/updatemgr-utility.py reset-db
  1. Delete the contents of the VMware Update Manager Patch Store

    rm -rf /storage/updatemgr/patch-store/*
     
  2. Start the VMware Update Manager Service again

    service-control –start vmware-updatemgr

You will lose all your baselines, so you have to configure them again. And you need to download all patches again.

For vSAN environments this procedure will also remove the vSAN default baselines, but they will recreated automatically when there is a configuration change to vSAN or an update to the HCL DB. Again: Don’t do this at home!

vCenter Migration from 6.0 to 6.7 fails due to missing user role

This posting is ~3 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

Actually, yesterday should be the day at which I migrate one of the last physical Windows vCenter servers installed in my customer base. Actually… the migration failed twice. And each time I had to rollback, power-on the old physical server, reset the computer account etc.

The update was from VMware vCenter Server 6.0 Update 3d (7462484) on a Windows 2012 R2 server to vCenter Server 6.7 Update 3 (Appliance). The migration failed at 62% with the following message:

Traceback (most recent call last):
  File "/usr/lib/vmware-content-library/firstboot/content-library-firstboot.py", line 219, in Main
    vdc_fb.register_cis()
  File "/usr/lib/vmware-content-library/firstboot/content-library-firstboot.py", line 77, in register_cis
    self._reg_info.registerAll(self.get_soluser_id(), self.get_soluser_ownerId())
  File "/usr/lib/vmware-content-library/install_lib/cis_register.py", line 368, in registerAll
    self.registerUserAndService(user_name, user_id, service)
  File "/usr/lib/vmware-content-library/install_lib/cis_register.py", line 395, in registerUserAndService
    add_vmtx_privileges(self.vdc_cfg_dir)
  File "/usr/lib/vmware-content-library/install_lib/add_vmtx_privileges_after_fb.py", line 105, in add_vmtx_privileges
    log("Adding privileges [%s] to role %s" % (' '.join(VMTX_SYNC_PRIVILEGES), cls_admin_role.name))
AttributeError: 'NoneType' object has no attribute 'name'

I found the same error in the content-library-firstboot.py_9150_stderr.log file of the downloaded log bundle.

Okay, that’s a pretty long error message and I had no idea where I should start searching. But it seems related to the Content Library of the vCenter. And it looks like it is related to the privileges.

log("Adding privileges [%s] to role %s" % (' '.join(VMTX_SYNC_PRIVILEGES), cls_admin_role.name))

A forum post led me to the content library administrator role. The author had to deal with a failed migration (6.5 to 6.7), but his conten administrator role was missing. In my case, the role was existent.

Sorry for the german translation. As you can see, the role was existent… Obviously. I tried to add a new role with the name com.vmware.Content.Admin, as mentioned in the forum post, and… a new role appeared.

You might notice the “Beispiel” or “Example”. That’s the difference. Whatever the other role is or what its look like, it is definitely not the original content library administrator role.

And to make a long story short: The migration was successful after this small change.

Microsoft Exchange 2013/ 2016/ 2019 shows blank ECP & OWA after changes to SSL certificates

This posting is ~3 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.
EDIT
This issue is described in KB2971270 and is fixed in Exchange 2013 CU6.

I published this blog post in July 2015 and it is still relevant. The feedback for this blog post was incredible, and I’m not joking when I say: I saved many admins weekends. ;) It has shown, that this error still occurs with Exchange 2016 and even 2019. Maybe not because of the same, with Exchange 2013 CU6 fixed bug, but maybe for other reasons. And the solution below still applies to it. Because of this I have decided to re-publish this blog post with a modified title and this little preamble.

Feel free to leave a comment if this blog post worked for you. :)

I ran a couple of times in this error. After applying changes to SSL certificates (add, replace or delete a SSL certificate) and rebooting the server, the event log is flooded with events from source “HttpEvent” and event id 15021. The message says:

An error occurred while using SSL configuration for endpoint 0.0.0.0:444. The error status code is contained within the returned data.

If you try to access the Exchange Control Panel (ECP) or Outlook Web Access (OWA), you will get a blank website. To solve this issue, open up an elevated command prompt on your Exchange 2013 server.

C:\windows\system32>netsh http show sslcert

SSL Certificate bindings:
-------------------------

    IP:port                      : 0.0.0.0:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:444
    Certificate Hash             : a80c9de605a1525cd252c250495b459f06ed2ec1
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:8172
    Certificate Hash             : 09093ca95154929df92f1bee395b2670a1036a06
    Application ID               : {00000000-0000-0000-0000-000000000000}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 127.0.0.1:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

Check the certificate hash and appliaction ID for 0.0.0.0:443, 0.0.0.0:444 and 127.0.0.1:443. You will notice, that the application ID for this three entries is the same, but the certificate hash for 0.0.0.0:444 differs from the other two entries. And that’s the point. Remove the certificate for 0.0.0.0:444.

C:\windows\system32>netsh http delete sslcert ipport=0.0.0.0:444

SSL Certificate successfully deleted

Now add it again with the correct certificate hash and application ID.

C:\windows\system32>netsh http add sslcert ipport=0.0.0.0:444 certhash=1ec7413b4fb1782b4b40868d967161d29154fd7f appid="{4dc3e181-e14b-4a21-b022-59fc669b0914}"

SSL Certificate successfully added

That’s it. Reboot the Exchange server and everything should be up and running again.

NetScaler Gateway – Cannot complete your request

This posting is ~3 years years old. You should keep this in mind. IT is a short living business. This information might be outdated.

A customer reported a weird problem with his NetScaler Gateway. Upon the first load of the website, they got an error “Cannot complete your request”. After clicking OK the error disappeared and does not occured again after reloading the website. Only after closing and re-opening the browser. I got this message in Firefox and Internet Explorer, but not from a remote machine, e.g. my PC at the office.

I found no configuration error or something, that would have explained this message. Finally, I found something that caught my attention:

HTTP/1.1 412 Precondition Failed

I found this using the Firefox Web Development Tools (I only had a Firefox and IE on my remote machine). With this message I found CTX244520 which also explained this error. The issue is caused by a hidden feature for caching web site data of the Gateway vServer. If you don’t have Integrated Cache feature licensed or enable, this feature failes. It is called Static Page Caching.

My customer is currently running NS12.0 60.10, and this issue is fixed in 12.0 61.8. And the customer is using a custom theme, which is based on one of the included themes.

If possible you can enable Integrated Caching. If you can’t enable Integrated Caching, you can simple disable this feature:

   show aaa parameter
   Configured AAA parameters
           EnableStaticPageCaching: YES
           EnableEnhancedAuthFeedback: NO
           DefaultAuthType: LOCAL  MaxAAAUsers: 1000
           AAAD nat ip: None
           EnableSessionStickiness : NO
           aaaSessionLoglevel : INFORMATIONAL
           AAAD Log Level : INFORMATIONAL
           Dynamic address: OFF
           GUI mode: ON
           Max Saml Deflate Size: 1024
    Done
   set aaa parameter -enableStaticPageCaching NO
    Done