Tag Archives: troubleshooting

Exchange Control Panel /ecp broken after certificate replacement

As part of an ongoing Exchange 2010 to 2016 migration, I had to replace the self-signed certificate with a certificate from the customer's PKI. Everything went fine: the customer had a suitable template, we added the necessary hostnames and bound IIS and SMTP to the certificate. The mess started with an iisreset /noforce.


The iisreset took longer than expected. After that, I tried to log in to the ECP, entered username and password, and got an error.

<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
<System>
  <Provider Name="MSExchange Front End HTTP Proxy" />
  <EventID Qualifiers="49152">1003</EventID>
  <Level>2</Level>
  <Task>1</Task>
  <Keywords>0x80000000000000</Keywords>
  <TimeCreated SystemTime="2020-10-22T12:16:38.934123400Z" />
  <EventRecordID>368718</EventRecordID>
  <Channel>Application</Channel>
  <Computer>server.domain.tld</Computer>
  <Security />
</System>
<EventData>
  <Data>Owa</Data>
  <Data>System.NullReferenceException: Object reference not set to an instance of an object. at 
Microsoft.Exchange.HttpProxy.FbaModule.ParseCadataCookies(HttpApplication httpApplication) at 
Microsoft.Exchange.HttpProxy.FbaModule.OnBeginRequestInternal(HttpApplication httpApplication) at 
Microsoft.Exchange.HttpProxy.ProxyModule.<>c__DisplayClass16_0.<OnBeginRequest>b__0() at 
Microsoft.Exchange.Common.IL.ILUtil.DoTryFilterCatch(Action tryDelegate, Func`2 filterDelegate, Action`1 catchDelegate)
</Data>
</EventData>
</Event>

Pretty strange. We switched back to the self-signed certificate, did an iisreset, and everything was fine again. So it was pretty obvious that the error was related to the certificate, or to be more precise, to the certificate template.

A bit of research confirmed this. The template was a modified v3 web server template from an Enterprise CA running Windows Server 2008 R2.

With Windows Server 2008, Microsoft introduced a new cryptographic API called Cryptography Next Generation (CNG), which separates cryptographic providers (algorithm implementations) from key storage providers (creating, deleting, exporting, importing, opening and storing keys). The older CryptoAPI makes no such distinction and implements both cryptographic algorithms and key storage.
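As a side note: if you want to see the difference on your own system, certutil can enumerate the installed providers; on current Windows versions this list includes the CNG key storage providers.

certutil -csplist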

The modified template used CNG instead of CryptoAPI. We noticed this when we checked the certificate with certutil -store my <thumbprint>.

If the listed provider for the certificate is Microsoft Software Key Storage Provider, then you will have to re-import the certificate. If Microsoft RSA SChannel Cryptographic Provider is used, everything is fine.
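A quick way to spot the provider is to filter the certutil output for the provider line; replace <thumbprint> with the thumbprint of your certificate:

certutil -store my <thumbprint> | findstr /i "Provider"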

You have to remove the certificate, then re-import it using

certutil -csp "Microsoft RSA SChannel Cryptographic Provider" -importpfx <CertificateFilename>

You need a PKCS#12 file (PFX) and the password. Re-import it, and then you can use the certificate for Exchange. Bind the services to it and restart the IIS.
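Put together, the complete procedure could look like this sketch; the thumbprint and file name are placeholders, and Enable-ExchangeCertificate has to be run from the Exchange Management Shell:

certutil -delstore my <thumbprint>
certutil -csp "Microsoft RSA SChannel Cryptographic Provider" -importpfx <CertificateFilename>
Enable-ExchangeCertificate -Thumbprint <thumbprint> -Services IIS,SMTP
iisreset /noforce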

Virtually reseated: Reset a blade in an HPE C7000 enclosure

After a reboot, a VMware ESXi 6.7 U3 host told me that it had no compatible NICs. Fun fact: right before the reboot everything was fine.

The iLO also showed no NICs. Unfortunately, I wasn't onsite to pull the blade server and put it back in. But there is a way to do this "virtually".

You have to connect to the IP address of the Onboard Administrator via SSH. Then issue the reset server command with the bay number of the server you want to reset.
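If you are not sure in which bay the server sits, you can list all bays and their status first (the command below assumes a reasonably current OA firmware):

OA1-C7000> show server list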

OA1-C7000> reset server 13

WARNING: Resetting the server trips its E-Fuse. This causes all power to be momentarily removed from the server. This command should only be used when physical access to the server is unavailable, and the server must be removed and
reinserted.

Any disk operations on direct attached storage devices will be affected. I/O
will be interrupted on any direct attached I/O devices.

Entering anything other than 'YES' will result in the command not executing.

Do you want to continue ? yes

Successfully reset the E-Fuse for device bay 13.

The server will power up automatically. Please note that the OA is unable to display certain information right after this operation. It will take a couple of minutes until all information, like the serial number or device bay name, is visible again.

Update Manager fails with unknown error during host remediation

During a vSphere 6.5 > 6.7 update, a host was failing continuously at the remediation with an "unknown error". The host was upgraded from ESXi 6.5 to 6.7 using an upgrade baseline. Other hosts were updated to 6.7 and to the latest patches without any issues. Something strange was going on…

The esxupdate.log and the vua.log on the host itself showed nothing special. So I checked the vmware-vum-server-log4cpp.log, which was much more informative!

[2020-07-19 13:03:25:217 'SingleHostScanTask.SingleHostScanTask{262}' 139762329831168 ERROR] [singleHostScanTask, 693] caught an odbc error: "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_vci_scanresults"; Error while executing the query" is returned when executing SQL statement "INSERT INTO VCI_SCANRESULTS(endtime, id, scan_status, scan_type, starttime, target_component, target_uid) VALUES (?, ?, ?, ?, ?, ?, ?)"
[2020-07-19 13:03:25:219 'SingleHostScanTask.SingleHostScanTask{262}' 139762329831168 ERROR] [singleHostScanTask, 404] SingleHostScan caught exception: caught an odbc error: "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_vci_scanresults"; Error while executing the query" is returned when executing SQL statement "INSERT INTO VCI_SCANRESULTS(endtime, id, scan_status, scan_type, starttime, target_component, target_uid) VALUES (?, ?, ?, ?, ?, ?, ?)" with code: -1
[2020-07-19 13:03:25:223 'SingleHostScanTask.SingleHostScanTask{262}' 139762329831168 ERROR] [vciTaskBase, 568] Task execution has failed: caught an odbc error: "ODBC error: (23505) - ERROR: duplicate key value violates unique constraint "pk_vci_scanresults"; Error while executing the query" is returned when executing SQL statement "INSERT INTO VCI_SCANRESULTS(endtime, id, scan_status, scan_type, starttime, target_component, target_uid) VALUES (?

Well… ERROR: duplicate key value violates unique constraint "pk_vci_scanresults" is not what I expected, but it is an error, and it occurred every time I tried to remediate the host.

Google found nothing about this error, so I decided to reset the VUM database. Please don't try this at a customer's site! Log a call with VMware.

To reset the VUM database:

  1. Connect to the vCenter Server Appliance via SSH
  2. Switch to the BASH
  3. Stop the VMware Update Manager service with this command

    service-control --stop vmware-updatemgr
     
  4. Reset the VMware Update Manager database (applies only to VCSA 6.7 and 7.0!)

    /usr/lib/vmware-updatemgr/bin/updatemgr-utility.py reset-db

  5. Delete the contents of the VMware Update Manager patch store

    rm -rf /storage/updatemgr/patch-store/*
     
  6. Start the VMware Update Manager service again

    service-control --start vmware-updatemgr
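Afterwards you can check whether the service came back up; this is just an optional verification step:

    service-control --status vmware-updatemgr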

You will lose all your baselines, so you have to configure them again. And you need to download all patches again.

For vSAN environments this procedure will also remove the vSAN default baselines, but they will be recreated automatically when there is a configuration change to vSAN or an update to the HCL DB. Again: don't do this at home!

vCenter Migration from 6.0 to 6.7 fails due to missing user role

This posting is ~1 year old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Actually, yesterday should have been the day on which I migrated one of the last physical Windows vCenter servers in my customer base. Actually… the migration failed twice. And each time I had to roll back, power on the old physical server, reset the computer account, etc.

The update was from VMware vCenter Server 6.0 Update 3d (7462484) on a Windows 2012 R2 server to vCenter Server 6.7 Update 3 (Appliance). The migration failed at 62% with the following message:

Traceback (most recent call last):
  File "/usr/lib/vmware-content-library/firstboot/content-library-firstboot.py", line 219, in Main
    vdc_fb.register_cis()
  File "/usr/lib/vmware-content-library/firstboot/content-library-firstboot.py", line 77, in register_cis
    self._reg_info.registerAll(self.get_soluser_id(), self.get_soluser_ownerId())
  File "/usr/lib/vmware-content-library/install_lib/cis_register.py", line 368, in registerAll
    self.registerUserAndService(user_name, user_id, service)
  File "/usr/lib/vmware-content-library/install_lib/cis_register.py", line 395, in registerUserAndService
    add_vmtx_privileges(self.vdc_cfg_dir)
  File "/usr/lib/vmware-content-library/install_lib/add_vmtx_privileges_after_fb.py", line 105, in add_vmtx_privileges
    log("Adding privileges [%s] to role %s" % (' '.join(VMTX_SYNC_PRIVILEGES), cls_admin_role.name))
AttributeError: 'NoneType' object has no attribute 'name'

I found the same error in the content-library-firstboot.py_9150_stderr.log file of the downloaded log bundle.

Okay, that's a pretty long error message, and I had no idea where I should start searching. But it seemed to be related to the Content Library of the vCenter. And it looked like it was related to privileges.

log("Adding privileges [%s] to role %s" % (' '.join(VMTX_SYNC_PRIVILEGES), cls_admin_role.name))

A forum post led me to the content library administrator role. The author had to deal with a failed migration (6.5 to 6.7), but in his case the content library administrator role was missing. In my case, the role existed.

Sorry for the German screenshot. As you can see, the role existed… obviously. I tried to add a new role with the name com.vmware.Content.Admin, as mentioned in the forum post, and… a new role appeared.

You might notice the "Beispiel" ("Example" in German). That's the difference. Whatever the other role is or what it looks like, it is definitely not the original content library administrator role.

And to make a long story short: The migration was successful after this small change.
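For the record: I fixed this in the vSphere Client, but a PowerCLI equivalent would look like this sketch. Judging from the traceback above, the firstboot script adds the VMTX sync privileges to the role itself, so creating the empty role should be enough:

# Check whether the role exists (nothing is returned if it does not)
Get-VIRole -Name "com.vmware.Content.Admin" -ErrorAction SilentlyContinue

# Re-create the missing role
New-VIRole -Name "com.vmware.Content.Admin"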

Microsoft Exchange 2013/ 2016/ 2019 shows blank ECP & OWA after changes to SSL certificates

This posting is ~1 year old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.
EDIT
This issue is described in KB2971270 and is fixed in Exchange 2013 CU6.

I published this blog post in July 2015 and it is still relevant. The feedback for this blog post was incredible, and I'm not joking when I say: I saved many admins' weekends. ;) It has shown that this error still occurs with Exchange 2016 and even 2019. Maybe not because of the same bug that was fixed with Exchange 2013 CU6, but for other reasons. And the solution below still applies. Because of this, I have decided to re-publish this blog post with a modified title and this little preamble.

Feel free to leave a comment if this blog post worked for you. :)

I ran into this error a couple of times. After applying changes to SSL certificates (adding, replacing or deleting an SSL certificate) and rebooting the server, the event log is flooded with events from source "HttpEvent" and event ID 15021. The message says:

An error occurred while using SSL configuration for endpoint 0.0.0.0:444. The error status code is contained within the returned data.

If you try to access the Exchange Control Panel (ECP) or Outlook Web Access (OWA), you will get a blank website. To solve this issue, open up an elevated command prompt on your Exchange server.

C:\windows\system32>netsh http show sslcert

SSL Certificate bindings:
-------------------------

    IP:port                      : 0.0.0.0:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:444
    Certificate Hash             : a80c9de605a1525cd252c250495b459f06ed2ec1
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 0.0.0.0:8172
    Certificate Hash             : 09093ca95154929df92f1bee395b2670a1036a06
    Application ID               : {00000000-0000-0000-0000-000000000000}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

    IP:port                      : 127.0.0.1:443
    Certificate Hash             : 1ec7413b4fb1782b4b40868d967161d29154fd7f
    Application ID               : {4dc3e181-e14b-4a21-b022-59fc669b0914}
    Certificate Store Name       : MY
    Verify Client Certificate Revocation : Enabled
    Verify Revocation Using Cached Client Certificate Only : Disabled
    Usage Check                  : Enabled
    Revocation Freshness Time    : 0
    URL Retrieval Timeout        : 0
    Ctl Identifier               : (null)
    Ctl Store Name               : (null)
    DS Mapper Usage              : Disabled
    Negotiate Client Certificate : Disabled

Check the certificate hash and application ID for 0.0.0.0:443, 0.0.0.0:444 and 127.0.0.1:443. You will notice that the application ID is the same for these three entries, but the certificate hash for 0.0.0.0:444 differs from the other two. And that's the point. Remove the certificate binding for 0.0.0.0:444.

C:\windows\system32>netsh http delete sslcert ipport=0.0.0.0:444

SSL Certificate successfully deleted

Now add it again with the correct certificate hash and application ID.

C:\windows\system32>netsh http add sslcert ipport=0.0.0.0:444 certhash=1ec7413b4fb1782b4b40868d967161d29154fd7f appid="{4dc3e181-e14b-4a21-b022-59fc669b0914}"

SSL Certificate successfully added
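If you want to verify the new binding before rebooting, you can query just this endpoint again; an optional check:

C:\windows\system32>netsh http show sslcert ipport=0.0.0.0:444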

That’s it. Reboot the Exchange server and everything should be up and running again.

NetScaler Gateway – Cannot complete your request

This posting is ~1 year old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

A customer reported a weird problem with his NetScaler Gateway. Upon the first load of the website, they got the error "Cannot complete your request". After clicking OK the error disappeared and did not occur again when reloading the website, only after closing and re-opening the browser. I got this message in Firefox and Internet Explorer, but not from a remote machine, e.g. my PC at the office.

I found no configuration error or anything else that would have explained this message. Finally, I found something that caught my attention:

HTTP/1.1 412 Precondition Failed

I found this using the Firefox Web Developer Tools (I only had Firefox and IE on my remote machine). With this message I found CTX244520, which explains this error. The issue is caused by a hidden feature for caching website data on the Gateway vServer, called Static Page Caching. If you don't have the Integrated Caching feature licensed or enabled, this feature fails.

My customer is currently running NS12.0 60.10, and this issue is fixed in 12.0 61.8. And the customer is using a custom theme, which is based on one of the included themes.

If possible, you can enable Integrated Caching. If you can't enable Integrated Caching, you can simply disable Static Page Caching:

   show aaa parameter
   Configured AAA parameters
           EnableStaticPageCaching: YES
           EnableEnhancedAuthFeedback: NO
           DefaultAuthType: LOCAL  MaxAAAUsers: 1000
           AAAD nat ip: None
           EnableSessionStickiness : NO
           aaaSessionLoglevel : INFORMATIONAL
           AAAD Log Level : INFORMATIONAL
           Dynamic address: OFF
           GUI mode: ON
           Max Saml Deflate Size: 1024
    Done
   set aaa parameter -enableStaticPageCaching NO
    Done
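The change takes effect immediately. To make sure it survives a reboot, save the NetScaler configuration afterwards:

   save ns config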

Out of space – first steps when a datastore runs out of space

This posting is ~1 year old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

This is a situation that should never happen, and I had to deal with it only a couple of times in more than 10 years of working with VMware vSphere/ ESXi. In most cases, the reason for this was the usage of thin-provisioned disks together with small datastores. Yes, that's a bad design. Yes, this should never happen.

There is a nearly 100% chance that this setup will fail one day, either because someone dumps a lot of data into the VMs, or because of VM snapshots. But such a setup WILL FAIL one day.

Yesterday was one of these days, and five VMs stopped working on a small ESXi host at one of my customer's sites. A quick look into the vCenter confirmed my first assumption: the datastore was full. My second thought: why are there so many VMs on that small ESXi host, and why are they thin-provisioned?

The vCenter showed me the following message on each VM:

There is no more space for virtual disk $VMNAME.vmdk. You might be able to continue this session by freeing disk space on the relevant volume, and clicking Retry. Click Cancel to terminate this session.

Okay, what to do? First things first:

  1. Is there any unallocated space left on the RAID group? If yes, expand the VMFS.
  2. Are there any VM snapshots left? If yes, remove them.
  3. Configure 100% memory reservation for the VMs (see the PowerCLI sketch after this list). This removes the VM memory swap files and releases a decent amount of disk space.
  4. Remove ISO files from the datastore.
  5. Remove VMs (if you have a backup and they are not necessary for the business).
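Setting the memory reservation mentioned in step 3 is quick to do with PowerCLI. This is just a sketch, the VM name is a placeholder; note that the swap file is sized at power-on, so a running VM may need a power cycle before the space is actually released:

$vm = Get-VM -Name "VMNAME"
# Reserve all of the VM's memory, which reduces the size of the .vswp file to zero
$vm | Get-VMResourceConfiguration | Set-VMResourceConfiguration -MemReservationGB $vm.MemoryGB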

This should allow you to continue the operation of the VMs. To solve the problem permanently:

  1. Add disks to the server and expand the VMFS, or create a new datastore
  2. Add an NFS datastore
  3. Remove unnecessary VMs
  4. Set up working monitoring, set up alarms, do not overprovision datastores, or switch to eager-zeroed disks

Such an issue should not happen. It is not rude to say here: this is simply due to bad design and a lack of operational processes.

Windows NPS – Authentication failed with error code 16

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Today, a customer called me and reported, at first sight, a pretty weird error: only Windows clients were unable to log in to a WPA2-Enterprise wireless network. The setup itself was pretty simple: Cisco Meraki WiFi access points, a Windows Network Policy Server (NPS) on a Windows Server 2016 domain controller, and a Sophos SG 125 acting as DHCP server for the different WiFi networks.


Windows clients failed to authenticate, but Apple iOS, Android, and even Windows 10 Tablets had no problem.

The following error was logged to the Windows Security event log.

Authentication Details:
Connection Request Policy Name: Use Windows authentication for all users
Network Policy Name: Wireless Users
Authentication Provider: Windows
Authentication Server: domaincontroller.domain.tld
Authentication Type: PEAP
EAP Type: -
Account Session Identifier: -
Logging Results: Accounting information was written to the local log file.
Reason Code: 16
Reason: Authentication failed due to a user credentials mismatch. Either the user name provided does not map to an existing user account or the password was incorrect.

The credentials were definitely correct; the customer and I tried different user and password combinations.

I also checked the NPS network policy. When choosing PEAP as the authentication type, the NPS needs a valid server certificate. This is necessary because the EAP session is protected by a TLS tunnel. A valid certificate was in place, in this case a wildcard certificate. A second certificate was also in place: a certificate for the domain controller from the internal enterprise CA.

It was an educated guess, but I disabled the server certificate check for the WPA2-Enterprise connection, and the client was able to log in to the WiFi. This clearly showed that the certificate was the problem. But it was valid, all necessary CA certificates were in place, and there was no obvious reason why the certificate should be the cause.

The customer told me that they had installed updates on Friday (today is Monday), and a reboot of the domain controller was issued. This also restarted the NPS service, and with this restart, the wildcard certificate was picked up for client connections.

I switched to the domain controller certificate, restarted the NPS, and all Windows clients were again able to connect to the WiFi.

Lessons learned

Try to avoid wildcard certificates, or at least check the certificate that is used by the NPS if you get authentication errors with reason code 16.

Client-specific message size limits – or the reason why iOS won't send emails

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

Last week, a customer complained that he could not send emails with pictures from the native iOS email app. He attached three, four or five pictures to an email, pushed the send button, and instantly an error was displayed.

We checked the different connectors as well as the organizational limit for messages. The test mails were between 10 and 20 MB, and the message size limit was much higher.


The cross-check with Outlook Web Access indicated that the issue was not a configured limit on one of the Exchange connectors. Instead, a quick search directed us towards the client-specific message size limits. Especially this statement caught our attention:

For any message size limit, you need to set a value that’s larger than the actual size you want enforced. This accounts for the Base64 encoding of attachments and other binary data. Base64 encoding increases the size of the message by approximately 33%, so the value you specify should be approximately 33% larger than the actual message size you want enforced. For example, if you specify a maximum message size value of 64 MB, you can expect a realistic maximum message size of approximately 48 MB.

The message size limit for ActiveSync is 10 MB (Source). This is a server limit which can't be configured using the Exchange Admin Center. Taking the 33% Base64 overhead into account, the effective message size limit is ~6.5 MB. My customer and I were able to prove this assumption: a 10 MB mail stuck in the outbox, a 6 MB mail was sent.

How to change client-specific message size limits?

In this case, my customer and I only changed the ActiveSync limit. You can use the commands below to change the limit. This will raise the limit to ~67 MB; without the Base64 overhead, this value allows message sizes of up to 50 MB (50 MB × 1.33 ≈ 66.5 MB, which is 69,730,304 bytes for maxAllowedContentLength and 68,096 KB for maxRequestLength). You have to run these commands from an administrative CMD.

%windir%\system32\inetsrv\appcmd.exe set config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering /requestLimits.maxAllowedContentLength:69730304
%windir%\system32\inetsrv\appcmd.exe set config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.web/httpRuntime /maxRequestLength:68096
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering /requestLimits.maxAllowedContentLength:69730304
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:system.web/httpRuntime /maxRequestLength:68096
%windir%\system32\inetsrv\appcmd.exe set config "Exchange Back End/Microsoft-Server-ActiveSync/" -section:appSettings /[key='MaxDocumentDataSize'].value:69730304
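If you want to double-check the applied values, appcmd can also list a configuration section; an optional verification, shown here for the Default Web Site:

%windir%\system32\inetsrv\appcmd.exe list config "Default Web Site/Microsoft-Server-ActiveSync/" -section:system.webServer/security/requestFiltering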

Make sure that you restart the IIS after the changes. Run iisreset from an administrative CMD.

Please note that you have to run these commands again after you install an Exchange Server Cumulative Update (CU), because the files in which the changes are made will be overwritten by the CU. This statement is from Microsoft:

Any customized Exchange or Internet Information Server (IIS) settings that you made in Exchange XML application configuration files on the Exchange server (for example, web.config files or the EdgeTransport.exe.config file) will be overwritten when you install an Exchange CU. Be sure to save this information so you can easily re-apply the settings after the install. After you install the Exchange CU, you need to re-configure these settings.

The maximum size for a message sent by Exchange Web Services clients is 64 MB, which is much more than the 10 MB for ActiveSync. This might explain why customers that use the Outlook for iOS app might not recognize this issue.

EDIT: Today I found a blog post written by Frank Zöchling in June 2018, which addresses this topic.

Veeam and StoreOnce: Wrong FC-HBA driver/ firmware causes Windows BSoD

This posting is ~2 years old. You should keep this in mind. IT is a fast-moving business. This information might be outdated.

One of my customers bought a very nice new backup solution, which consists of a

  • HPE StoreOnce 5100 with ~ 144 TB usable capacity,
  • and a new HPE ProLiant DL380 Gen10 with Windows Server 2016

as the new backup server. StoreOnce and backup server will be connected with 8 Gb Fibre-Channel and 10 GbE to the existing SAN and network. Veeam Backup & Replication 9.5 U3a is already in use, as well as VMware vSphere 6.5 Enterprise Plus. The backend storage is an HPE 3PAR 8200.

This setup allows the usage of Catalyst over Fibre-Channel together with Veeam Storage Snapshots, and this is what we intended to use.

I wrote about a similar setup some month ago: Backup from a secondary HPE 3PAR StoreServ array with Veeam Backup & Replication.

The OS on the StoreOnce was up to date (3.16.7), and Windows Server 2016 was installed using HPE Intelligent Provisioning. Afterwards, drivers and firmware were updated using the latest SPP 2018.11. So all drivers and firmware were also up to date.

After doing the zoning and some other configuration tasks, I installed Veeam Backup & Replication 9.5 U3 and configured my Catalyst over Fibre-Channel repository. I set up a test backup… and the server failed with a Blue Screen of Death… which has been pretty rare since Server 2008 R2.


I did some tests:

  • backup from 3PAR Storage Snapshots to Catalyst over FC repository – BSoD
  • backup without 3PAR Storage Snapshots to Catalyst over FC repository – BSoD
  • backup from 3PAR Storage Snapshots to Catalyst over LAN repository – works fine
  • backup without 3PAR Storage Snapshots to Catalyst over LAN repository – works fine
  • backup from 3PAR Storage Snapshots to default repository – works fine
  • backup without 3PAR Storage Snapshots to default repository – works fine

So the error had to be caused by the usage of Catalyst over Fibre-Channel. I filed a case with HPE, uploaded gigabytes of memory dumps, and heard pretty little during the next week.

HPE StoreOnce Support Matrix FTW!

After a week, I got an email from HPE support with a question about the installed HBA driver and firmware. I told them the version numbers, and a day later I was requested to downgrade (!) drivers and firmware.

The customer has an SN1100Q (P9D93A & P9D94A) HBA in his backup server, and I was requested to downgrade the firmware to version 8.05.61, as well as the driver to 9.2.5.20. With this firmware and driver version, the backup was running fine (~750 MB/s throughput).

I found the HPE StoreOnce Support Matrix on HPE's SPOCK website. The matrix confirmed the firmware and driver version requirement.

Fun fact: None of the listed HBAs (except the Synergy HBAs) is supported with the latest StoreOnce G2 products.

Lessons learned

You should take a look at those support matrices – always! HPE confirmed that the first-level recommendation "Have you tried updating to the latest firmware?" can cause problems like these. The fact that the factory ships the server with the latest firmware does not make this easier.