TL;DR: There’s a script at the bottom of the page that fixes the issue.
Some days ago, this HPE customer advisory caught my attention:
And there is also a corresponding VMware KB article:
ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers
It isn’t clear WHY this setting was changed, but in VMware ESXi 5.5 patch 10, 6.0 patch 4, 6.0 U3, and 6.5, the Intel IOMMU’s interrupt remapper functionality was disabled by default. So if you are running one of these ESXi versions on an HPE ProLiant Gen8, you might want to check whether you are affected.
To make it clear again, only HPE ProLiant Gen8 models are affected. Newer (Gen9) and older (G6, G7) models are not.
Currently there is no resolution, only a workaround. The iovDisableIR setting must be set to FALSE. If it’s set to TRUE, the Intel IOMMU’s interrupt remapper functionality is disabled.
To check this setting, you have to SSH to each host, and use esxcli to check the current setting:
[root@esx1:~] esxcli system settings kernel list -o iovDisableIR
Name          Type  Description                                Configured  Runtime  Default
------------  ----  -----------------------------------------  ----------  -------  -------
iovDisableIR  Bool  Disable Interrupt Routing in the IOMMU...  FALSE       FALSE    TRUE
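If you want to evaluate that output in a script instead of reading it by eye, the check can be sketched roughly like this. It is only a minimal sketch: the function names iov_state and needs_fix are my own (hypothetical), and it assumes that awk is available in the shell and that the last three columns are always Configured, Runtime and Default, as in the output above.

```shell
#!/bin/sh
# Print the Configured and Runtime values for iovDisableIR.
# Reads the output of "esxcli system settings kernel list" on stdin.
# Assumption: the last three columns are Configured, Runtime, Default.
iov_state() {
    awk '$1 == "iovDisableIR" { print $(NF-2), $(NF-1) }'
}

# Return 0 if the configured or the runtime value is TRUE,
# i.e. the interrupt remapper is (or will be) disabled.
# Returns 1 otherwise, including when the setting is not listed at all.
needs_fix() {
    state=$(iov_state)
    case "$state" in
        *TRUE*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Only query a live system if esxcli is actually present.
if command -v esxcli >/dev/null 2>&1; then
    if esxcli system settings kernel list -o iovDisableIR | needs_fix; then
        echo "iovDisableIR is TRUE: interrupt remapper is disabled"
    else
        echo "iovDisableIR is not TRUE: nothing to do"
    fi
fi
```

The Default column is ignored on purpose: on the affected builds the default is TRUE even when the host has already been configured correctly.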
I have written a small PowerCLI script that uses the Get-EsxCli cmdlet to check all hosts in a cluster. The script only checks the iovDisableIR setting; it does not change it.
Here’s another script that analyzes and fixes the issue.
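For a single host, the fix can also be applied directly in the ESXi shell. The following is a minimal sketch, not the PowerCLI script mentioned above: the function name fix_iov and the DRY_RUN variable are hypothetical, and by default the command is only printed, not executed. As far as I know, kernel settings like this one only take effect after a reboot of the host.

```shell
#!/bin/sh
# Hypothetical helper: set iovDisableIR to FALSE on the local ESXi host.
# With DRY_RUN=1 (the default in this sketch) the command is only printed.
fix_iov() {
    cmd="esxcli system settings kernel set --setting=iovDisableIR --value=FALSE"
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $cmd"
    else
        # Run the command and remind the operator that a reboot is needed.
        $cmd && echo "iovDisableIR set to FALSE; reboot the host to apply"
    fi
}

fix_iov
```

Run it with DRY_RUN=0 once you are sure the host is an affected Gen8 model and a reboot can be scheduled.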
I worked extensively with HPE and VMware in late November and early December 2016 to identify and troubleshoot this issue – my testing showed that for us, this issue only appears for Gen8 servers with Intel Ivy Bridge procs (v2), and only sporadically when they were under significant load; our Gen8 servers with Intel Sandy Bridge (v0) procs were not affected but as always YMMV – I would (and did) follow the KB recommendation to revert the setting to FALSE for ALL Gen8 servers. I would like to clarify that the change in the default setting to TRUE for iovDisableIR was in fact made PRIOR to 6.0 U3 as you indicated, it actually was changed in ESXi 6.0 Patch 4 build-4600944 released 2016-11-22. The HPE Advisory is correct in this regard, but the VMware KB also incorrectly states the change occurred for ESXi 6.0 in Update 3 and I have submitted a request to have it corrected. VMware advised me that the reason for this reversal of the default setting of iovDisableIR was that the prior default of FALSE “was causing issues with other vendors systems” although I could not get anything more specific out of them than that – they would not identify to me which vendors were affected or what issues were caused.
I would think that this issue could be fixed with a Gen8 BIOS revision, as the latest available for my affected DL380p servers was released 7/1/2015, but I was not given any information by HPE to indicate that a new release was forthcoming or even being worked on so for now please follow the Advisory/KB recommendation.
Hi Patrick,
thank you for your detailed information about this issue. Much appreciated!
Hi Patrick. Very useful information.
Maybe a completely foolish question: are Gen9 servers completely out of scope for the KB? In your opinion, could changing iovDisableIR to FALSE on Gen9 servers be considered a best practice in any way?
Hi,
according to Patrick Long’s comment, Gen9 servers are out of scope. It seems that only Gen8 servers with Intel Ivy Bridge procs (v2 CPUs) are affected.
https://kb.vmware.com/selfservice/microsites/search.do?language=en_US&cmd=displayKC&externalId=2147325
Says basically to not set the setting to false?
Resolution
VMware recommends contacting the hardware manufacturer for updated BIOS or possible workarounds.
Note: A prior version of this KB article recommended that customers experiencing the problem described above work around it by configuring ESXi to disable the Intel® VT-d interrupt remapper (setting boot option iovDisableIR=TRUE and rebooting). VMware ESXi 5.5 p10, 6.0 p04, 6.0 U3 and 6.5 by default disable the Intel® VT-d interrupt remapper for this purpose.
VMware has recently received several reports indicating that disabling the Intel® VT-d interrupt remapper is causing ESXi host failure on HPE Gen8 platforms, see ESXi host fails with intermittent NMI purple diagnostic screen on HP ProLiant Gen8 servers (2149043). VMware is no longer recommending that the Intel® VT-d interrupt remapper be disabled to work around the Intel® VT-d erratum described in this article. VMware is recommending that the fix for the erratum be applied in the BIOS as described in the Intel® specification updates for the affected processors.