Wrong iovDisableIR setting on ProLiant Gen8 might cause a PSOD

TL;DR: There’s a script at the bottom of the page that fixes the issue.

Some days ago, this HPE customer advisory caught my attention:

Advisory: (Revision) VMware – HPE ProLiant Gen8 Servers running VMware ESXi 5.5 Patch 10, VMware ESXi 6.0 Patch 4, Or VMware ESXi 6.5 May Experience Purple Screen Of Death (PSOD): LINT1 Motherboard Interrupt

And there is also a corresponding VMware KB article:

ESXi host fails with intermittent NMI PSOD on HP ProLiant Gen8 servers

It isn’t clear WHY this setting was changed, but in VMware ESXi 5.5 Patch 10, 6.0 Patch 4, 6.0 U3, and 6.5, the Intel IOMMU’s interrupt remapper functionality was disabled by default. So if you are running one of these ESXi versions on an HPE ProLiant Gen8, you should check whether you are affected.

To make it clear again: only HPE ProLiant Gen8 models are affected. Newer (Gen9) and older (G6, G7) models are not.

Currently there is no resolution, only a workaround. The iovDisableIR setting must be set to FALSE. If it’s set to TRUE, the Intel IOMMU’s interrupt remapper functionality is disabled.

To check this setting, you have to SSH to each host and query the current value with esxcli.
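As a rough sketch of that check: the command below is the one I recall from the VMware KB (`esxcli system settings kernel list -o iovDisableIR`). The helper function `iov_configured` is a hypothetical name introduced here, and it assumes that the last three columns of the output are Configured, Runtime and Default. Verify the column layout on your own host before relying on it.

```shell
#!/bin/sh
# Print the "Configured" value of iovDisableIR from esxcli output on stdin.
# Assumption: the row starts with the setting name and its last three
# columns are Configured, Runtime, Default (the description may contain spaces).
iov_configured() {
    awk '$1 == "iovDisableIR" { print $(NF-2) }'
}

# Only query the host when esxcli is actually available (i.e. on ESXi).
if command -v esxcli >/dev/null 2>&1; then
    value=$(esxcli system settings kernel list -o iovDisableIR | iov_configured)
    echo "iovDisableIR configured: ${value}"
fi
```

If the configured value is TRUE on a Gen8 host, the host is using the setting that the advisory warns about.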

I have written a small PowerCLI script that uses the Get-EsxCli cmdlet to check all hosts in a cluster. The script only checks the setting; it doesn’t change it.

Here’s another script that analyzes and fixes the issue.
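For a single host, the same analyze-and-fix logic can also be sketched directly in the ESXi shell. This is not the PowerCLI script mentioned above, only a minimal illustration. It assumes the `esxcli system settings kernel set --setting=iovDisableIR -v FALSE` command from the VMware KB and the same output column layout as the list command. The function names `needs_fix` and `apply_fix` are hypothetical. Note that running it on an ESXi host changes the setting, and the change only takes effect after a reboot.

```shell
#!/bin/sh
# Return success if the configured value means the workaround is needed.
needs_fix() {
    [ "$1" = "TRUE" ]
}

# Set iovDisableIR to FALSE (re-enables the interrupt remapper after reboot).
apply_fix() {
    esxcli system settings kernel set --setting=iovDisableIR -v FALSE
}

# Do nothing unless esxcli is available, i.e. unless this runs on an ESXi host.
if command -v esxcli >/dev/null 2>&1; then
    # Assumption: the third-from-last column is the Configured value.
    current=$(esxcli system settings kernel list -o iovDisableIR |
        awk '$1 == "iovDisableIR" { print $(NF-2) }')
    if needs_fix "$current"; then
        apply_fix && echo "iovDisableIR set to FALSE; reboot the host to apply."
    else
        echo "iovDisableIR is already ${current}; nothing to do."
    fi
fi
```

Keeping the decision in `needs_fix` separate from the change in `apply_fix` makes it easy to comment out the latter and use the script as a read-only check first.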

Patrick Terlisten

vcloudnine.de is the personal blog of Patrick Terlisten. Patrick has nearly 2 decades of experience in IT, especially in the areas of infrastructure, cloud, automation and industrialization. Patrick was selected as VMware vExpert (2014 - 2016), as well as PernixData PernixPro.

Feel free to follow him on Twitter and/or leave a comment.

4 thoughts on “Wrong iovDisableIR setting on ProLiant Gen8 might cause a PSOD”

  1. Patrick Long

    I worked extensively with HPE and VMware in late November and early December 2016 to identify and troubleshoot this issue – my testing showed that for us, this issue only appeared for Gen8 servers with Intel Ivy Bridge procs (v2), and only sporadically when they were under significant load; our Gen8 servers with Intel Sandy Bridge (v0) procs were not affected but as always YMMV – I would (and did) follow the KB recommendation to revert the setting to FALSE for ALL Gen8 servers. I would like to clarify that the change in the default setting to TRUE for iovDisableIR was in fact made PRIOR to 6.0 U3 as you indicated; it was actually changed in ESXi 6.0 Patch 4 build-4600944 released 2016-11-22. The HPE Advisory is correct in this regard, but the VMware KB also incorrectly states the change occurred for ESXi 6.0 in Update 3 and I have submitted a request to have it corrected. VMware advised me that the reason for this reversal of the default setting of iovDisableIR was that the prior default of FALSE “was causing issues with other vendors systems” although I could not get anything more specific out of them than that – they would not identify to me which vendors were affected or what issues were caused.

    I would think that this issue could be fixed with a Gen8 BIOS revision, as the latest available for my affected DL380p servers was released 7/1/2015, but I was not given any information by HPE to indicate that a new release was forthcoming or even being worked on so for now please follow the Advisory/KB recommendation.

  2. Juan Fernandez

    Hi Patrick. Very useful information.

    Maybe a completely foolish question: are Gen9 servers completely out of scope for the KB? In your opinion, could changing iovDisableIR to FALSE on Gen9 servers be considered a best practice in any way?
