An ESXi host halts with a purple diagnostic screen. The purple diagnostic screen shows a message similar to:
ESXi 6.x
ESXi 7.x
ESXi 8.x
The machine check architecture is a mechanism within a CPU to detect and report hardware issues. When a problem is detected, a Machine Check Exception (MCE) is thrown. If an MCE is thrown and a purple diagnostic screen displays, a hardware problem has caused it. There is no other way to generate an MCE.
When the system faults with a purple screen, capture the screen output, then reboot the server and contact your hardware vendor. In the meantime, the information about the fault can be decoded to get a better idea of what may be happening.
The global MCA status register (MCG_STATUS) reports whether an MCE is in progress, and whether the instruction pointer pushed onto the stack can be used to reliably restart program execution or is directly associated with the error.
The global capabilities (MCG_CAP) register identifies the capabilities of the machine-check architecture of the processor. The lower 8 bits specify the number of hardware-unit error-reporting banks present in a particular processor. A bank of error-reporting registers is associated with a specific hardware unit (or group of hardware units), though the association is vendor- and model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.
Each error-reporting bank comprises several registers. Of primary interest during a machine check exception is the status register (MCi_STATUS) of the bank, which contains detailed information regarding the machine check exception, and the address (MCi_ADDR) and miscellaneous (MCi_MISC) registers, which may provide additional information.
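As a small illustration of the bank count field described above, the following Python sketch masks the low 8 bits of an MCG_CAP value. The register value is hypothetical and chosen only for illustration, and the function name is an assumption rather than part of any ESXi tool.

```python
def mca_bank_count(mcg_cap: int) -> int:
    """Return the number of error-reporting banks (low 8 bits of MCG_CAP)."""
    return mcg_cap & 0xFF


# Hypothetical MCG_CAP value, used only to show the masking step.
example_mcg_cap = 0x0F000C14
print(mca_bank_count(example_mcg_cap))  # low byte 0x14 -> 20
```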
Different versions of ESXi log the machine-check architecture register contents using different formats.
Regardless of the version of ESXi, these items of information should be available:
The log message consists of one line for each bank of interest, including the physical CPU number, the text "MCA:", the error class, how the error was reported, the MCG_STATUS register (G), the bank number (B), the MCi_STATUS register (S), the MCi_ADDR register (A), the MCi_MISC register (M), the decoded system physical address and size (P) in 6.7 and later, and a human-readable interpretation of the error.
cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.
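The fields in a line like the one above can be pulled apart with a regular expression. The following Python sketch is based only on the sample line shown here, so the pattern, the function name, and the field names are assumptions; real log lines may differ between ESXi versions. It assumes the G, S, A, M, and P values are hexadecimal and the bank number is decimal, and it leaves the size after the slash as raw text.

```python
import re

# Pattern derived from the single sample line in this article (an assumption).
MCA_LINE = re.compile(
    r"cpu(?P<cpu>\d+):.*?MCA:.*?:\s*"
    r"(?P<error_class>\S+)\s+(?P<reported>\S+)\s+"
    r"G(?P<G>[0-9a-fA-F]+)\s+B(?P<B>\d+)\s+"
    r"S(?P<S>[0-9a-fA-F]+)\s+A(?P<A>[0-9a-fA-F]+)\s+M(?P<M>[0-9a-fA-F]+)"
    r"(?:\s+P(?P<P>[0-9a-fA-F]+)/(?P<P_size>[0-9a-fA-F]+))?"  # 6.7 and later
    r"\s+(?P<text>.*)"
)


def parse_mca_line(line):
    """Split an MCA log line into its fields, or return None if it does not match."""
    match = MCA_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Register values are assumed to be printed in hexadecimal.
    for key in ("G", "S", "A", "M", "P"):
        if fields[key] is not None:
            fields[key] = int(fields[key], 16)
    fields["cpu"] = int(fields["cpu"])
    fields["B"] = int(fields["B"])  # bank number, assumed decimal
    return fields


sample = ("cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 "
          "Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.")
fields = parse_mca_line(sample)
print(fields["error_class"], fields["reported"], hex(fields["S"]))
```

The P group is optional so that lines from versions earlier than 6.7, which do not include the decoded physical address, still match.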
The error class may be one of the following:
How the error was reported may be one of the following:
The global status register is 64 bits wide, but only the low 3 bits have meaning. The high 61 bits are reserved. Convert the global status register value to binary to compare it against the bit layout:
63:3 | 2 | 1 | 0 |
Reserved | MCIP | EIPV | RIPV |
For example, the global status register value "5" is equal to 0101 in binary. This translates to MCIP=1, EIPV=0, RIPV=1, which indicates that there is a machine check in progress, and the Restart IP is valid.
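The same conversion can be sketched in Python. The function name is an assumption; the bit positions come from the layout above.

```python
def decode_mcg_status(value: int) -> dict:
    """Decode the three defined bits of MCG_STATUS; bits 63:3 are reserved."""
    return {
        "RIPV": bool(value & 0x1),  # bit 0: Restart IP valid
        "EIPV": bool(value & 0x2),  # bit 1
        "MCIP": bool(value & 0x4),  # bit 2: machine check in progress
    }


print(decode_mcg_status(0x5))
# {'RIPV': True, 'EIPV': False, 'MCIP': True}
```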
Each bank's MCi_STATUS register contains information related to a machine-check error. This information is only meaningful and logged if the Valid flag (bit 63) is set. This register is 64 bits wide.
63 | 62 | 61 | 60 | 59 | 58 | 57 | 56:32 | 31:16 | 15:0 |
VAL | OVER | UC | EN | MISCV | ADDRV | PCC | Other Information | Extended Error Code | MCA Error Code |
Bits 56:32 contain other information, which may be reserved, used for counters, or hold other information that is model-specific.
Bits 31:16 contain a model-specific extended error code.
Bits 15:0 contain the machine-check architecture-defined error code for the machine-check error condition detected. These error codes are the same for all processors that implement the machine-check architecture, though individual processor models may define additional nuance.
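The field layout above can be applied with shifts and masks. The following Python sketch splits an MCi_STATUS value into its flags and code fields, using the status value from the sample log line earlier in this article. The function and dictionary names are assumptions.

```python
# Single-bit flags and their bit positions, as laid out above.
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}


def decode_mci_status(value: int) -> dict:
    """Split a 64-bit MCi_STATUS value into flags and code fields."""
    fields = {name: (value >> bit) & 1 for name, bit in FLAG_BITS.items()}
    fields["other_info"] = (value >> 32) & 0x1FFFFFF       # bits 56:32 (25 bits)
    fields["extended_error_code"] = (value >> 16) & 0xFFFF  # bits 31:16
    fields["mca_error_code"] = value & 0xFFFF               # bits 15:0
    return fields


status = decode_mci_status(0xbf80000000000114)
print(status["VAL"], status["UC"], format(status["mca_error_code"], "016b"))
# 1 1 0000000100010100
```

Printing the MCA error code as a 16-bit binary string makes it easier to compare against the bit patterns that define the simple and compound error codes.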
The machine-check architecture defines several errors which may be present in any bank's status register, grouped into Simple and Compound error codes. Identify the pattern which matches the contents of the status register.
Simple Error Codes reflect a specific fault, exactly matching the contents of the status register:
0000 0000 0000 0000 – No error has been reported to this bank.
0000 0000 0000 0001 – Unclassified. This error has not been classified into the MCA error classes. The additional information section may have meaning.
0000 0000 0000 0010 – Parity error in internal microcode ROM.
0000 0000 0000 0011 – The BINIT# from another processor caused this processor to enter machine-check.
0000 0000 0000 0100 – Functional redundancy check (FRC) master/slave error.
0000 0000 0000 0101 – Internal parity error.
0000 0100 0000 0000 – Internal timer error.
0000 01xx xxxx xxxx – Internal unclassified error. At least one x equals 1.
Compound Error Codes follow a pattern, and define multiple aspects of the error with a single error number:
000F 0000 0000 11LL – Generic cache hierarchy errors.
000F 0000 0001 TTLL – TLB errors.
000F 0000 1MMM CCCC – Memory controller errors (Intel-only).
000F 0001 RRRR TTLL – Memory errors in the cache hierarchy.
000F 1PPT RRRR IILL – Bus and interconnect errors.
Compound Error Code sub-fields define sections of a compound error code. Use these to populate the template defined by the compound error code:
TT (Transaction Type):
00 – Instruction.
01 – Data.
10 – Generic.
11 – Reserved.
LL (Memory Hierarchy Level):
00 – Level 0.
01 – Level 1.
10 – Level 2.
11 – Generic.
MMM (Memory Transaction Type):
000 – Generic undefined request.
001 – Memory read error.
010 – Memory write error.
011 – Address or command error.
100 – Memory scrubbing error.
101-111 – Reserved.
CCCC (Channel):
0000-1110 – Channel number.
1111 – Channel not specified.
RRRR (Request):
0000 – Generic error.
0001 – Generic read.
0010 – Generic write.
0011 – Data read.
0100 – Data write.
0101 – Instruction fetch.
0110 – Prefetch.
0111 – Evict.
1000 – Snoop (probe).
PP (Participation):
00 – Local node originated the request.
01 – Local node responded to the request.
10 – Local node observed error as third-party.
11 – Generic.
T (Time-out):
0 – Request did not time out.
1 – Request did time out.
II (Memory or I/O):
00 – Memory access.
01 – Reserved.
10 – I/O.
11 – Other.
The machine-check architecture allows for bits or groups of bits within the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers to take on additional meaning based on the processor model and the bank number. Listing the field meanings for all processor families is outside the scope of this article.
To interpret the additional contents of the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers, review the documentation for the specific processor model.
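The pattern matching for the simple and compound error codes described earlier can be sketched in Python. This is a minimal illustration under stated assumptions: all names are hypothetical, the descriptions are shortened paraphrases of the lists in this article, and the F bit is treated as a wildcard because its meaning is not given here. Model-specific fields are not decoded.

```python
SIMPLE = {
    0x0000: "No error has been reported to this bank",
    0x0001: "Unclassified",
    0x0002: "Parity error in internal microcode ROM",
    0x0003: "Machine-check entered because of a signal from another processor",
    0x0004: "Functional redundancy check (FRC) master/slave error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}
TT = ("Instruction", "Data", "Generic", "Reserved")
LL = ("Level 0", "Level 1", "Level 2", "Generic")
MMM = ("Generic undefined request", "Memory read error", "Memory write error",
       "Address or command error", "Memory scrubbing error")
RRRR = ("Generic error", "Generic read", "Generic write", "Data read",
        "Data write", "Instruction fetch", "Prefetch", "Evict", "Snoop (probe)")
PP = ("Local node originated the request", "Local node responded to the request",
      "Local node observed error as third-party", "Generic")
II = ("Memory access", "Reserved", "I/O", "Other")


def pick(table, index, default):
    """Look up a sub-field value, falling back when the value is not listed."""
    return table[index] if index < len(table) else default


def decode_mca_error_code(code: int) -> str:
    """Match bits 15:0 of MCi_STATUS against the simple and compound patterns."""
    code &= 0xFFFF
    if code in SIMPLE:
        return SIMPLE[code]
    if (code & 0xFC00) == 0x0400:  # 0000 01xx xxxx xxxx, at least one x set
        return "Internal unclassified error"
    c = code & 0xEFFF  # ignore the F bit (bit 12); its meaning is not listed here
    ll = LL[c & 0x3]
    tt = TT[(c >> 2) & 0x3]
    rrrr = pick(RRRR, (c >> 4) & 0xF, "not listed")
    if (c & 0xFFFC) == 0x000C:
        return f"Generic cache hierarchy error: {ll}"
    if (c & 0xFFF0) == 0x0010:
        return f"TLB error: {tt}, {ll}"
    if (c & 0xFF80) == 0x0080:
        mmm = pick(MMM, (c >> 4) & 0x7, "Reserved")
        channel = c & 0xF
        where = "channel not specified" if channel == 0xF else f"channel {channel}"
        return f"Memory controller error: {mmm}, {where}"
    if (c & 0xFF00) == 0x0100:
        return f"Memory error in the cache hierarchy: {rrrr}, {tt}, {ll}"
    if (c & 0xF800) == 0x0800:
        pp = PP[(c >> 9) & 0x3]
        timeout = "request timed out" if c & 0x100 else "request did not time out"
        ii = II[(c >> 2) & 0x3]
        return f"Bus or interconnect error: {pp}, {timeout}, {rrrr}, {ii}, {ll}"
    return "No matching pattern"


print(decode_mca_error_code(0x0114))
# Memory error in the cache hierarchy: Generic read, Data, Level 0
```

The value 0x0114 is the low 16 bits of the status register in the sample log line, and the result agrees with the interpretation printed at the end of that line.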
Some kinds of machine check errors do not cause ESXi to panic.
Corrected errors:
Some errors are completely corrected by hardware, such as memory errors that are corrected by Error Correcting Code (ECC) hardware, but the hardware may still report them to ESXi for advisory reasons.
Recoverable errors:
Some errors cannot be corrected by hardware, but can still be recovered from by terminating the task that encountered the error. For example, when a memory error is too severe to be corrected by ECC hardware, it may still be possible for the system to terminate only the virtual machine or process that was using the corrupted data, while allowing other virtual machines and processes to continue running. In other cases, however, an error that is recoverable in theory cannot actually be recovered from because the ESXi kernel was using the corrupted data, so ESXi still must panic.
Both corrected errors and recoverable errors appear in the vmkernel log and can be decoded using the instructions in this article. If a virtual machine or other process had to be terminated as part of recovery, the details generally are logged as well.
For more information, see:
Intel - Chapters 15 and 16 of the Intel 64 and IA-32 Architectures Software Developer's Manual.