Decoding Machine Check Error (MCE) output after an ESXi panic (Purple Screen)

Article ID: 374985

Updated On: 08-21-2024

Products

VMware vSphere ESX 6.x
VMware vSphere ESX 7.x
VMware vSphere ESX 8.x

Issue/Introduction

An ESXi host halts with a purple diagnostic screen. The screen shows a message similar to:

Machine Check Exception on PCPU42 in world 10021342342
System has encountered a Hardware Error - Please contact the hardware vendor

Environment

ESXi 6.x
ESXi 7.x
ESXi 8.x

Cause

The machine check architecture is a mechanism within a CPU to detect and report hardware issues. When a problem is detected, a Machine Check Exception (MCE) is thrown. If an MCE is thrown and a purple diagnostic screen displays, a hardware problem has caused it. There is no other way to generate an MCE.

When the system faults with a purple screen, capture the screen output, then reboot the server and contact your hardware vendor. In the meantime, the information about the fault itself can be decoded to get a better idea of what may be happening.

Resolution

When you see an MCE purple diagnostic screen, take a screenshot, reboot, and collect the logs.

Recent CPUs from Intel and AMD implement a machine-check architecture that detects and reports hardware issues, including system bus errors, RAM (ECC and parity) errors, and other CPU errors. A set of model-specific registers (MSRs) is used to report errors.

When a hardware error occurs, the global and bank-specific machine-check architecture status registers are populated with information about the cause and whether the CPU can safely continue execution. In the case of a correctable error, ESXi reports the incident and register contents in the VMkernel logs. If an error is uncorrectable and the CPU cannot continue safely, ESXi halts with a purple diagnostic screen.

During an MCE, the contents of the machine-check architecture registers are logged. The messages appear on the purple diagnostic screen itself and are recorded in the log file within the VMkernel zdump file. If serial-line logging is configured, the same messages are emitted on the serial port.

Machine-Check Architecture Registers:

The global MCA status register (MCG_STATUS) reports whether an MCE is in progress, and whether the instruction pointer pushed on to the stack can be used to reliably restart program execution or is directly associated with the error.

The global capabilities (MCG_CAP) register identifies the capabilities of the machine-check architecture of the processor. The lower 8 bits specify the number of hardware-unit error-reporting banks present in a particular processor. A bank of error-reporting registers is associated with a specific hardware unit (or group of hardware units), though the association is vendor- and model-specific. For more information, see the vendor documentation listed in the Additional Information section of this article.

Each error-reporting bank comprises several registers. Of primary interest during a machine check exception is the status register (MCi_STATUS) of the bank, which contains detailed information about the machine check exception, and the address (MCi_ADDR) and miscellaneous (MCi_MISC) registers, which may provide additional information.
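
As a minimal sketch of the bank-count rule above, the following Python snippet masks the lower 8 bits of an MCG_CAP value. The function name and the sample register value are hypothetical and chosen only for illustration; they do not come from a real host.

```python
def mca_bank_count(mcg_cap: int) -> int:
    """Return the number of error-reporting banks encoded in MCG_CAP."""
    # The lower 8 bits of MCG_CAP hold the bank count.
    return mcg_cap & 0xFF


# Hypothetical MCG_CAP value, made up for illustration only.
example_mcg_cap = 0x0F000C14
print(mca_bank_count(example_mcg_cap))  # low byte 0x14, so 20 banks
```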

Identifying register contents

Different versions of ESXi log the machine-check architecture register contents using different formats.

Regardless of the version of ESXi, these items of information should be available:

Physical CPU number
Global status register
Bank number
Bank status register
Bank address register
Bank miscellaneous register

ESXi 6.5 and later:

The log message consists of one line for each bank of interest, including the physical CPU number, the text "MCA:", the error class, how the error was reported, the MCG_STATUS register (G), the bank number (B), the MCi_STATUS register (S), the MCi_ADDR register (A), the MCi_MISC register (M), the decoded system physical address and size (P) in 6.7 and later, and a human-readable interpretation of the error.

cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 Aaf9e74900 M86 Paf9e74900/40 Cache Hierarchy: Level 0 Data Cache Read Error.

The error class may be one of the following:

UC: Uncorrected, unrecoverable
SRAR: Uncorrected, recoverable, action required (Intel)
SRAO: Uncorrected, recoverable, action optional (Intel)
UCNA: Uncorrected, no action required (Intel)
UCR: Uncorrected, recoverable (AMD)
CE: Corrected
DE: Deferred (AMD)

How the error was reported may be one of the following:

Init: Found during boot-time initialization (possibly from prior to the reboot)
Poll: Periodic polling of the MCA banks
Excp: Machine Check Exception handler
Intr: Corrected Machine Check Interrupt handler
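
The log line layout described above can be split into its fields with a regular expression. The following Python sketch is an illustration, not a VMware tool: the pattern and the function name `parse_mca_line` are assumptions based only on the sample line in this article, and it assumes that the G, S, A, and M values are hexadecimal without a `0x` prefix. The bank number and the P field are kept as text because this article does not state their number base.

```python
import re
from typing import Optional

# Pattern inferred from the sample log line in this article (an assumption).
MCA_LINE = re.compile(
    r"cpu(?P<pcpu>\d+):.*?MCA:.*?"
    r"\b(?P<error_class>UC|SRAR|SRAO|UCNA|UCR|CE|DE)\s+"
    r"(?P<reported_by>Init|Poll|Excp|Intr)\s+"
    r"G(?P<mcg_status>[0-9a-fA-F]+)\s+"
    r"B(?P<bank>\w+)\s+"
    r"S(?P<status>[0-9a-fA-F]+)\s+"
    r"A(?P<addr>[0-9a-fA-F]+)\s+"
    r"M(?P<misc>[0-9a-fA-F]+)"
    r"(?:\s+P(?P<phys>[0-9a-fA-F]+/[0-9a-fA-F]+))?"  # only in 6.7 and later
    r"\s*(?P<interpretation>.*)"
)


def parse_mca_line(line: str) -> Optional[dict]:
    """Split one MCA log line into named fields, or return None if no match."""
    match = MCA_LINE.search(line)
    if match is None:
        return None
    fields = match.groupdict()
    fields["pcpu"] = int(fields["pcpu"])
    # Register values are assumed to be hexadecimal.
    for name in ("mcg_status", "status", "addr", "misc"):
        fields[name] = int(fields[name], 16)
    return fields


sample = ("cpu42:...)ALERT: MCA: ...: UC Excp G5 B1 Sbf80000000000114 "
          "Aaf9e74900 M86 Paf9e74900/40 "
          "Cache Hierarchy: Level 0 Data Cache Read Error.")
record = parse_mca_line(sample)
print(record["error_class"], record["reported_by"], hex(record["status"]))
```

Lines that do not follow this layout return `None`, so the function can be applied to every line of a log extract and the non-matching lines discarded.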

Automatic Interpretation:

VMware ESXi attempts to interpret the contents of the status register(s) for display in the log and on the purple diagnostic screen.

For example:

Cache Hierarchy: Level 0 Data Cache Read Error.
Bus error, node originated, read, memory access

Note: Where the automatic interpretation and vendor interpretation disagree, the interpretation of the vendor should be taken as correct. The raw contents of the status registers are also available, so they can be manually reviewed.

Decoding the global MCA status (MCG_STATUS) register.

The global status register is 64 bits wide, but only the low 3 bits have meaning. The high 61 bits are reserved. Convert the logged value to binary to read the individual bits.

Bits	63:3	2	1	0
Field	Reserved	MCIP	EIPV	RIPV

Bit 2: Machine Check In Progress. Identifies whether a machine check is in progress, and whether further fields should be consulted.
Bit 1: Error IP Valid. Identifies whether the instruction pointer pushed on to the stack is directly related to the error.
Bit 0: Restart IP Valid. Identifies whether the program execution can be reliably restarted at the instruction pointer pushed on to the stack.

For example, the global status register value "5" is equal to 0101 in binary. This translates to MCIP=1, EIPV=0, RIPV=1, which indicates that there is a machine check in progress, and the Restart IP is valid.
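
The same decoding can be sketched in a few lines of Python. The function name `decode_mcg_status` is hypothetical; the bit positions are the ones listed above.

```python
def decode_mcg_status(value: int) -> dict:
    """Decode the three meaningful low bits of MCG_STATUS."""
    return {
        "MCIP": bool(value & 0b100),  # bit 2: machine check in progress
        "EIPV": bool(value & 0b010),  # bit 1: error IP valid
        "RIPV": bool(value & 0b001),  # bit 0: restart IP valid
    }


print(decode_mcg_status(0x5))
# {'MCIP': True, 'EIPV': False, 'RIPV': True}
```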

Overview of the bank status (MCi_STATUS) register

Each bank's MCi_STATUS register contains information related to a machine-check error. This information is only meaningful and logged if the Valid flag (bit 63) is set. This register is 64 bits wide.

Bits	63	62	61	60	59	58	57	56:32	31:16	15:0
Field	VAL	OVER	UC	EN	MISCV	ADDRV	PCC	Other Information	Extended Error Code	MCA Error Code

The high 7 bits (63:57) provide an overview of the processor state, and which of the other registers are meaningful:

Bit 63: VAL. Indicates (when set) that this bank's status (MCi_STATUS) register is valid, and that further fields should be consulted.
Bit 62: OVER. Indicates (when set) that a machine-check error occurred while the results of a previous error were still in the error-reporting register bank. May indicate that ESXi has not processed the MCE promptly, or that multiple MCEs occurred very close together.
Bit 61: UC. Indicates (when set) that the processor did not, or was not able to, correct the error condition. An ESXi host always generates a purple diagnostic screen when the processor indicates that the error condition was uncorrectable.
Bit 60: EN. Indicates (when set) that the error was enabled by the associated EEj bit of the MCi_CTL register. Will generally be 1.
Bit 59: MISCV. Indicates (when set) that the associated miscellaneous register (MCi_MISC) for this bank is valid, and contains additional information regarding the error.
Bit 58: ADDRV. Indicates (when set) that the associated address register (MCi_ADDR) for this bank is valid, and contains the memory address where the error occurred. Memory address may be physical or virtual, and dependent on the type of error encountered.
Bit 57: PCC. Indicates (when set) that the state of the processor may have been corrupted by the error condition, and that it may not be possible to reliably resume software execution.

Bits 56:32 contain other information, which may be reserved, used for counters, or hold other information that is model-specific.

Bits 31:16 contain a model-specific extended error code.

Bits 15:0 contain the machine-check architecture-defined error code for the machine-check error condition detected. These error codes are the same for all processors which implement the machine-check architecture, though individual processor models may define additional nuance.
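
The field layout above can be sketched in Python as follows. The function and dictionary names are hypothetical; the bit positions come from the layout in this section, and the sample value is the S field from the example log line earlier in this article.

```python
# Flag name -> bit position, taken from the MCi_STATUS layout above.
FLAG_BITS = {
    "VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
    "MISCV": 59, "ADDRV": 58, "PCC": 57,
}


def decode_mci_status(status: int) -> dict:
    """Split a 64-bit MCi_STATUS value into its flags and code fields."""
    fields = {name: bool((status >> bit) & 1) for name, bit in FLAG_BITS.items()}
    fields["other_info"] = (status >> 32) & 0x1FFFFFF  # bits 56:32 (25 bits)
    fields["extended_code"] = (status >> 16) & 0xFFFF  # bits 31:16
    fields["mca_code"] = status & 0xFFFF               # bits 15:0
    return fields


decoded = decode_mci_status(0xBF80000000000114)
print(decoded["VAL"], decoded["UC"], decoded["PCC"])  # True True True
print(hex(decoded["mca_code"]))                       # 0x114
```

For the sample value, every flag except OVER is set, the extended error code is zero, and the MCA error code is 0x0114.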

Machine-check architecture-defined error codes in the bank status (MCi_STATUS) register

The machine-check architecture defines several errors which may be present in any bank's status register, grouped into Simple and Compound error codes. Identify the pattern which matches the contents of the status register.

Simple Error Codes reflect a specific fault, exactly matching the contents of the status register:

0000 0000 0000 0000 – No error has been reported to this bank.
0000 0000 0000 0001 – Unclassified. This error has not been classified into the MCA error classes. The additional information section may have meaning.
0000 0000 0000 0010 – Parity error in internal microcode ROM
0000 0000 0000 0011 – The BINT# from another processor caused this processor to enter machine-check.
0000 0000 0000 0100 – Functional redundancy check (FRC) master/slave error.
0000 0000 0000 0101 – Internal parity error.
0000 0100 0000 0000 – Internal timer error.
0000 01xx xxxx xxxx – Internal unclassified error. At least one x equals 1

Compound Error Codes follow a pattern, and define multiple aspects of the error with a single error number:

000F 0000 0000 11LL – Generic cache hierarchy errors.
000F 0000 0001 TTLL – TLB errors.
000F 0000 1MMM CCCC – Memory controller errors (Intel-only).
000F 0001 RRRR TTLL – Memory errors in the cache hierarchy.
000F 1PPT RRRR IILL – Bus and interconnect errors.

Compound Error Code sub-fields define sections of a compound error code. Use these to populate the template defined by the compound error code:

Encoding of Transaction Type (TT) sub-field:
- 00 – Instruction
- 01 – Data
- 10 – Generic
- 11 – Reserved
Encoding of Memory Hierarchy Level (LL) sub-field:
- 00 – Level 0
- 01 – Level 1
- 10 – Level 2
- 11 – Generic
Encoding of memory transaction type (MMM) sub-field:
- 000 – Generic undefined request
- 001 – Memory read error
- 010 – Memory write error.
- 011 – Address or command error.
- 100 – Memory scrubbing error.
- 101-111 – Reserved.
Encoding of channel number (CCCC) sub-field:
- 0000-1110 – Channel number.
- 1111 – Channel not specified.
Encoding of Request (RRRR) sub-field:
- 0000 – Generic error
- 0001 – Generic read
- 0010 – Generic write
- 0011 – Data read
- 0100 – Data write
- 0101 – Instruction fetch
- 0110 – Prefetch
- 0111 – Evict
- 1000 – Snoop (probe)
Encoding of Participation Processor (PP) sub-field:
- 00 – Local node originated the request.
- 01 – Local node responded to the request.
- 10 – Local node observed error as third-party.
- 11 – Generic
Encoding of Timeout (T) sub-field:
- 0 – Request did not time out.
- 1 – Request did time out.
Encoding of Memory/IO (II) sub-field:
- 00 – Memory access
- 01 – Reserved
- 10 – I/O
- 11 – Other
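
To show how the templates and sub-field tables above fit together, here is a minimal Python sketch that decodes bits 15:0 of MCi_STATUS. It is an illustration under stated assumptions, not the interpretation logic ESXi uses: all names are hypothetical, the F bit (bit 12) is ignored because this article does not define it, and undefined sub-field values are reported as "Reserved".

```python
SIMPLE = {
    0x0000: "No error has been reported to this bank",
    0x0001: "Unclassified",
    0x0002: "Parity error in internal microcode ROM",
    0x0003: "BINT# from another processor caused machine-check entry",
    0x0004: "Functional redundancy check (FRC) master/slave error",
    0x0005: "Internal parity error",
    0x0400: "Internal timer error",
}
TT = ["Instruction", "Data", "Generic", "Reserved"]
LL = ["Level 0", "Level 1", "Level 2", "Generic"]
II = ["Memory access", "Reserved", "I/O", "Other"]
MMM = ["Generic undefined request", "Memory read error", "Memory write error",
       "Address or command error", "Memory scrubbing error"]
RRRR = ["Generic error", "Generic read", "Generic write", "Data read",
        "Data write", "Instruction fetch", "Prefetch", "Evict", "Snoop (probe)"]
PP = ["Local node originated the request", "Local node responded to the request",
      "Local node observed error as third-party", "Generic"]


def _pick(table: list, index: int) -> str:
    """Look up a sub-field value; values beyond the table are reserved."""
    return table[index] if index < len(table) else "Reserved"


def decode_mca_error_code(code: int) -> str:
    """Interpret bits 15:0 of MCi_STATUS using the tables in this article."""
    code &= 0xFFFF
    if code in SIMPLE:
        return SIMPLE[code]
    if (code & 0xFC00) == 0x0400:  # 0000 01xx xxxx xxxx, at least one x set
        return "Internal unclassified error"
    c = code & ~0x1000             # drop the F bit (bit 12), not decoded here
    if c & 0xE000:
        return "Unrecognized error code"
    ll, tt_or_ii, rrrr = c & 0x3, (c >> 2) & 0x3, (c >> 4) & 0xF
    if c & 0x0800:                 # 000F 1PPT RRRR IILL
        timeout = "Request did time out" if (c >> 8) & 1 else "Request did not time out"
        parts = [PP[(c >> 9) & 0x3], timeout, _pick(RRRR, rrrr), II[tt_or_ii], LL[ll]]
        return "Bus and interconnect error: " + ", ".join(parts)
    if (c & 0x0F00) == 0x0100:     # 000F 0001 RRRR TTLL
        parts = [_pick(RRRR, rrrr), TT[tt_or_ii], LL[ll]]
        return "Memory error in the cache hierarchy: " + ", ".join(parts)
    if (c & 0x0F80) == 0x0080:     # 000F 0000 1MMM CCCC
        cccc = c & 0xF
        channel = "channel not specified" if cccc == 0xF else f"channel {cccc}"
        return "Memory controller error: " + _pick(MMM, (c >> 4) & 0x7) + ", " + channel
    if (c & 0x0FF0) == 0x0010:     # 000F 0000 0001 TTLL
        return "TLB error: " + TT[tt_or_ii] + ", " + LL[ll]
    if (c & 0x0FFC) == 0x000C:     # 000F 0000 0000 11LL
        return "Generic cache hierarchy error: " + LL[ll]
    return "Unrecognized error code"


print(decode_mca_error_code(0x0114))
# Memory error in the cache hierarchy: Generic read, Data, Level 0
```

The value 0x0114 is the low 16 bits of the sample status register shown earlier, and the result agrees with the automatic interpretation "Cache Hierarchy: Level 0 Data Cache Read Error" from that log line.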

Model-specific error codes in the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers:

The machine-check architecture allows for bits or groups of bits within the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers to take on additional meaning based on the processor model and the bank number. Listing the field meanings for all processor families is outside the scope of this article.

To interpret the additional contents of the bank status (MCi_STATUS) and miscellaneous (MCi_MISC) registers, review the documentation for the specific processor model.

Other considerations

Information reported by the machine-check architecture provides aid in troubleshooting a hardware issue. However, the information available from the MCA error code may be insufficient to root-cause the issue. If more information is required, refer to the processor documentation from the manufacturer.
Information reported by the machine-check architecture should be considered in context of other errors when attempting to determine a pattern of outages.
If the machine-check architecture reports invalid information, but an MCE has occurred, this is still reflective of a hardware fault.
Providing the full machine-check architecture register contents to the hardware vendor may assist their investigation into the cause of the hardware fault.

Additional Information

Some kinds of machine check errors do not cause ESXi to panic.
Corrected errors:
Some errors are completely corrected by hardware, such as memory errors that are corrected by Error Correcting Code (ECC) hardware, but the hardware may still report them to ESXi for advisory reasons.
Recoverable errors:
Some errors cannot be corrected by hardware, but can still be recovered from by terminating the task that encountered the error. For example, when a memory error is too severe to be corrected by ECC hardware, it may still be possible for the system to terminate only the virtual machine or process that was using the corrupted data, while allowing other virtual machines and processes to continue running. In other cases, however, an error that is recoverable in theory cannot actually be recovered from because the ESXi kernel was using the corrupted data, so ESXi still must panic.

Both corrected errors and recoverable errors appear in the vmkernel log and can be decoded using the instructions in this article. If a virtual machine or other process had to be terminated as part of recovery, the details generally are logged as well.

For more information, see:
Intel - Chapters 15 and 16 of the Intel 64 and IA-32 Architectures Software Developer's Manual.
