Sunday, January 8, 2012

Troubleshooting 0x9C MACHINE_CHECK_EXCEPTION

The Debugging Tools for Windows are required to analyze crash dump files. If you do not have the Debugging Tools for Windows installed or dump files are not being generated on system crash, see this post for installation/configuration instructions:

http://mikemstech.blogspot.com/2011/11/windows-crash-dump-analysis.html

0x0000009C MACHINE_CHECK_EXCEPTION is an error that primarily occurs on older versions of the Windows platform (Windows XP, Windows Server 2003, and earlier). On newer versions (Windows Vista, Windows Server 2008, Windows 7, Windows Server 2008 R2, and Windows 8) it has been replaced by 0x00000124 WHEA_UNCORRECTABLE_ERROR, but it still appears in a couple of rare cases: when WHEA is not fully initialized, or when a particular processor issue occurs on SMP systems that is characterized by a failure of shared memory synchronization. Microsoft phrases the latter as "All processors that rendezvous have no errors in their registers".

When it appears, this is usually a serious hardware error (typically involving the motherboard or the processor), but it can also occur when the system's BIOS is out of date relative to the drivers running on the system.

Debugging a dump with windbg/kd yields some interesting information:

kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

MACHINE_CHECK_EXCEPTION (9c)
A fatal Machine Check Exception has occurred.
KeBugCheckEx parameters;
    x86 Processors
        If the processor has ONLY MCE feature available (For example Intel
        Pentium), the parameters are:
        1 - Low  32 bits of P5_MC_TYPE MSR
        2 - Address of MCA_EXCEPTION structure
        3 - High 32 bits of P5_MC_ADDR MSR
        4 - Low  32 bits of P5_MC_ADDR MSR
        If the processor also has MCA feature available (For example Intel
        Pentium Pro), the parameters are:
        1 - Bank number
        2 - Address of MCA_EXCEPTION structure
        3 - High 32 bits of MCi_STATUS MSR for the MCA bank that had the error
        4 - Low  32 bits of MCi_STATUS MSR for the MCA bank that had the error
    IA64 Processors
        1 - Bugcheck Type
            1 - MCA_ASSERT
            2 - MCA_GET_STATEINFO
                SAL returned an error for SAL_GET_STATEINFO while processing MCA.
            3 - MCA_CLEAR_STATEINFO
                SAL returned an error for SAL_CLEAR_STATEINFO while processing MCA.
            4 - MCA_FATAL
                FW reported a fatal MCA.
            5 - MCA_NONFATAL
                SAL reported a recoverable MCA and we don't support currently
                support recovery or SAL generated an MCA and then couldn't
                produce an error record.
            0xB - INIT_ASSERT
            0xC - INIT_GET_STATEINFO
                  SAL returned an error for SAL_GET_STATEINFO while processing 
                  INIT event.
            0xD - INIT_CLEAR_STATEINFO
                  SAL returned an error for SAL_CLEAR_STATEINFO while processing 
                  INIT event.
            0xE - INIT_FATAL
                  Not used.
        2 - Address of log
        3 - Size of log
        4 - Error code in the case of x_GET_STATEINFO or x_CLEAR_STATEINFO
    AMD64 Processors
        1 - Bank number
        2 - Address of MCA_EXCEPTION structure
        3 - High 32 bits of MCi_STATUS MSR for the MCA bank that had the error
        4 - Low  32 bits of MCi_STATUS MSR for the MCA bank that had the error
Arguments:
Arg1: 00000000
Arg2: 8054e170
Arg3: b2000000
Arg4: 1040080f

Debugging Details:
------------------

   NOTE:  This is a hardware error.  This error was reported by the CPU
   via Interrupt 18.  This analysis will provide more information about
   the specific error.  Please contact the manufacturer for additional
   information about this error and troubleshooting assistance.

   This error is documented in the following publication:

      - IA-32 Intel(r) Architecture Software Developer's Manual 
        Volume 3: System Programming Guide

   Bit Mask:

       MA                           Model Specific       MCA
    O  ID      Other Information      Error Code     Error Code
   VV  SDP ___________|____________ _______|_______ _______|______
   AEUECRC|                        |               |              |
   LRCNVVC|                        |               |              |
   ^^^^^^^|                        |               |              |
      6         5         4         3         2         1
   3210987654321098765432109876543210987654321098765432109876543210
   ----------------------------------------------------------------
   1011001000000000000000000000000000010000010000000000100000001111


VAL   - MCi_STATUS register is valid
        Indicates that the information contained within the IA32_MCi_STATUS
        register is valid.  When this flag is set, the processor follows the
        rules given for the OVER flag in the IA32_MCi_STATUS register when
        overwriting previously valid entries.  The processor sets the VAL 
        flag and software is responsible for clearing it.

UC    - Error Uncorrected
        Indicates that the processor did not or was not able to correct the 
        error condition.  When clear, this flag indicates that the processor
        was able to correct the error condition.

EN    - Error Enabled
        Indicates that the error was enabled by the associated EEj bit of the
        IA32_MCi_CTL register.

PCC   - Processor Context Corrupt
        Indicates that the state of the processor might have been corrupted
        by the error condition detected and that reliable restarting of the
        processor may not be possible.

BUSCONNERR - Bus and Interconnect Error   BUS{LL}_{PP}_{RRRR}_{II}_{T}_err
        These errors match the format 0000 1PPT RRRR IILL



   Concatenated Error Code:
   --------------------------
   _VAL_UC_EN_PCC_BUSCONNERR_F

   This error code can be reported back to the manufacturer.
   They may be able to provide additional information based upon
   this error.  All questions regarding STOP 0x9C should be
   directed to the hardware manufacturer.

BUGCHECK_STR:  0x9C_GenuineIntel

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  DRIVER_FAULT

LAST_CONTROL_TRANSFER:  from 806f48db to 80533846

SYMBOL_ON_RAW_STACK:  1

STACK_ADDR_RAW_STACK_SYMBOL: ffffffff8054e1dc

STACK_COMMAND:  dds FFFFFFFF8054E1DC-0x20 ; kb

STACK_TEXT:  
8054e1bc  00000000
8054e1c0  00000000
8054e1c4  00000000
8054e1c8  00000000
8054e1cc  00000000
8054e1d0  ffdffc50
8054e1d4  00000000
8054e1d8  ba4e9162 intelppm+0x2162
8054e1dc  00000000
8054e1e0  80550f40 nt!KiDoubleFaultStack+0x2cc0
8054e1e4  00000000
8054e1e8  80550f38 nt!KiDoubleFaultStack+0x2cb8
8054e1ec  00000000
8054e1f0  00000046
8054e1f4  00000000
8054e1f8  806efe18 hal!HalpClockInterrupt+0xe4
8054e1fc  00000000
8054e200  00000000
8054e204  00000000
8054e208  00321213
8054e20c  00000000
8054e210  00000000
8054e214  00000000
8054e218  00000000
8054e21c  00000000
8054e220  00000000
8054e224  00000000
8054e228  00000000
8054e22c  00000000
8054e230  00000000
8054e234  00000000
8054e238  00000000


FOLLOWUP_IP: 
intelppm+2162
ba4e9162 ??              ???

SYMBOL_NAME:  intelppm+2162

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: intelppm

IMAGE_NAME:  intelppm.sys

DEBUG_FLR_IMAGE_TIMESTAMP:  48025183

FAILURE_BUCKET_ID:  0x9C_GenuineIntel_intelppm+2162

BUCKET_ID:  0x9C_GenuineIntel_intelppm+2162

Followup: MachineOwner
---------

kd> lmvm intelppm
start    end        module name
ba4e7000 ba4efe00   intelppm T (no symbols)           
    Loaded symbol image file: intelppm.sys
    Image path: \SystemRoot\system32\DRIVERS\intelppm.sys
    Image name: intelppm.sys
    Timestamp:        Sun Apr 13 12:31:31 2008 (48025183)
    CheckSum:         0000C894
    ImageSize:        00008E00
    Translations:     0000.04b0 0000.04e4 0409.04b0 0409.04e4 
 
From this particular dump, we can tell that there was an error involving the bus. Analyzing Machine Check Architecture error codes involves looking at Volume 3 of the Intel Architecture Software Developer's Manual (or the appropriate developer's guide for other vendors, such as the BIOS and Kernel Developer's Guides for AMD processors). Arg3 and Arg4 are the high and low 32 bits of the MCi_STATUS register, so the full value here is 0xB20000001040080F. The low 16 bits (0x080F, the MCA error code) match the bus and interconnect format 0000 1PPT RRRR IILL, which is why the debugger reports BUSCONNERR. This is an Intel chip (BUGCHECK_STR is 0x9C_GenuineIntel), and in this case bits 19-24, which fall within the model specific error code, are 001000. This indicates an error of type BQ_DCU_WB_TYPE. Further research indicates that this is a failure of the processor to write a line back to memory. Still, this does not tell us whether the processor or the motherboard failed; determining that would require running vendor-supplied diagnostic tools for the system or examining repeated errors.
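
The bit arithmetic described above can be sketched in a few lines of Python. This is only an illustration, not part of the debugger: the function name decode_mci_status and the dictionary keys are made up for this example, and the bit positions are the ones shown in the bit mask that !analyze -v prints (VAL at bit 63 down to PCC at bit 57, MCA error code in bits 0-15). It does not look up the meaning of the model specific bits; that still requires the vendor's manual.

```python
def decode_mci_status(arg3_high: int, arg4_low: int) -> dict:
    """Rebuild MCi_STATUS from bugcheck Arg3/Arg4 and split out its fields.

    Hypothetical helper for illustration; the names are not from any library.
    """
    status = (arg3_high << 32) | arg4_low
    mca_code = status & 0xFFFF  # bits 15:0, the MCA error code

    fields = {
        "status": status,
        "VAL": bool((status >> 63) & 1),
        "OVER": bool((status >> 62) & 1),
        "UC": bool((status >> 61) & 1),
        "EN": bool((status >> 60) & 1),
        "MISCV": bool((status >> 59) & 1),
        "ADDRV": bool((status >> 58) & 1),
        "PCC": bool((status >> 57) & 1),
        "model_specific_code": (status >> 16) & 0xFFFF,  # bits 31:16
        "mca_code": mca_code,
        # Bits 24:19 as a 6-character binary string, as quoted in the text.
        "bits_24_19": format((status >> 19) & 0x3F, "06b"),
    }

    # Bus and interconnect errors match 0000 1PPT RRRR IILL:
    # bits 15:12 are clear and bit 11 is set.
    if mca_code & 0xF800 == 0x0800:
        fields["bus"] = {
            "PP": (mca_code >> 9) & 0x3,
            "T": (mca_code >> 8) & 0x1,
            "RRRR": (mca_code >> 4) & 0xF,
            "II": (mca_code >> 2) & 0x3,
            "LL": mca_code & 0x3,
        }
    return fields


if __name__ == "__main__":
    # Arg3 and Arg4 from the dump in this post.
    decoded = decode_mci_status(0xB2000000, 0x1040080F)
    print(hex(decoded["status"]))
    print(decoded["bits_24_19"])
```

With the arguments from this dump, the flags that come back set are VAL, UC, EN and PCC, which agrees with the concatenated error code the debugger printed.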

Some troubleshooting steps for this error might include:
  • Verify that the CPU is specifically supported for the installed motherboard
  • Verify that the BIOS is up to date for the system
  • Disable any overclocking (or other abnormal timing/voltage modifications)
  • Reseat the processor, memory, and all power connections to the motherboard and connected components
  • Identify and resolve any cooling or power supply related issues (including abnormal voltage from a wall outlet)
  • Engage the vendor and replace the motherboard or CPU

Further isolation of this error involves looking at repeated crash dumps for a pattern and decoding the MCE error to see if one particular operation repeatedly fails. Engaging the motherboard vendor to assist with troubleshooting would also be helpful, as the processor or motherboard likely needs to be replaced. A stress test may be able to trigger the error under load.
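
One way to look for a pattern across several dumps is to note Arg1, Arg3 and Arg4 from each !analyze -v run and count how often the same combination recurs. The sketch below assumes you have copied those values out by hand; tally_signatures is a made-up name, and only the first entry in the sample list comes from the dump in this post (the others are placeholders to show the shape of the input).

```python
from collections import Counter


def tally_signatures(crashes):
    """Count recurring (bank, MCA error code, bits 24:19) combinations.

    `crashes` is a list of (arg1_bank, arg3_high, arg4_low) tuples.
    Hypothetical helper for illustration only.
    """
    counts = Counter()
    for bank, arg3_high, arg4_low in crashes:
        status = (arg3_high << 32) | arg4_low
        signature = (bank, status & 0xFFFF, (status >> 19) & 0x3F)
        counts[signature] += 1
    # Most frequent signature first.
    return counts.most_common()


crashes = [
    (0, 0xB2000000, 0x1040080F),  # from the dump in this post
    (0, 0xB2000000, 0x1040080F),  # placeholder: same failure seen again
    (0, 0xB2000000, 0x1000080F),  # placeholder: different bits 24:19
]

if __name__ == "__main__":
    for (bank, mca_code, req_bits), count in tally_signatures(crashes):
        print(f"bank {bank}  MCA code {mca_code:#06x}  "
              f"bits 24-19 {req_bits:06b}  seen {count}x")
```

If one signature dominates, that is the operation to describe to the vendor; a spread of unrelated signatures points more toward a general problem such as power or cooling.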

See Also
Windows Crash Dump Analysis
Stress Testing a CPU to Detect Hardware Failure
0x124 WHEA_UNCORRECTABLE_ERROR
