Friday, July 30, 2010

WinDbg/KD: Debugging a Processor Cache Issue (0x124 WHEA_UNCORRECTABLE_ERROR)

The Debugging Tools for Windows are required to analyze crash dump files. If you do not have the Debugging Tools for Windows installed or dump files are not being generated on system crash, see this post for installation/configuration instructions:
http://mikemstech.blogspot.com/2011/11/windows-crash-dump-analysis.html

I recently helped a user with a stop error involving the processor cache, and I realized that few posts detail the information included in this kind of small memory dump. Professionals who work closely with kernel-level structures, the physical processor, and the chipset know that there are typically two or three caches (referred to as the L1, L2, and L3 caches) as well as the TLB (translation lookaside buffer).

When one of these caches fails, the likely result is a stop error (sometimes called a bugcheck, after KeBugCheckEx, the function that generates the dump and safely brings down the system) and a memory dump in C:\Windows\Minidump. These files end in .dmp and can be read with several utilities. I use the Debugging Tools for Windows to view them. Note that after installing the Debugging Tools for Windows, it may be necessary to configure symbols for the debuggers. In WinDbg this is done from the File -> Symbol File Path menu item. As the linked article describes, the Microsoft symbol server provides all of the necessary symbols for the OS, and the .pdb files generated for custom projects provide the symbols needed to debug custom applications.

Once the initial setup is complete, start WinDbg from the Start menu. Load a dump file by pressing Ctrl+D or by choosing "Open Crash Dump" from the File menu.

Since cache failures are usually detected as hardware errors, 0x00000124 (WHEA_UNCORRECTABLE_ERROR) is the stop code displayed when the system crashes and the small memory dump is created. This stop code only appears on Windows Vista and later (Windows Vista, Windows Server 2008, Windows 7, and Windows Server 2008 R2). Older Windows versions (Windows XP, Windows Server 2003) crash with 0x9C (MACHINE_CHECK_EXCEPTION) instead.

The usual way to start debugging a crash dump is with the !analyze -v command. This displays the process that was running when the fault occurred, the stack trace leading up to the crash, and key information about the error. When I look at these types of errors, I also use the !cpuinfo command to get information about the processor(s) involved in the crash.
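For anyone who has several dumps to go through, the same commands can be run without the GUI. The following is only a sketch: it assumes kd.exe from the Debugging Tools for Windows is reachable under the name "kd.exe" (the actual install path varies), and the helper name build_kd_command is made up for this example. It relies on the debugger's -z (dump file), -y (symbol path), and -c (initial command) options, and reuses the dump path and symbol path shown in the output further down.

```python
def build_kd_command(kd_path, dump_path, symbol_path, commands):
    """Build an argument list that opens a dump in kd and runs some commands.

    kd_path is an assumption: point it at wherever kd.exe is installed.
    A trailing "q" is appended so the debugger exits when it is finished.
    """
    script = "; ".join(list(commands) + ["q"])
    return [kd_path, "-z", dump_path, "-y", symbol_path, "-c", script]


command = build_kd_command(
    "kd.exe",  # hypothetical location; adjust for your installation
    r"C:\Users\Administrator\Documents\Dumps\072910-21078-01\072910-21078-01.dmp",
    r"SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols",
    ["!analyze -v", "!cpuinfo"],
)

# On a machine with the debuggers installed, this list could be passed to
# subprocess.run(command, capture_output=True, text=True) to collect the output.
print(command)
```

Building the argument list separately from running it makes it easy to check the command before pointing it at a folder full of dumps.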

The !cpuinfo extension command can help identify the failing processor on a multicore/multi-CPU system, but interpreting the output correctly depends on vendor documentation and on how the kernel interacts with the hardware. The main value of the command is that someone reading the dump can use the information to identify the processor and propose updated drivers to try before replacing the CPU. F/M/S is the Family/Model/Stepping information for the processor, which can usually be used to identify the processor in use. In this case, it is a Family 15 Model 107 Stepping 2 64-bit processor manufactured by AMD (likely the AMD Athlon Dual Core Processor 5050e).
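The same Family/Model/Stepping values can be cross-checked against the CPU Version field (0x60fb2) that !errrec reports in Section 0 of the output further down, since that value is the processor signature returned by CPUID. The sketch below decodes it using the standard CPUID layout as I understand it: stepping in bits 3:0, base model in bits 7:4, base family in bits 11:8, extended model in bits 19:16, and extended family in bits 27:20. The function name is my own, and the rule for when the extended model applies (base family 6 or 15) is a simplification that covers both vendors.

```python
def decode_cpuid_signature(eax):
    """Split a CPUID processor signature into (family, model, stepping)."""
    stepping = eax & 0xF
    base_model = (eax >> 4) & 0xF
    base_family = (eax >> 8) & 0xF
    ext_model = (eax >> 16) & 0xF
    ext_family = (eax >> 20) & 0xFF

    # The extended family is only added when the base family is 0xF.
    family = base_family + ext_family if base_family == 0xF else base_family

    # The extended model supplies the upper four bits of the model number
    # for base family 6 (Intel) and base family 15 (Intel and AMD).
    if base_family in (0x6, 0xF):
        model = (ext_model << 4) | base_model
    else:
        model = base_model
    return family, model, stepping


# CPU Version from Section 0 of the !errrec output
print(decode_cpuid_signature(0x00060FB2))  # (15, 107, 2)
```

The result matches the 15,107,2 shown by !cpuinfo, which is a quick sanity check that the error record and the processor listing describe the same part.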

Once the crash has been identified as a WHEA_UNCORRECTABLE_ERROR, the !analyze -v output shows that Arg2 is a pointer to the WHEA_ERROR_RECORD structure describing the nature of the error. This record can be analyzed further with the !errrec address command, where address is the value of Arg2. Section 0 of the output shows that this was a failure during a data read from the L1 processor cache.
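Arg3 and Arg4 are also worth a look, because together they form the 64-bit MCi_STATUS value that !errrec prints in the MCA section. The sketch below joins the two halves and decodes the parts I am reasonably sure of: the architectural flag bits (VAL 63, OVER 62, UC 61, EN 60, MISCV 59, ADDRV 58, PCC 57) and the low 16-bit MCA error code, assuming the memory-hierarchy format 0000 0001 RRRR TTLL. The tables and the function name are mine, the remaining bits are model-specific and are left undecoded, and the vendor documentation remains the authority.

```python
FLAG_BITS = {"VAL": 63, "OVER": 62, "UC": 61, "EN": 60,
             "MISCV": 59, "ADDRV": 58, "PCC": 57}
REQUEST = {0: "ERR", 1: "RD", 2: "WR", 3: "DRD", 4: "DWR",
           5: "IRD", 6: "PREFETCH", 7: "EVICT", 8: "SNOOP"}
TRANSACTION = {0: "Instruction", 1: "Data", 2: "Generic"}
LEVEL = {0: "L0", 1: "L1", 2: "L2", 3: "Generic"}


def decode_mci_status(high32, low32):
    """Combine Arg3/Arg4 of bugcheck 0x124 and decode the common fields."""
    status = (high32 << 32) | low32
    flags = [name for name, bit in FLAG_BITS.items() if (status >> bit) & 1]
    code = status & 0xFFFF
    result = {"status": status, "flags": flags, "mca_code": code}

    # Assumed memory-hierarchy (cache) format: 0000 0001 RRRR TTLL
    if (code >> 8) == 0x01:
        result["request"] = REQUEST.get((code >> 4) & 0xF, "unknown")
        result["transaction"] = TRANSACTION.get((code >> 2) & 0x3, "unknown")
        result["level"] = LEVEL[code & 0x3]
    return result


# Arg3 and Arg4 from the !analyze -v output
info = decode_mci_status(0xB6204000, 0x00000135)
print(hex(info["status"]))  # 0xb620400000000135
print(info["flags"])        # ['VAL', 'UC', 'EN', 'ADDRV', 'PCC']
print(info["request"], info["transaction"], info["level"])  # DRD Data L1
```

For this dump the decoded fields agree with what the debugger reports: a data read (DRD) on the L1 data cache, uncorrected (UC), with a valid address (ADDRV) and no valid Misc register (MISCV clear).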

This error does not always indicate a failed processor; it can also be caused by problems in the BIOS. Before sending the CPU back to the manufacturer or purchasing a replacement, always ensure that all of the system drivers and the BIOS are up to date. You should also stress test the CPU to help determine whether a hardware issue exists. For more information, see this post.

Loading Dump File [C:\Users\Administrator\Documents\Dumps\072910-21078-01\072910-21078-01.dmp]
Mini Kernel Dump File: Only registers and stack trace are available

Symbol search path is: SRV*C:\Symbols*http://msdl.microsoft.com/download/symbols
Executable search path is: 
Windows 7 Kernel Version 7600 MP (2 procs) Free x64
Product: WinNt, suite: TerminalServer SingleUserTS Personal
Built by: 7600.16539.amd64fre.win7_gdr.100226-1909
Machine Name:
Kernel base = 0xfffff800`02a13000 PsLoadedModuleList = 0xfffff800`02c50e50
Debug session time: Thu Jul 29 17:38:35.915 2010 (UTC - 6:00)
System Uptime: 0 days 20:28:58.649
Loading Kernel Symbols
...............................................................
................................................................
..................
Loading User Symbols
Loading unloaded module list
.....
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

Use !analyze -v to get detailed debugging information.

BugCheck 124, {0, fffffa8004b0f038, b6204000, 135}

Probably caused by : hardware

Followup: MachineOwner
---------

0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

WHEA_UNCORRECTABLE_ERROR (124)
A fatal hardware error has occurred. Parameter 1 identifies the type of error
source that reported the error. Parameter 2 holds the address of the
WHEA_ERROR_RECORD structure that describes the error conditon.
Arguments:
Arg1: 0000000000000000, Machine Check Exception
Arg2: fffffa8004b0f038, Address of the WHEA_ERROR_RECORD structure.
Arg3: 00000000b6204000, High order 32-bits of the MCi_STATUS value.
Arg4: 0000000000000135, Low order 32-bits of the MCi_STATUS value.

Debugging Details:
------------------


BUGCHECK_STR:  0x124_AuthenticAMD

CUSTOMER_CRASH_COUNT:  1

DEFAULT_BUCKET_ID:  VISTA_DRIVER_FAULT

PROCESS_NAME:  Wow.exe

CURRENT_IRQL:  f

STACK_TEXT:  
... : nt!KeBugCheckEx
... : hal!HalBugCheckSystem+0x1e3
... : nt!WheaReportHwError+0x263
... : hal!HalpMcaReportError+0x4c
... : hal!HalpMceHandler+0x9e
... : hal!HalHandleMcheck+0x47
... : nt!KxMcheckAbort+0x6c
... : nt!KiMcheckAbort+0x153
... : 0x698d668e


STACK_COMMAND:  kb

FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: hardware

IMAGE_NAME:  hardware

DEBUG_FLR_IMAGE_TIMESTAMP:  0

FAILURE_BUCKET_ID:  X64_0x124_AuthenticAMD_PROCESSOR_CACHE

BUCKET_ID:  X64_0x124_AuthenticAMD_PROCESSOR_CACHE

Followup: MachineOwner
---------

0: kd> !cpuinfo
CP  F/M/S Manufacturer  MHz PRCB Signature    MSR 8B Signature Features
 0 15,107,2 AuthenticAMD 3114 0000000000000000                   203b7dfe

0: kd> !errrec fffffa8004b0f038
===============================================================================
Common Platform Error Record @ fffffa8004b0f038
-------------------------------------------------------------------------------
Record Id     : 01cb2ecb7afdac79
Severity      : Fatal (1)
Length        : 928
Creator       : Microsoft
Notify Type   : Machine Check Exception
Timestamp     : 7/29/2010 23:38:35
Flags         : 0x00000000

===============================================================================
Section 0     : Processor Generic
-------------------------------------------------------------------------------
Descriptor    @ fffffa8004b0f0b8
Section       @ fffffa8004b0f190
Offset        : 344
Length        : 192
Flags         : 0x00000001 Primary
Severity      : Fatal

Proc. Type    : x86/x64
Instr. Set    : x64
Error Type    : Cache error
Operation     : Data Read
Flags         : 0x00
Level         : 1
CPU Version   : 0x0000000000060fb2
Processor ID  : 0x0000000000000000

===============================================================================
Section 1     : x86/x64 Processor Specific
-------------------------------------------------------------------------------
Descriptor    @ fffffa8004b0f100
Section       @ fffffa8004b0f250
Offset        : 536
Length        : 128
Flags         : 0x00000000
Severity      : Fatal

Local APIC Id : 0x0000000000000000
CPU Id        : b2 0f 06 00 00 08 02 00 - 01 20 00 00 ff fb 8b 17
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00
                00 00 00 00 00 00 00 00 - 00 00 00 00 00 00 00 00

Proc. Info 0  @ fffffa8004b0f250

===============================================================================
Section 2     : x86/x64 MCA
-------------------------------------------------------------------------------
Descriptor    @ fffffa8004b0f148
Section       @ fffffa8004b0f2d0
Offset        : 664
Length        : 264
Flags         : 0x00000000
Severity      : Fatal

Error         : DCACHEL1_DRD_ERR (Proc 0 Bank 0)
  Status      : 0xb620400000000135
  Address     : 0x0000000063c20ef0
  Misc.       : 0x0000000000000000

1 comment:

  1. I also came across this last June, and you are right that there were not many posts detailing the information that was needed. I appreciate you sharing it, as it might help others who need it.