Monday, January 2, 2012

Troubleshooting 0x101 CLOCK_WATCHDOG_TIMEOUT


The Debugging Tools for Windows are required to analyze crash dump files. If you do not have the Debugging Tools for Windows installed or dump files are not being generated on system crash, see this post for installation/configuration instructions:

0x00000101 CLOCK_WATCHDOG_TIMEOUT belongs to a class of errors that are considered 'hardware' errors on the Windows platform (Windows XP, Windows Server 2003, Windows Vista, Windows Server 2008, Windows 7, Windows Server 2008 R2, and Windows 8). These errors typically indicate a hardware failure or an impending hardware failure. In this case, this error indicates that a problem occurred with the interprocessor interrupt handling that is required for symmetric multiprocessing (SMP) systems.

These interrupts occur for a variety of reasons, but one of the most common is translation lookaside buffer (TLB) invalidation, which keeps each processor's cached address translations consistent when multiple processors are performing operations on the same memory segments. If this fails, then the processors' views of memory become inconsistent and corruption is likely. Interprocessor operations involving interrupts like TLB invalidations are considered critical for the correct functioning of the system, so a delay or failure results in an exception or a timeout (timeouts usually result in a CLOCK_WATCHDOG_TIMEOUT bug check).

A dump usually reveals little, because most of the time the analysis is inconclusive as to whether hardware or software/firmware (system drivers, BIOS, etc.) caused the issue. A sample dump output is below. Whether the problem is isolated to one processor can be determined by comparing dumps using the !cpuinfo command, and the processor control block (PRCB) can be displayed using the !prcb <processor_number> debugger command.

0: kd> !analyze -v
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *

An expected clock interrupt was not received on a secondary processor in an
MP system within the allocated interval. This indicates that the specified
processor is hung and not processing interrupts.
Arg1: 0000000000000031, Clock interrupt time out interval in nominal clock ticks.
Arg2: 0000000000000000, 0.
Arg3: fffff88003164180, The PRCB address of the hung processor.
Arg4: 0000000000000002, 0.

Debugging Details:




PROCESS_NAME:  ccsvchst.exe


... : nt!KeBugCheckEx
... : nt! ?? ::FNODOBFM::`string'+0x4e2e
... : nt!KeUpdateSystemTime+0x377
... : hal!HalpHpetClockInterrupt+0x8d
... : nt!KiInterruptDispatchNoLock+0x163
... : nt!KeFlushMultipleRangeTb+0x260
... : nt!MiFlushTbAsNeeded+0x1d1
... : nt!MiAllocatePoolPages+0x4de
... : nt!ExpAllocateBigPool+0xb0
... : nt!ExAllocatePoolWithTag+0x82e
... : nt!ExAllocatePoolWithQuotaTag+0x56
... : nt!IopXxxControlFile+0xb1b
... : nt!NtDeviceIoControlFile+0x56
... : nt!KiSystemServiceCopyEnd+0x13
... : 0x73692e09



FOLLOWUP_NAME:  MachineOwner

MODULE_NAME: Unknown_Module

IMAGE_NAME:  Unknown_Image




Followup: MachineOwner

0: kd> !cpuinfo
CP  F/M/S Manufacturer  MHz PRCB Signature    MSR 8B Signature Features
 0 18,1,0 AuthenticAMD 1397 0000000000000000                   203b7dfe
0: kd> !prcb 0
PRCB for Processor 0 at fffff780ffff0000:
Current IRQL -- 13
Threads--  Current fffffa8009fa5aa0 Next fffffa8007960a10 Idle fffff80003217cc0
Processor Index 0 Number (0, 0) GroupSetMember 1
Interrupt Count -- 008b247c
Times -- Dpc    0000009d Interrupt 00000047 
         Kernel 0007cb5a User      00005bbf  

First, remove any overclocking or nonstandard timing on the system (this almost always causes more problems than any resulting performance gain is worth). Also check the system for thermal issues.

If this error is due to software (typically a deadlock), then the system drivers and BIOS need to be updated to the latest versions. This is a sensible first troubleshooting step before assuming a hardware fault.

If this error is due to failing hardware (the processor is typically unresponsive), then a stress test may serve to confirm the error or trigger a definitive hardware crash (like 0x124 WHEA_UNCORRECTABLE_ERROR). Diagnostics should be performed to identify whether the processors or the motherboard are failing, and appropriate corrective/replacement actions should be taken.

See Also
Windows Crash Dump Analysis
Stress Testing a CPU To Detect Hardware Failure
