http://mikemstech.blogspot.com/2011/11/windows-crash-dump-analysis.html
0x00000116 VIDEO_TDR_FAILURE is an interesting blue screen of death (BSOD) because a lot of users encounter it, yet in the vast majority of forum posts (even ones that I've answered previously), neither MVPs (Microsoft Most Valuable Professionals) nor non-MVPs have been able to resolve the issue, or even break from the "update your graphics drivers" or "update your BIOS" mantra. I decided to dig into the Windows Driver Kit to determine what this error is actually saying and to see if there is a better resolution or workaround. Let's start by looking at a dump from May 2010 using the !analyze -v and !sysinfo machineid debugger commands:
0: kd> !analyze -v
*******************************************************************************
*                                                                             *
*                        Bugcheck Analysis                                    *
*                                                                             *
*******************************************************************************

VIDEO_TDR_FAILURE (116)
Attempt to reset the display driver and recover from timeout failed.
Arguments:
Arg1: fffffa8004c50310, Optional pointer to internal TDR recovery context (TDR_RECOVERY_CONTEXT).
Arg2: fffff880101cc360, The pointer into responsible device driver module (e.g. owner tag).
Arg3: 0000000000000000, Optional error code (NTSTATUS) of the last failed operation.
Arg4: 0000000000000002, Optional internal context dependent data.

Debugging Details:
------------------

FAULTING_IP:
nvlddmkm+114360
fffff880`101cc360 803d393cb80000 cmp byte ptr [nvlddmkm+0xc97fa0 (fffff880`10d4ffa0)],0

DEFAULT_BUCKET_ID: GRAPHICS_DRIVER_TDR_FAULT
CUSTOMER_CRASH_COUNT: 1
BUGCHECK_STR: 0x116
PROCESS_NAME: System
CURRENT_IRQL: 0

STACK_TEXT:
... : nt!KeBugCheckEx
... : dxgkrnl!TdrBugcheckOnTimeout+0xec
... : dxgkrnl!TdrIsRecoveryRequired+0x1a2
... : dxgmms1!VidSchiReportHwHang+0x40b
... : dxgmms1!VidSchiCheckHwProgress+0x71
... : dxgmms1!VidSchiWaitForSchedulerEvents+0x1fb
... : dxgmms1!VidSchiScheduleCommandToRun+0x1da
... : dxgmms1!VidSchiWorkerThread+0xba
... : nt!PspSystemThreadStartup+0x5a
... : nt!KxStartSystemThread+0x16

STACK_COMMAND: .bugcheck ; kb

FOLLOWUP_IP:
nvlddmkm+114360
fffff880`101cc360 803d393cb80000 cmp byte ptr [nvlddmkm+0xc97fa0 (fffff880`10d4ffa0)],0

SYMBOL_NAME: nvlddmkm+114360
FOLLOWUP_NAME: MachineOwner
MODULE_NAME: nvlddmkm
IMAGE_NAME: nvlddmkm.sys
DEBUG_FLR_IMAGE_TIMESTAMP: 4baa0110
FAILURE_BUCKET_ID: X64_0x116_IMAGE_nvlddmkm.sys
BUCKET_ID: X64_0x116_IMAGE_nvlddmkm.sys

Followup: MachineOwner
---------
0: kd> !sysinfo machineid
Machine ID Information [From Smbios 2.5, DMIVersion 0, Size=1323]
BiosVendor = Alienware
BiosVersion = A04
BiosReleaseDate = 04/28/2010
SystemManufacturer = Alienware
SystemProductName = M17x
SystemVersion = A0423
BaseBoardManufacturer = Alienware
BaseBoardProduct =
BaseBoardVersion = A04
We can tell one thing: we had a crash (duh...). The BIOS is relatively new (for the time), and the NVIDIA drivers were likely up to date when this system crashed (we can use the lmvm nvlddmkm command to see this):
0: kd> lmvm nvlddmkm
start             end                 module name
fffff880`100b8000 fffff880`10de4d80   nvlddmkm T (no symbols)
    Loaded symbol image file: nvlddmkm.sys
    Image path: \SystemRoot\system32\DRIVERS\nvlddmkm.sys
    Image name: nvlddmkm.sys
    Timestamp: Wed Mar 24 06:09:52 2010 (4BAA0110)
    CheckSum: 00D3D0C3
    ImageSize: 00D2CD80
    Translations: 0000.04b0 0000.04e4 0409.04b0 0409.04e4
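The hex value next to the lmvm Timestamp (the same value as DEBUG_FLR_IMAGE_TIMESTAMP in the !analyze -v output) is the driver image's build time, stored as seconds since 1970-01-01 UTC. If you only have the hex value, a few lines of Python will decode it. This is a minimal sketch; the function name image_timestamp_to_utc is my own and is not part of any debugger API.

```python
from datetime import datetime, timezone


def image_timestamp_to_utc(hex_stamp: str) -> datetime:
    """Decode a PE image timestamp given as hex seconds since 1970-01-01 UTC."""
    return datetime.fromtimestamp(int(hex_stamp, 16), tz=timezone.utc)


# Value taken from the lmvm output above.
built = image_timestamp_to_utc("4BAA0110")
print(built.isoformat())  # 2010-03-24T12:09:52+00:00
```

The debugger prints 06:09:52 rather than 12:09:52, presumably because it shows the time in the local time zone of the machine doing the analysis. Either way, the driver was built about two months before this May 2010 crash.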
So now we need to really dig into the error and figure out what windbg/kd are trying to tell us about what happened. Microsoft's one-liner description is fairly cryptic: "Attempt to reset the display driver and recover from timeout failed." So what does this mean?
One of the complaints with Windows (or really any other operating system) is that the screen freezes from time to time. If the screen freezes for more than a few seconds, users are likely to hard reset the machine that they are working on. This seems natural, but in this case the system is still responsive. The graphics processing unit (GPU) is busy processing something (possibly a game, 3D render, or even Windows Aero) and is not actively refreshing the screen.
In Windows Vista SP1 and Windows Server 2008 SP1, Microsoft introduced a feature called "Timeout Detection and Recovery (TDR)" to help catch and correct this behavior. TDR works to identify whether the graphics processor is hung (the default timeout is 2 seconds), and if it is, it prepares to reset the graphics processor and the relevant part of the graphics stack. During this process, it tells the driver not to access the hardware or memory and gives currently running threads a short time to leave the driver. If the threads do not leave within that time, the system bug checks with 0x116 VIDEO_TDR_FAILURE. The system can also bug check with VIDEO_TDR_FAILURE if a number of TDR events occur in a short period of time (the default is 5 TDRs in 1 minute). If the TDR is successful, the user may see a notification bubble that says "Display driver stopped responding and has recovered."
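To make the decision logic concrete, here is a toy model of the behavior described above, written in Python. It is only a sketch of the rules as I've described them, not how dxgkrnl actually implements TDR. The class and method names are made up for illustration, and I'm assuming the bug check fires once the number of recoveries inside the window exceeds the limit.

```python
from collections import deque


class TdrModel:
    """Toy model of the TDR rules described in the text (not the real dxgkrnl code)."""

    def __init__(self, limit_count=5, limit_time=60.0):
        self.limit_count = limit_count  # recoveries tolerated inside the window
        self.limit_time = limit_time    # window length in seconds
        self._recent = deque()          # times of recent successful recoveries

    def on_timeout(self, now, recovery_succeeded):
        """Return the outcome of one detected GPU timeout at time `now` (seconds)."""
        if not recovery_succeeded:
            # Threads did not leave the driver in time, so the reset failed.
            return "bugcheck 0x116"
        # Forget recoveries that have aged out of the window.
        while self._recent and now - self._recent[0] > self.limit_time:
            self._recent.popleft()
        self._recent.append(now)
        # Assumption: crossing the limit inside the window triggers the bug check.
        if len(self._recent) > self.limit_count:
            return "bugcheck 0x116"
        return "recovered"


model = TdrModel()
for t in (0, 10, 20, 30, 40, 50):
    print(t, model.on_timeout(t, recovery_succeeded=True))
```

In this model, the first five timeouts recover and the sixth one inside the same minute bug checks, which is why a machine can show the "has recovered" bubble several times before it finally blue screens.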
There are some registry values that can be used to control the behavior of TDR and may be useful for additional testing and troubleshooting. These values all live under the key HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers. They may need to be created if they are not there; when a value is missing, the default applies. A full list is available on MSDN, but I will explain a couple that are likely to be the most applicable. The usual warnings apply with editing the registry: be sure that you know what you are doing, and if you get into trouble, either restore from backup or try to back out the change in Safe Mode.
The first value determines whether TDR is enabled or disabled and what it actually does when it detects a timeout:
Value Name | TdrLevel |
Type | REG_DWORD |
Possible Values | 0 - Detection is Disabled; 1 - Bug Check on Timeout; 3 - Recover on Timeout (default) |
This value may be useful to disable TDR. This would be done in the case that the graphics hardware and display adapter simply do not play nicely with TDR and the GPU/driver will recover on their own. Keep in mind that with detection disabled, if the driver/GPU don't recover after a hang, the system will appear to be frozen and will not bug check on its own. This is the main registry value that I think might be helpful, but I'll also mention a couple of others.
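If you'd rather not click through regedit, one option is to generate a small .reg file and import it by double-clicking it. The Python sketch below builds the file contents for a given TdrLevel. The helper name tdr_level_reg_file is hypothetical, and it only accepts the three values listed in the table above.

```python
KEY = r"HKEY_LOCAL_MACHINE\System\CurrentControlSet\Control\GraphicsDrivers"

# Values and meanings as listed in the table above.
TDR_LEVELS = {
    0: "Detection is Disabled",
    1: "Bug Check on Timeout",
    3: "Recover on Timeout (default)",
}


def tdr_level_reg_file(level: int) -> str:
    """Return the text of a .reg file that sets TdrLevel to `level`."""
    if level not in TDR_LEVELS:
        raise ValueError(f"unsupported TdrLevel: {level}")
    return (
        "Windows Registry Editor Version 5.00\n"
        "\n"
        f"[{KEY}]\n"
        f'"TdrLevel"=dword:{level:08x}\n'
    )


# Print a file body that disables detection; save it as e.g. tdr-off.reg.
print(tdr_level_reg_file(0))
```

Generating the file for level 3 gives you a ready-made way to put the default back, which is handy if you have to undo the change from Safe Mode.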
TdrDelay (REG_DWORD, default value = 2) is used to change the timeout period from 2 seconds to a different number of seconds. This would be useful in the instance that the GPU takes 3 seconds to recover (instead of 2).
TdrLimitTime (REG_DWORD, default value = 60) and TdrLimitCount (REG_DWORD, default value = 5) control how many TDRs are allowed (TdrLimitCount) within a given number of seconds (TdrLimitTime) before the system bug checks. The main usefulness here would be if the crash can be tuned out of the system by adjusting these parameters.
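Before changing anything, it helps to see which of these values are actually set on the machine and which are falling back to defaults. Here is a rough Python sketch using the standard winreg module. The function names are my own, the defaults are the ones quoted in this post, and on anything other than Windows the read simply returns nothing.

```python
import sys

SUBKEY = r"System\CurrentControlSet\Control\GraphicsDrivers"

# Defaults as quoted in this post.
TDR_DEFAULTS = {"TdrLevel": 3, "TdrDelay": 2, "TdrLimitTime": 60, "TdrLimitCount": 5}


def read_tdr_overrides():
    """Return the TDR values explicitly present in the registry (empty off Windows)."""
    if sys.platform != "win32":
        return {}
    import winreg

    found = {}
    try:
        with winreg.OpenKey(winreg.HKEY_LOCAL_MACHINE, SUBKEY) as key:
            for name in TDR_DEFAULTS:
                try:
                    value, _value_type = winreg.QueryValueEx(key, name)
                    found[name] = value
                except FileNotFoundError:
                    pass  # value not set, so the default applies
    except OSError:
        pass  # key missing or unreadable; treat as "no overrides"
    return found


def effective_tdr_settings(overrides):
    """Merge explicit registry values over the documented defaults."""
    return {**TDR_DEFAULTS, **overrides}


if __name__ == "__main__":
    for name, value in effective_tdr_settings(read_tdr_overrides()).items():
        print(f"{name} = {value}")
```

For example, effective_tdr_settings({"TdrDelay": 8}) reports a delay of 8 seconds and leaves the other three settings at their defaults.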
Other ideas for troubleshooting:
- Disable any overclocking on the system or graphics processor
- Ensure that your power supply is sufficient to handle the motherboard, processors, video card, and all other devices
- Ensure that all power connections are firmly in place on the motherboard and video card (some video cards require additional power and have specific ports that need to receive direct power from the power supply)
- Verify that the video card is fully inserted and secured
- Verify that no other wires or materials are lying on the video card
- Verify that the system and video card are adequately cooled; overheating graphics cards can cause serious hangs/crashes
- Verify that DirectX and OpenGL are up to date and any graphics intensive applications (such as games) are fully patched
See also:
Windows Crash Dump Analysis
Stress Testing a Video Card
Thanks for the tip. I have a brand new machine with new AMD video and am getting 116 and 117 BSODs occasionally.
I'm going to disagree with you that the blame shouldn't be on Microsoft. There has to be a better way to handle this condition than a BSOD. Great way to lose data!
Unfortunately, the way it works, the two options are for the system to crash or appear to be totally unresponsive...