Sunday, February 5, 2012

Identifying System Cooling Issues

Even the best of us have computer problems from time to time. Mine resulted from the combination of a lazy system setup (I put too few fans in my case) and a dead fan. Since I write a lot of posts on a number of different topics, I build new virtual machines on a weekly basis to demonstrate different features and application configurations in Windows and Linux. This time, I was building a couple of MIT Kerberos servers to demonstrate how to easily apply password policies using MIT Kerberos 5 and how to build an older version and a newer version of Kerberos 5 on a newer Fedora build (Fedora 15) and on Ubuntu 11.10.

I ran into a peculiar problem while I was working on this. I would start two Linux virtual machines that I built on Hyper-V, do some work, then go to bed. When I woke up in the morning, my custom-built server would be shut down. The first few times I wrote it off as issues potentially caused by the weather (various wind storms and snow storms here in Boulder) or by the low quality of our power infrastructure (since we have Xcel Energy, our power is about as reliable as my Comcast Internet connection [in the industry we call this 1.5 nines of uptime]). After the first couple of times, I started to wonder, because none of our other household appliances (microwave, stove, etc.) were showing signs of a power failure. Since it was happening more frequently than normal, I started to suspect a problem with my server... so I started troubleshooting...

Nothing really obvious jumped out from the logs, just the Kernel-Power message with event ID 41 (in this case with no BugCheck information and no dump files, so probably not a blue screen). This really only indicates that the system went down without shutting down cleanly (possibly due to a failing power supply, an overheating system, or some other power fluctuation or issue).

Log Name:      System
Source:        Microsoft-Windows-Kernel-Power
Date:          2/1/2012 6:44:49 AM
Event ID:      41
Task Category: (63)
Level:         Critical
Keywords:      (2)
User:          SYSTEM
Computer:      WIN-BB9Q000LTK1
Description:
The system has rebooted without cleanly shutting down first. 
This error could be caused if the system stopped responding, crashed, or 
lost power unexpectedly.
Event Xml:
<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <Provider Name="Microsoft-Windows-Kernel-Power" 
        Guid="{331C3B3A-2005-44C2-AC5E-77220C37D6B4}" />
    <EventID>41</EventID>
    <Version>2</Version>
    <Level>1</Level>
    <Task>63</Task>
    <Opcode>0</Opcode>
    <Keywords>0x8000000000000002</Keywords>
    <TimeCreated SystemTime="2012-02-01T13:44:49.171875000Z" />
    <EventRecordID>77389</EventRecordID>
    <Correlation />
    <Execution ProcessID="4" ThreadID="8" />
    <Channel>System</Channel>
    <Computer>WIN-BB9Q000LTK1</Computer>
    <Security UserID="S-1-5-18" />
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="BugcheckParameter1">0x0</Data>
    <Data Name="BugcheckParameter2">0x0</Data>
    <Data Name="BugcheckParameter3">0x0</Data>
    <Data Name="BugcheckParameter4">0x0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>  
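If you want to check these fields across a pile of exported events rather than by eye, here is a minimal Python sketch. It assumes you have the event XML as text (the sample below is a trimmed copy of the record above), and the function name `summarize_kernel_power_event` is just something I made up for illustration. Treating a nonzero BugcheckCode as a sign of a blue screen is the same rough heuristic I used when reading the event, not an official rule.

```python
import xml.etree.ElementTree as ET

# Namespace used by the event XML, copied from the record above.
NS = {"e": "http://schemas.microsoft.com/win/2004/08/events/event"}

# Trimmed copy of the Event ID 41 record shown above.
EVENT_XML = """<Event xmlns="http://schemas.microsoft.com/win/2004/08/events/event">
  <System>
    <EventID>41</EventID>
  </System>
  <EventData>
    <Data Name="BugcheckCode">0</Data>
    <Data Name="SleepInProgress">false</Data>
    <Data Name="PowerButtonTimestamp">0</Data>
  </EventData>
</Event>"""


def summarize_kernel_power_event(xml_text):
    """Pull out the fields that hint at why the system went down.

    Hypothetical helper: a nonzero BugcheckCode is treated as a hint that a
    stop error (blue screen) was recorded; zero means the log cannot tell us.
    """
    root = ET.fromstring(xml_text)
    event_id = int(root.findtext("e:System/e:EventID", namespaces=NS))

    # Collect every <Data Name="...">value</Data> pair into a dict.
    data = {
        item.get("Name"): (item.text or "")
        for item in root.findall("e:EventData/e:Data", NS)
    }

    # Base 0 lets int() accept both "0" and "0x0" style values.
    bugcheck_code = int(data.get("BugcheckCode", "0"), 0)
    return {
        "event_id": event_id,
        "bugcheck_code": bugcheck_code,
        "sleep_in_progress": data.get("SleepInProgress", "").lower() == "true",
        "bugcheck_recorded": bugcheck_code != 0,
    }


if __name__ == "__main__":
    print(summarize_kernel_power_event(EVENT_XML))
```

For the event above, no bugcheck is recorded and the system was not going to sleep, which is why I kept looking at power and heat instead of drivers.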

I remembered back to a time when I had a really crappy laptop built by a company called IBuyPower (I wrote this company a BBB complaint, FTC complaint, and nearly filed a lawsuit due to their crappy system). This laptop would constantly overheat and shut itself off and literally spent more time in transit and in RMA than I had it... but that's in the past now...

I had never had a thermal issue with the custom-built server before, so I thought it was a long shot. I downloaded the Open Hardware Monitor and was shocked to see some of the numbers that came up when I was running the virtualization load. One of the processors was hot enough to boil water (100 degrees Celsius).



I immediately killed the virtualization load and shut the server down until I could investigate the cause of the issue, since I was close to damaging the system. That night I identified that one of the CPU fans had died and needed to be replaced. Since I hadn't done a good job with cooling the case before, I decided to replace all of the fans in the case (and get a few additional ones for spots in the case that had room for a fan but none installed at the time). Since I was more concerned about cooling than noise, I went for the fans on NewEgg that moved the most air (3x 80 mm, 1x 120 mm, and 2x 92 mm). I also bought a fan controller, since most of the reviews for the fans that I bought stated that they were too loud for a household environment.

After waiting for the next-day shipping, I tackled the adventure of splicing 3-pin fan connectors onto the wires of the fans that came with a 4-pin Molex connector (since I couldn't find the right adapter on the Internet and the fan controller only had 3-pin connections). Under moderate virtualization load, I was able to reduce the maximum temperature by 40 degrees into a far more acceptable range (and no further failures yet).



The moral of the story is that thermal issues can easily sneak up on you and are often overlooked as a potential cause of issues involving unexpected shutdowns and blue screens of death. When troubleshooting these issues, be sure to investigate thermal issues before blindly replacing components. Examine the manufacturer's documentation for recommended temperature ranges.

Ensure that the changes that you make have a measurable impact on the temperature of the system (as they did in my case... no pun intended).
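To make "measurable impact" concrete, here is a small Python sketch that compares peak temperatures from a run before and a run after a cooling change. Everything in it is an assumption for illustration: the readings are made-up numbers shaped like my situation rather than values exported from Open Hardware Monitor, the sensor names are placeholders, and `LIMIT_C` is a stand-in that you should replace with the maximum from your manufacturer's documentation.

```python
# Placeholder limit in degrees Celsius; use the value from your
# manufacturer's documentation instead of this made-up number.
LIMIT_C = 85.0


def peak_temps(readings):
    """Map each sensor name to the highest temperature seen for it."""
    peaks = {}
    for sensor, temp_c in readings:
        peaks[sensor] = max(temp_c, peaks.get(sensor, temp_c))
    return peaks


def compare_runs(before, after, limit_c):
    """For sensors present in both runs, return
    (peak before, peak after, change, new peak within limit)."""
    before_peaks = peak_temps(before)
    after_peaks = peak_temps(after)
    report = {}
    for sensor in sorted(before_peaks.keys() & after_peaks.keys()):
        old, new = before_peaks[sensor], after_peaks[sensor]
        report[sensor] = (old, new, new - old, new <= limit_c)
    return report


if __name__ == "__main__":
    # Illustrative readings only: (sensor name, degrees Celsius).
    before = [("CPU Core #1", 92.0), ("CPU Core #1", 100.0), ("CPU Core #2", 71.0)]
    after = [("CPU Core #1", 58.0), ("CPU Core #1", 60.0), ("CPU Core #2", 55.0)]

    for sensor, (old, new, delta, ok) in compare_runs(before, after, LIMIT_C).items():
        status = "within limit" if ok else "STILL TOO HOT"
        print(f"{sensor}: {old:.1f} C -> {new:.1f} C ({delta:+.1f} C), {status}")
```

The point is simply to take the same measurement under the same load before and after the change, so you know the fix did something rather than hoping it did.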

See Also:
Windows Crash Dump Analysis
