Windows Confidential

Work Harder, Not Smarter

Raymond Chen

For software developers, one of the great benefits of Windows Error Reporting is that they can get real-world information about how their programs fail. These error reports are available to any software vendor who registers online at winqual.microsoft.com. Sure, you can run all sorts of tests in your labs, but there is always a gap between lab conditions and real-world conditions. This can result in bugs that never show up in your testing but that drive your customers crazy.

I was told a story a few years ago by a programmer who had been assigned a task by his product's reliability team. The team wanted to address the top reasons for crashes and hangs in a particular component. After some negotiation, both parties agreed to set a goal of reducing the number of these types of failures by a factor of 2. In other words, they wanted to fix the bugs responsible for at least 50 percent of the crashes and hangs in that component, as measured by Windows Error Reporting.

We all know that failures are not uniformly distributed. Some are rare, others are common, and still others are rampant (relatively speaking). Engineering is a game of trade-offs; given finite resources (time, money, and brainpower), you have to dedicate your resources to where they will have the greatest effect. Since the goal here was to reduce the number of crashes and hangs, the sensible approach was to study the most common failures reported by Windows Error Reporting, understand the root causes, and then try to fix these common failures.
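The triage described here amounts to ranking failure buckets by report count and taking the most frequent ones until they cover the target share. A minimal sketch, assuming the reports have already been grouped into named buckets; the function name, bucket names, and counts are invented for illustration and are not real Windows Error Reporting data:

```python
def buckets_to_cover(counts, target_share):
    """Return the most frequent buckets whose combined share reaches target_share."""
    total = sum(counts.values())
    if total == 0:
        return [], 0.0
    chosen, covered = [], 0
    # Walk the buckets from most to least frequent.
    for name, count in sorted(counts.items(), key=lambda kv: kv[1], reverse=True):
        if covered >= target_share * total:
            break
        chosen.append(name)
        covered += count
    return chosen, covered / total


# Hypothetical report counts per failure bucket (illustrative only).
reports = {"hang_A": 300, "crash_B": 250, "crash_C": 200, "hang_D": 150, "crash_E": 100}
top, share = buckets_to_cover(reports, 0.5)
```

With these made-up numbers, the two most common buckets already cover more than half of the reports, so they are where the effort would go first.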

Figure: Windows Error Reporting provides real-world information on program failures

When the programmer dug into the failure data, it became clear that five particular crashes and hangs accounted for more than 60 percent of the reports. If the programmer could fix the bugs that were causing these five failures, fewer than 40 percent of the reports would remain, so he'd reduce the number of crashes and hangs by a factor of at least 2.5, well above the target factor of 2.
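The arithmetic behind those factors is simple: if a fraction f of the failures is eliminated, a fraction 1 - f remains, so failures drop by a factor of 1 / (1 - f). A small sketch of that calculation (the function name is made up for illustration):

```python
def reduction_factor(fixed_share):
    """Factor by which failures drop when `fixed_share` of them is eliminated."""
    if not 0 <= fixed_share < 1:
        raise ValueError("fixed_share must be in the range [0, 1)")
    return 1 / (1 - fixed_share)


target = reduction_factor(0.5)    # the agreed goal: half the failures gone
achieved = reduction_factor(0.6)  # the five failures: 60 percent of reports
```

Fixing 50 percent of the failures gives the target factor of 2, and fixing 60 percent gives a factor of about 2.5.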

Microsoft's own analysis of Windows Error Reporting data shows that the Pareto principle is a surprisingly good rule of thumb for Windows crashes and hangs; about 20 percent of bugs cause 80 percent of failures. The same study revealed an even more surprising result: just 1 percent of bugs cause 50 percent of errors. Not all bugs are created equal, but it's quite an eye-opener to learn how heavily skewed the distribution is.

After studying those five failures more carefully, the programmer realized that all of them had the same root cause. The crashes and hangs were just different manifestations of the same underlying bug. So he developed and tested a fix and, working with the folks in quality assurance, set in motion the steps for including the fix in the next patch. With that one bug fix, the theoretical failure rate for the component instantly dropped by a factor of 2.5. His mission was accomplished with one fairly straightforward code fix.

When I heard this story, I was amazed that the error distribution curve for the component was even sharper than for Windows. Well over half of all the crashes and hangs were caused by a single bug. This is the sort of insight into your code that isn't possible without the data collected by something like Windows Error Reporting.

You'd think the reliability team would be ecstatic to have met its goal so quickly. The agreed-upon target was exceeded in just two weeks with a single fix. Instead, they were furious.

"That's unacceptable—we expected this to take two months. It was too easy!" Apparently, they wanted the programmer to work harder, not smarter.

Raymond Chen's Web site, The Old New Thing, and identically titled book deal with Windows history and Win32 programming. Twenty percent of his clothes are responsible for eighty percent of his laundry.