Paper Summary #2: Where the Bugs Are

Who wrote it:
Thomas J. Ostrand, Elaine J. Weyuker, and Robert M. Bell of AT&T Labs - Research

What's it called:
Where the Bugs Are

Where was it published:
ISSTA 2004

Background:
Another really interesting area of research to us Team System folks is predicting where program faults will occur.  I've got a bunch of papers in this area that I'd like to share.  Given access to a bug database and source control history, can you predict where future bugs will occur?

This paper's contribution:
In this paper, the researchers ran regressions on several "explanatory variables" to try to determine which files would account for the most faults in the product.  The first really interesting result is that 20% of all files have 80% of all bugs.  The number of prior faults in a file was found to be an important component of their model.  In other words, files that had faults in them would usually continue to have faults in them.  Even more important in their prediction equation was log(KLOC) (i.e. the log of the file's size in thousands of lines of code).  On the one hand you would expect this ("more code -> more bugs"), but I wonder if another factor is that a file with a ton of code in it was never broken into pieces because the developer didn't understand the complexity of the components being developed well enough to subdivide it.  The model also gives boosts to new files (fresh bugs!) and changed files (fixes that regress).
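To make the shape of such a model concrete, here is a rough sketch of how the variables mentioned above could be combined into a per-file score.  The log-linear form, the FileStats type, the function names, and every weight are my own assumptions for illustration; they are not the equation or coefficients from the paper.

```python
import math
from dataclasses import dataclass


@dataclass
class FileStats:
    path: str
    loc: int            # lines of code in the file
    prior_faults: int   # faults recorded against the file in earlier releases
    is_new: bool        # file did not exist in the previous release
    is_changed: bool    # file existed before and was modified


# Hypothetical weights, chosen only to illustrate the shape of the model.
# They are NOT the coefficients reported in the paper.
WEIGHTS = {
    "intercept": -2.0,
    "log_kloc": 1.0,
    "prior_faults": 0.3,
    "is_new": 0.8,
    "is_changed": 0.4,
}


def predicted_faults(f: FileStats) -> float:
    """Expected fault count under an assumed log-linear model."""
    kloc = max(f.loc, 1) / 1000.0  # guard against log(0) for empty files
    linear = (
        WEIGHTS["intercept"]
        + WEIGHTS["log_kloc"] * math.log(kloc)
        + WEIGHTS["prior_faults"] * f.prior_faults
        + WEIGHTS["is_new"] * f.is_new          # bools count as 0 or 1
        + WEIGHTS["is_changed"] * f.is_changed
    )
    return math.exp(linear)


def rank_files(files):
    """Return files ordered from most to least fault-prone."""
    return sorted(files, key=predicted_faults, reverse=True)


# Made-up example files, purely for demonstration.
files = [
    FileStats("parser.c", 4000, 3, False, True),
    FileStats("util.c", 300, 0, False, False),
    FileStats("newfeature.c", 1200, 0, True, False),
]
for f in rank_files(files):
    print(f"{f.path}: {predicted_faults(f):.2f}")
```

In practice you would fit the weights with a regression over your own bug database and source control history rather than pick them by hand.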

The same authors had a related paper in the Workshop on Dynamic Analysis about using their models to drive testing.  The basic idea is that once you know where the future faults are likely to be, you should test those files more heavily.
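As a rough illustration of the "test the likely trouble spots harder" idea, the sketch below takes per-file predicted fault counts, picks the top 20% of files, and reports what share of the predicted faults those files cover.  The function name, the 20% default, and the sample numbers are all made up for this example; this is not the procedure from the workshop paper.

```python
def pick_files_to_test(predicted, top_percent=20):
    """Pick the top_percent of files with the highest predicted fault counts.

    Returns (chosen_paths, share), where share is the fraction of all
    predicted faults that the chosen files account for.
    """
    ranked = sorted(predicted.items(), key=lambda kv: kv[1], reverse=True)
    n = max(1, len(ranked) * top_percent // 100)  # always pick at least one
    chosen = ranked[:n]
    total = sum(predicted.values())
    share = sum(count for _, count in chosen) / total if total else 0.0
    return [path for path, _ in chosen], share


# Made-up predictions, purely for demonstration.
predicted = {
    "parser.c": 8.0,
    "util.c": 1.0,
    "net.c": 0.5,
    "ui.c": 0.5,
    "log.c": 0.0,
}
paths, share = pick_files_to_test(predicted)
print(paths, f"{share:.0%}")
```

With these sample numbers, one file out of five is selected and it carries most of the predicted faults, which is the kind of skew that makes targeted testing attractive.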