Data Flow Analysis Rules in Visual Studio 2010

This blog post applies to Visual Studio 2010 Premium and Ultimate editions. The content was written by Kevin Blasko, one of the developers on the Visual Studio Code Analysis team.

The purpose of this blog post is to give you some background on the implementation of the Dataflow Analysis (DFA) rules in Visual Studio 2010.

 

New Engine, New Rules for Visual Studio 2010

If you looked at the What’s New in Visual Studio 2010 blog post, you will have noticed that there are several new Code Analysis rules for managed code. Several of these new rules are what we describe as DFA rules; they are implemented using the Phoenix Code Analysis engine. The rules are:

CA1062 Validate Arguments of Public Methods

CA1303 Do Not Pass Literals as Localized Parameters

CA2000 Dispose Objects Before Losing Scope

CA2100 Review SQL Queries For Security Vulnerabilities

CA2202 Do Not Dispose Objects Multiple Times

CA2204 Literals Should Be Spelled Correctly

CA2215 Dispose Methods Should Call Base Class Dispose

CA2241 Provide Correct Arguments to Formatting Methods

This list of rules should give you an idea of what a DFA engine is capable of.

 

Why Implement A New Code Analysis Engine?

We needed to implement a new analysis engine to improve the quality and performance of DFA rules. We chose to do this using a technology developed by Microsoft Research called Phoenix. Phoenix is an extensible compiler framework with advanced analysis capabilities. Static analysis turns out to be very important to compilers: the more information you have about the state of the program, the better the optimizations you can make when compiling it. We took advantage of the existing analysis infrastructure as a starting point and built our new engine on top of it. The new engine produces better quality analysis than our previous DFA engine, with less noise (false positives), fewer missed violations (false negatives), and better performance.

The Phoenix engine excels at analyzing function bodies, whereas the engine used for the rest of the managed code rules is good at symbolic analysis of class and function definitions (something the Phoenix engine can do as well). Consider the following example, which is simple to read but surprisingly difficult to analyze without an engine with Phoenix’s capabilities. For the sample, assume that we’re trying to detect violations of CA1062 Validate Arguments of Public Methods.

 

Code Snippet

  1. public void Function(string argument)
  2. {
  3.     string localVariable = argument;
  4.  
  5.     if (localVariable != null)
  6.     {
  7.         Console.WriteLine(argument.Length);
  8.     }
  9.     else
  10.     {
  11.         Console.WriteLine(localVariable.Length);
  12.     }
  13. }

Looking at “Function”, you can see that line 11 contains a CA1062 violation while line 7 does not, but how can you determine this programmatically? The analysis first needs to determine which object dereferences could potentially result in a violation. As the assignment on line 3 shows, this isn’t as straightforward as it first appears: in addition to looking for dereferences of “argument”, you now need to look for dereferences of “localVariable”.
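
To make the aliasing problem concrete, here is a minimal sketch of how an analysis might track which locals are simple copies of the parameter, so that dereferences of those copies are treated as dereferences of the parameter itself. This is purely illustrative: the CopyTracking class and its string-based model are invented for this post and are not how the engine actually represents the program.

    using System;
    using System.Collections.Generic;

    // Hypothetical sketch: track which local names are simple copies of the
    // parameter, so dereferences of those names are checked for CA1062 too.
    static class CopyTracking
    {
        // Each pair is (target, source) for a simple assignment "target = source;".
        static HashSet<string> NamesAliasingParameter(
            string parameter, IEnumerable<(string Target, string Source)> assignments)
        {
            var aliases = new HashSet<string> { parameter };
            foreach (var (target, source) in assignments)
            {
                if (aliases.Contains(source))
                    aliases.Add(target);     // target now refers to the same object
                else
                    aliases.Remove(target);  // target was overwritten with something else
            }
            return aliases;
        }

        static void Main()
        {
            // Models line 3 of the snippet: string localVariable = argument;
            var assignments = new[] { (Target: "localVariable", Source: "argument") };
            foreach (var name in NamesAliasingParameter("argument", assignments))
                Console.WriteLine(name);     // prints "argument" and "localVariable"
        }
    }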

Now that you’ve found all of the potential dereferences, you need to determine whether or not it’s possible for “argument’s” value to be null at that particular dereference. If you’ve written code that is able to detect that “localVariable” dereferences need to be checked, it probably isn’t that hard to look for null checks against both “argument” and “localVariable”, so the next problem is finding the correct “if” statement that determines whether the dereference is executed. Like before, this is harder than it looks. By looking at the code, it’s easy to tell that the dereferences occur within the scope of the “if-else” statement on lines 5 and 9. However, if you’re using an analysis engine whose object model is little more than a wrapper over the raw MSIL, the instructions that you’re analyzing have no concept of scope. The easiest thing you can do is walk the instructions backwards, looking for a branch instruction. But what happens if there’s a branch from further down in the program to line 7 or 11? The branch you found could actually be part of a loop! Or even worse, the author could have inserted a goto that jumps directly to line 7!
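
To see how fragile that approach is, here is a contrived variant of the example (not from the snippet above) in which a goto bypasses the null check entirely; the nearest branch found by walking backwards from the dereference is a null check, but it is taken precisely when the value is null:

    using System;

    class GotoExample
    {
        public void Function(string argument)
        {
            string localVariable = argument;

            if (localVariable == null)
            {
                goto Dereference;   // taken exactly when the value IS null
            }

            Console.WriteLine("argument was not null");

        Dereference:
            // Walking the instructions backwards from here finds the "if" above,
            // but that branch gives no guarantee that localVariable is non-null.
            Console.WriteLine(localVariable.Length);
        }
    }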

Assuming that you get all of the control flow problems sorted out and are able to match the dereference to the branch instruction (maybe you build your own control flow graph, or decide that walking the instructions back to the first branch is good enough), you now need to determine whether the branch is conditional and, if so, whether the condition is a null check. In cases like the one in the example, this will be easy enough, but in many cases the condition will be the result of a more complex comparison, or may not be a direct comparison at all. Depending on what you’re willing to accept here, things can get complicated very quickly.
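
As a contrived illustration of an indirect condition, the null check below is folded into an intermediate boolean, so the branch condition itself is never a comparison against null, and simple pattern matching on the branch would miss the connection:

    using System;

    class IndirectCheckExample
    {
        public void Function(string argument)
        {
            // The null check is hidden inside the computation of "isValid";
            // the branch only tests a boolean local.
            bool isValid = argument != null && argument.Length > 0;

            if (isValid)
            {
                Console.WriteLine(argument.Length);   // safe, but only indirectly
            }
        }
    }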

So how does a DFA engine make this easier? The key benefit of a DFA engine is its ability to propagate known information about program state through a method using a control flow graph. New information about the state of the program is discovered from the information that is known initially, then fed back into the system and used to further refine the engine’s understanding of the program. The information is propagated by applying a set of rules to the input data, depending on the type of MSIL instruction being analyzed. This lets you think about each propagation rule in isolation, and worry only about the output for a narrow, specific set of inputs.
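
To make this concrete, here is a minimal sketch of what such a propagation engine might look like. It is illustrative only: none of these types or members (Block, Nullness, Transfer, Run) come from the Phoenix object model, and the merge logic is deliberately simplified.

    using System;
    using System.Collections.Generic;
    using System.Linq;

    // Hypothetical sketch, not the Phoenix object model: facts about variables
    // flow through a control flow graph, a per-block transfer function derives
    // new facts, and the loop repeats until no block's output changes.
    enum Nullness { MaybeNull, NotNull, Null }

    sealed class Block
    {
        public string Name = "";
        // Transfer function: given the facts on entry, produce the facts on exit.
        public Func<Dictionary<string, Nullness>, Dictionary<string, Nullness>> Transfer = facts => facts;
        public List<Block> Predecessors = new List<Block>();
        public List<Block> Successors = new List<Block>();
        public Dictionary<string, Nullness> Out = new Dictionary<string, Nullness>();
    }

    static class DataflowSketch
    {
        // Merge facts arriving from several predecessors; disagreement degrades to
        // MaybeNull. (A real lattice-based analysis is more careful, for example
        // about facts that are missing on some paths.)
        static Dictionary<string, Nullness> Merge(IEnumerable<Dictionary<string, Nullness>> inputs)
        {
            var merged = new Dictionary<string, Nullness>();
            foreach (var input in inputs)
                foreach (var pair in input)
                    merged[pair.Key] =
                        merged.TryGetValue(pair.Key, out var known) && known != pair.Value
                            ? Nullness.MaybeNull
                            : pair.Value;
            return merged;
        }

        static void Run(List<Block> blocks, Dictionary<string, Nullness> entryFacts)
        {
            var worklist = new Queue<Block>(blocks);
            while (worklist.Count > 0)
            {
                var block = worklist.Dequeue();
                var inFacts = block.Predecessors.Count == 0
                    ? entryFacts
                    : Merge(block.Predecessors.Select(p => p.Out));
                var outFacts = block.Transfer(new Dictionary<string, Nullness>(inFacts));

                // Re-queue successors only when this block's output actually changed.
                bool changed = outFacts.Count != block.Out.Count ||
                    outFacts.Any(p => !block.Out.TryGetValue(p.Key, out var old) || old != p.Value);
                if (changed)
                {
                    block.Out = outFacts;
                    foreach (var successor in block.Successors)
                        worklist.Enqueue(successor);
                }
            }
        }

        static void Main()
        {
            // Model the example method: the entry block copies "argument's" fact to
            // "localVariable", and the two branch blocks stand in for the
            // compare-against-null propagation rule described in the next paragraph.
            var entry = new Block { Name = "entry", Transfer = f => { f["localVariable"] = f["argument"]; return f; } };
            var thenB = new Block { Name = "then", Transfer = f => { f["argument"] = f["localVariable"] = Nullness.NotNull; return f; } };
            var elseB = new Block { Name = "else", Transfer = f => { f["argument"] = f["localVariable"] = Nullness.Null; return f; } };
            entry.Successors.AddRange(new[] { thenB, elseB });
            thenB.Predecessors.Add(entry);
            elseB.Predecessors.Add(entry);

            Run(new List<Block> { entry, thenB, elseB },
                new Dictionary<string, Nullness> { ["argument"] = Nullness.MaybeNull });

            Console.WriteLine(elseB.Out["localVariable"]);   // prints "Null"
        }
    }

Because the worklist keeps re-processing blocks whose inputs have changed, this style of analysis naturally handles loops and forward jumps that defeat the “walk backwards to the nearest branch” approach.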

In the example above, you would have a propagation rule for a compare instruction that takes “localVariable” and “null” as its inputs. Since we know that the literal value “null” is null, we are able to infer new information about the null-ness of “localVariable” depending on which branch of the “if” statement is taken. As the output of this propagation rule, we would annotate each branch in the control flow graph with the null-ness information that we just inferred. This makes writing the Code Analysis rule simple: all we do is run the null-ness dataflow analysis, look for the dereferences of “argument” (which itself can be done as another dataflow analysis), and then extract the null-ness information for the object being dereferenced.
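
Continuing in the same hypothetical style (again, not the Phoenix API), the sketch below shows that final step: the facts the compare rule attached to each branch are simply looked up at every dereference site, and CA1062 is reported whenever the dereferenced value might be null. The line numbers refer to the code snippet earlier in the post.

    using System;
    using System.Collections.Generic;

    // Hypothetical sketch of the rule itself: consume the branch annotations
    // produced by the null-ness dataflow analysis at each dereference site.
    enum Nullness { MaybeNull, NotNull, Null }

    static class Ca1062Sketch
    {
        static void Main()
        {
            // Facts produced by the "compare against null" propagation rule for the
            // example method; "localVariable" is a known copy of "argument", so the
            // inferred facts apply to both names.
            var thenBranchFacts = new Dictionary<string, Nullness>
            {
                ["argument"] = Nullness.NotNull,
                ["localVariable"] = Nullness.NotNull,
            };
            var elseBranchFacts = new Dictionary<string, Nullness>
            {
                ["argument"] = Nullness.Null,
                ["localVariable"] = Nullness.Null,
            };

            // Dereference sites found by a separate pass: (line, variable, facts in scope).
            var dereferences = new[]
            {
                (Line: 7, Variable: "argument", Facts: thenBranchFacts),
                (Line: 11, Variable: "localVariable", Facts: elseBranchFacts),
            };

            foreach (var site in dereferences)
            {
                // Report CA1062 whenever the dereferenced value might be null.
                if (site.Facts[site.Variable] != Nullness.NotNull)
                    Console.WriteLine(
                        "CA1062: line " + site.Line + " dereferences '" + site.Variable +
                        "' without validating that it is non-null.");
            }
        }
    }

Only line 11 produces a warning, which matches the violation described for the snippet above.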

 

What Does This Cost Me?

This improved analysis does come at a cost to the Code Analysis user. It turns out that enabling even just one of the DFA rules will approximately double the time it takes to run code analysis. The reason is that both the Phoenix engine and the existing (Introspection, aka CCI) engine need to load the target assembly and all of its references into their own internal representations and then prepare those representations for rule consumption. The good news is that the amount of time it takes to execute all of the DFA rules isn’t much different from executing just one. To offset the performance hit, we’ve improved the performance of the Introspection engine so that it runs about 25% faster than the Visual Studio 2008 version.

NOTE: The Phoenix Code Analysis engine and associated DFA rules do not ship in the standalone FxCop release.