A Model for Query Interpretation
Evaluating Query Expressions
After writing the code to translate query expressions and remove transparent identifiers, one of my first desires was to write a tool that would generate query expressions to test the correctness of the code. Of course, I already had a set of programmer tests in place to make sure that the code worked. But I also knew, that this feature exposed so many other features that it was sure to introduce pain points that did not previously exist.
Compiling the average query involves a large number of language features of which many are new: query expressions, anonymous types (transparent identifiers), extension methods (method calls), lambdas (parameters), expression trees, object initializers (anonymous type initialization), type inference (generic methods and lambdas), and iterators (LINQ query operators). Many of these features are among the more complex features in the language. Even if all of these features work correctly when evaluated separately, it seemed likely that their interaction would be the source of bugs. So I started thinking about what I could do to sample the space of query expressions to see if we were doing alright.
My initial idea was to generate random syntactically correct queries and then feed the query expressions to the C# compiler and an auxiliary translator and then compare the results. Fortunately, our fantastic QA team already had a tool that did translation given an object model of the queries. So, I put together the random query generator and QA plugged it into their translator and we achieved the first goal. We quickly verified that translation worked correctly and found a few spec and code issues which were quickly resolved. But we still hadn't tested the interaction of the many features that are used in query expressions, only the translation itself had been tested.
While driving home I thought, why can't we do more than translate the queries? Why can't we compile them? Why can't we run them? In order to do this, the generator should generate random semantically correct queries. With a little work (defining an algebra and formalizing notions of scope) on the generator, it was able to generate random (large and fairly complex although ultimately meaningless) semantically correct queries.
This was great and enabled us to find many very interesting bugs in the compiler. It opened our minds to some big performance problems in some of the new language features. It seemed that in several cases, the compiler generates code patterns that typical users didn't previously write and so we had to modify the compiler to enable that style of code to perform well both during compilation and at runtime. In fact, these findings fed back into the language design and we changed the query rewrite rules somewhat.
At first, we didn't give much serious thought to taking to the next step and running the generated queries, because we would have to know what results the queries should produce in order to really effectively test them. The big problem with predetermining the results is that queries are pure syntactic sugar for a series of method calls to methods which are not determined by the query expressions. In order to interpret the queries, we needed a way to interpret queries in terms of the queries and not the translation. Then I realized that we could interpret queries without translating them given a well defined provider. An interpreter for queries against the LINQ to objects provider was written. Once I implemented a query interpreter, we were able to test the translation of queries, the compilation of queries, and the runtime semantics of both the code that was generated and LINQ to objects (IEnumerable<T> and IQueryable<T>). We found more bugs, both in the compiler and in System.Core, doing this, including at least one bug in subtle interactions of the type system and nullables that had been introduced in Whidbey.
As interesting as the exercise was from a software engineering standpoint, I want to focus on the notion of interpreting queries. My last post discussed the syntax and scoping of queries. The formalisms that were developed to discuss the scoping of queries were actually conceived of while writing the semantically correct query generator. These scoping formalisms have been very helpful to our team in both understanding and writing queries. It has been just as helpful to have some notions of how to interpret queries without translating them. So, I thought I would share those with you in the hope that they will help you to understand queries better.
Interpretation without Translation
In this section, a way to interpret LINQ to objects queries without doing the translation will be described. The essential notion is that each clause takes a sequence of bindings and produces a sequence of bindings where a binding is a set of variables along with the values assigned to those variables. For example, consider the following query:
from x in foo
where x >= 2
select x + 1
Where
foo = new int[] { 0, 1, 2, 3 }
The first clause does not take a binding but it produces a sequence of bindings that bind the variable x to each element of foo in successive order. This will be denoted as:
[(x:0), (x:1), (x:2), (x:3)]
The next clause, the where clause filters the bindings based on the predicate. The resultant bindings are:
[(x:2), (x:3)]
Finally, the select clause generates the projection of the bindings in successive order. The results are:
3, 4
Seems simple enough, but lets try something more complex.
from x in foo
where x >= 2
from y in foo
where x <= y
orderby x descending, y descending
select x + y
Look at the bindings after each clause:
from x in foo
[(x:0),(x:1),(x:2),(x:3)]
where x >= 2
[(x:2),(x:3)]
from y in foo
[(x:2,y:0),(x:2,y:1),(x:2,y:2),(x:2,y:3),(x:3,y:0),(x:3,y:1),(x:3,y:2),(x:3,y:3)]
where x <= y
[(x:2,y:2),(x:2,y:3),(x:3,y:3)]
orderby x descending, y descending
[(x:3,y:3),(x:2,y:3),(x:2,y:2)]
select x + y
6,5,4
Notice, how after each clause we have a sequence of bindings that are then fed into the next clause. The bindings may grow to include more variables (like after the second from). Finally, the terminating clause (select or group...by) produces a projection or grouping based on the bindings. Here is how to handle each clause type:
from x in src
for each binding b
for each element e in src
extend b with x equals e
let x = e
for each binding b
extend b with x equals e
join x in src on outerKey equals innerKey
for each binding b
for each element e in src where outerKey equals innerKey
extend b with x equals e
join x in src on outerKey equals innerKey into g
for each binding b
for all elements e in src where outerKey equals innerKey
extend b with g equals group of all elements which satisfy previous condition
where p
for each binding b where p is true
pass b on
orderby k1, k2, ...
for each binding b
sort elements by first applying k1 to b, then k2 to b, and so on
select e
for each binding b
produce result by evaluating e in context of b
group e1 by e2
for all bindings b where e2 is equal
produce group result (key = e2, results = sequence of evaluation of e2 from b)
into x
for each result r
create binding with x equal to r
Here is a final example:
using System;
using System.Linq;class Program
{
static void Main(string[] args)
{
var foo = new[] { 2, 0, 3, 1 };
var bar = new[] { "fred", "jim", "bob", "george" };var q = from x in foo // [(x:2),(x:0),(x:3),(x:1)]
where x >= 2 // [(x:2),(x:3)]
from y in bar // [(x:2,y:"fred"),(x:2,y:"jim")
// ,(x:2,y:"bob"),(x:2,y:"george")
// ,(x:3,y:"fred"),(x:3,y:"jim")
// ,(x:3,y:"bob"),(x:3,y:"george")]
where y.Length > 3 // [(x:2,y:"fred"),(x:2,y:"george")
// ,(x:3,y:"fred"),(x:3,y:"george")]
orderby x, y // [(x:2,y:"fred"),(x:2,y:"george")
// ,(x:3,y:"fred"),(x:3,y:"george")]
select new { x, y }; // { x = 2, y = fred }
// { x = 2, y = george }
// { x = 3, y = fred }
// { x = 3, y = george }foreach (var item in q) Console.WriteLine(item);
}
}
And here is the output from the program:
{ x = 2, y = fred }
{ x = 2, y = george }
{ x = 3, y = fred }
{ x = 3, y = george }
Implications
It is fascinating, although not totally surprising, that given a well defined provider, a model for interpreting the queries can be formed. Given our model for interpreting LINQ to objects queries, several facts become apparent.
1. Each query clause is evaluated independently of the previous clauses. Only the binding information flows between them.
2. The clauses are evaluated in order, hinting at imperative evaluation.
Given these facts, it is easy to see that LINQ to objects queries are analogous to the relational algebra and therefore they are not as much declarative in nature as they are a specification of a series of operations. This leads to several considerations that should be taken. Because it is closer to the relational algebra than it is to the relational calculus, there are performance implications based upon the order of clauses. Consider the following two queries:
var q1 = from x in foo
from y in bar
where p(x)
select f(x,y);
var q2 = from x in foo
where p(x)
from y in bar
select f(x, y);
The two queries look very similar, but if we are evaluating the query against the LINQ to objects provider then two queries are quite different. For this provider there is no query planner and so while they produce the same results, the second query probably performs less steps. Of course, it is not clear whether or not the performance difference will matter in a given application but it is wise to know the difference and know why there is a difference.
The number of elements that are filtered by the where clause in the first query is equal to the number of elements in foo times the number of elements in bar. The number of elements that are filtered by the second query is equal to the number of elements in foo. Order matters when providers model the relational algebra. (Try interpreting the query by hand for some values of foo and bar)
In the next post, we will see that other query providers expose a model that is more declarative in nature rather than imperative.
Merry Querying!
Comments
Anonymous
December 29, 2006
Welcome to the sixteenth Community Convergence. This column comes out about once a week and is designedAnonymous
December 29, 2006
Welcome to the sixteenth Community Convergence. This column comes out about once a week and is designedAnonymous
December 29, 2006
Welcome to the sixteenth Community Convergence. This column comes out about once a week and is designedAnonymous
January 02, 2007
If you have ever tried to step through a Linq to Objects query in the debugger, you may have been mildlyAnonymous
January 07, 2007
If you are the kind of guy fascinated by the C# language and LINQ who wants to know more about what'sAnonymous
January 07, 2007
If you are the kind of guy fascinated by the C# language and LINQ who wants to know more about what'sAnonymous
January 07, 2007
If you are the kind of guy fascinated by the C# language and LINQ who wants to know more about what'sAnonymous
January 08, 2007
An imperative model for interpreting Linq to Objects queries has already been discussed , but are thereAnonymous
January 09, 2007
This concludes my series of posts about queries. I will still discuss them occassionally and if anyoneAnonymous
January 09, 2007
This concludes my series of posts about queries. I will still discuss them occassionally and if anyoneAnonymous
February 12, 2007
A co-worker just sent me a link to your blog, and I've been working through your archived posts (using the January Orcas CTP). In your second example of bindings, you have: orderby x descending, y descending [(x:3,y:3),(x:2,y:2),(x:2,y:3)] select x + y 6,4,5 If it's sorting by x (descending), then by y (descending), shouldn't that be: orderby x descending, y descending [(x:3,y:3),(x:2,y:3),(x:2,y:2)] select x + y 6,5,4 Thanks for some very interesting and informative articles!Anonymous
February 12, 2007
Yes, you are absolutely right. Thanks for the catch. Glad you are enjoying it.Anonymous
June 03, 2007
Are there any plans to ship a LINQ to objects query planner? Would the compilers (C# and VB) will be modified so they always do .Where( ... ) before anything else in the expression tree?Anonymous
June 04, 2007
I would imagine that the P-LINQ (Parallel LINQ) project would have to include such a query planner in order to determine if-at-all / how-to best distribute a LINQ to Objects query over the available CPU cores.