Persistence, Facades and Roslyn's Red-Green Trees
We decided early in the Roslyn design process that the primary data structure that developers would use when analyzing code via Roslyn is the syntax tree. And thus one of the hardest parts of the early Roslyn design was figuring out how we were going to implement syntax tree nodes, and what information they would proffer up to the user. We would like to have a data structure that has the following characteristics:
- Immutable.
- The form of a tree.
- Cheap access to parent nodes from child nodes.
- Possible to map from a node in the tree to a character offset in the text.
- Persistent.
By persistence I mean the ability to reuse most of the existing nodes in the tree when an edit is made to the text buffer. Since the nodes are immutable, there's no barrier to reusing them, as I've discussed many times on this blog. We need this for performance; we cannot be re-parsing huge wodges of text every time you hit a key. We need to re-lex and re-parse only the portions of the tree that were affected by the edit (*), because we are potentially re-doing this analysis between every keystroke.
When you try to put all five of those things into one data structure you immediately run into problems:
- How do you build a tree node in the first place? The parent and the child both refer to each other, and are immutable, so which one gets built first?
- Supposing you manage to solve that problem: how do you make it persistent? You cannot re-use a child node in a different parent because that would involve telling the child that it has a new parent. But the child is immutable.
- Supposing you manage to solve that problem: when you insert a new character into the edit buffer, the absolute position of every node that is mapped to a position after that point changes. This makes it very difficult to make a persistent data structure, because any edit can change the spans of most of the nodes!
But on the Roslyn team we routinely do impossible things. We actually do the impossible by keeping two parse trees. The "green" tree is immutable, persistent, has no parent references, is built "bottom-up", and every node tracks its width but not its absolute position. When an edit happens we rebuild only the portions of the green tree that were affected by the edit, which is typically about O(log n) of the total parse nodes in the tree.
The "red" tree is an immutable facade that is built around the green tree; it is built "top-down" on demand and thrown away on every edit. It computes parent references by manufacturing them on demand as you descend through the tree from the top. It manufactures absolute positions by computing them from the widths, again, as you descend.
You, the consumer of the Roslyn API, only ever see the red tree; the green tree is an implementation detail. (And if you use the debugger to peer into the internal state of a parse node you'll in fact see that there is a reference to another parse node in there of a different type; that's the green tree node.)
Incidentally, these are called "red/green trees" because those were the whiteboard marker colours we used to draw the data structure in the design meeting. There's no other meaning to the colours.
The benefit of this strategy is that we get all those great things: immutability, persistence, parent references, and so on. The cost is that this system is complex and can consume a lot of memory if the "red" facades get large. We are at present doing experiments to see if we can reduce some of the costs without losing the benefits.
(*) Determining what those portions of the tree are is quite tricky; I might blog about that at a later date. An edit that, for example, adds "async" to a method can cause the parse of "await(foo);" in the method body to change from an invocation to a usage of the await contextual keyword.
Comments
Anonymous
June 08, 2012
This post could have used some pictures of the whiteboard to illustrate the concepts presented.Anonymous
June 08, 2012
The comment has been removedAnonymous
June 08, 2012
"We are at present doing experiments to see if we can reduce some of the costs without losing the benefits." Hmm, isn't a bit late in the project to do something like this? :) @Steven Proctor: The syntax tree nodes don't have a fixed number of children, after all they need to represent the syntax, a binary operator will probably have 3 children (left, op, right) and an "if" could have at least 4 (if, cond, then block, else block).Anonymous
June 08, 2012
Re: Mike I think Roslyn is C# 5.0, and Visual Studio 2012 will only come out with C# 4.5, so without RoslynAnonymous
June 08, 2012
Given that the red tree is both immutable and created on-demand, then presumably if you ask for the same child node twice you'll get two different objects. (Because otherwise you'd have to alter parent nodes to keep references to the freshly generated children.) So the red tree is rebuilt and thrown away every time you traverse it, not just every time there's an edit. Is that right?Anonymous
June 08, 2012
@Andomar You're confusing different things. Visual Studio 2012 comes with C# 5.0 (biggest change: await) and .Net 4.5. The finished version of Roslyn is supposed to be released some time (probably quite a long time) after Visual Studio 2012. But since Roslyn doesn't change the C# language or the .Net framework in any way, it may be released without changing the version number of C# or .Net.Anonymous
June 08, 2012
Thanks a lot for covering this, I have been wondering about this and exact three problems you listed! Your solution makes a lot of sense. I guess GC pressure must have been another important factor you considered in your design? How bad is discarding the red tree on every edit in that regard? How often is the full red tree needed? Other blog posts about the internals of Roslyn would be very welcome! In particular, design decisions regarding the handling of trivia and the bindings between the syntactic tree and semantic model would be great!Anonymous
June 08, 2012
... so the purple text is coming back /soon/, right? :) OT: this sounds like a really neat solution: simple (logically, at least!) and flexible - I'll probably end up stealing the idea at some point! I can potentially see a usability issue with people assuming identity holds across traversals, but even on that metric the tradeoff seems pretty clearly in favor of immutability. I'd love to know what sort of performance cost you're looking at though, compared to a mutable tree: are we talking 10%? 2x? 10x? Even then I'd probably pick the immutable implementation in many cases. (looks like it dropped my post again?)Anonymous
June 08, 2012
Great information as usual. I am glad the markers were not Black and Blue, that would have given rise to some interesting speculation...Anonymous
June 09, 2012
Hi Eric Would be that kind and response to my comment on SO? stackoverflow.com/.../10417510 ThxAnonymous
June 09, 2012
I love this kind of fly-on-the-wall posts. @Mike: I hope it's never too late to revisit implementation details. Roslyn is a large project, but since it doesn't implement all of C# or VB, it couldn't even have come to the point where it passes all tests, leaving plenty of time for wringing out performance before it needs to settle. Maybe you're confusing it with the public APIs that they want to settle early on so that people can get comfortable with them and know that the ground's not going to be moving beneath them as they start to build on them. @carlos: He didn't say "always" created on demand. It would be trivial to keep returning the same red node if the green tree (or maybe even green node) hadn't changed.Anonymous
June 09, 2012
I assumed part of the reason for making these nodes immutable was so they could be used from multiple threads without need for any locks (or even for lock-free techniques, for that matter). But if the red nodes are created on demand, I take it that must mean you do have some locking around that lazy-creation?Anonymous
June 09, 2012
The comment has been removedAnonymous
June 10, 2012
Sounds fairly close to zippers (www.haskell.org/.../Zipper)Anonymous
June 10, 2012
Regarding your comment on the migrated settings, I think it would only be fair that you should save your fellow developers some work and just officially change your name to Carl Nolan!Anonymous
June 10, 2012
@JesperEC: If everything is immutable, how can you keep a reference to a red node to return it more than once?Anonymous
June 10, 2012
@carlos the reference to the red node doesn't have to be in either of the trees. i.e. the red node creation function can have its own memoization state @Eugene Zippers was exactly what I was thinking tooAnonymous
June 11, 2012
Am I the only one who has read this before?Anonymous
June 11, 2012
I'm trying to think of some witty comment on immutability and Lippert/Nolan name changes, but I'm coming up dry. Anyone else?Anonymous
June 11, 2012
Is it just me or it's this just a layering problem? Like with images and databases how you add abstractions that are only needed to get the result? For an image application you start with a base image and just add layers to hide or enhance the pixels and with databases you add indexes and views until you get your data in an efficient format and performance desired. So why not use multiple index trees that only contain extremely narrow defined scope and purpose, plus a compact and/or compressible throw-away structure that makes the green structure real time usable. I would think that this would have multiple advantages, where Roslyn could pick and choose which one to use depending on the layout and you can add more types in the future that it could take advantage of without changing the processing engine. Now it's starting to sound like a database engine..., so if a mouse can be a database, why not a syntax tree? This of course is an over simplification, but... does not this still hold true? Or at least not false?Anonymous
June 12, 2012
We got the purple text back, but the background lacks a blue gradient. Anyone cares to make a greasemonkey script to restore the old design? It was way better.Anonymous
June 13, 2012
If you would like to introduce a new keyword to cancel an async block, I will recommend "await break", which is consistent with iterator keyword "yield break".Anonymous
September 29, 2012
>we are potentially re-doing this analysis between every keystroke. OMG! I pray that in a few years the C# background checker would finally come close to what VB had for decade.