Monadic Parser Combinators using C# 3.0
Parser combinators are an idea that I enjoy every time I go back and look at again. They are an approach to building parsers by composing very simple atomic parsers into bigger and bigger units which can ultimately express real world grammars. This idea has been particularly popular in functional languages where the parsers can naturally be thought of as functions from input strings to parse trees, and composition of parsers is just function composition. This approach often leads to a simple syntax which makes the resulting parsers pleasantly declarative in that internal-DSL kind of way.
There's been plenty of great research and implementation of parser combinators over the last 15 years. I can't describe it all, or even list all the references here - so let me just point you to a couple of fun papers co-authored by my Microsoft colleagues Erik Meijer and Daan Leijen. These papers also include many references out to the broader set of research papers on the topic.
Building a Parser Combinator Framework in C#3.0
The code below is some sample code I wrote shortly after getting ahold of one of the first C#3.0 prototype compilers a couple years ago. It uses some of the new languages features in interestingly unusual ways, which I'll highlight as I go along. For reference, it is most directly influenced by the great parser combinator example given near the end of Scala by Example.
Step 1: A Type for Representing Parsers
using System; using System.Linq; using System.Text; // The result of a parse consists of a value and the unconsumed input. public class Result<TInput, TValue> { public readonly TValue Value; public readonly TInput Rest; public Result(TValue value, TInput rest) { Value = value; Rest = rest; } } // A Parser is a delegate which takes an input and returns a result. public delegate Result<TInput, TValue> Parser<TInput, TValue>(TInput input);
Taking a cue from the functional language implementations, our type for representing parsers will be a delegate type. This naturally captures the idea that a parser is a function which takes in some input, and returns a value as well as the unconsumed part of the input.
Step 2: Provide Infix AND and OR Operations on Parsers
public static class ParserCombinatorExtensions { public static Parser<TInput, TValue> OR<TInput, TValue>( this Parser<TInput, TValue> parser1, Parser<TInput, TValue> parser2) { return input => parser1(input) ?? parser2(input); } public static Parser<TInput, TValue2> AND<TInput, TValue1, TValue2>( this Parser<TInput, TValue1> parser1, Parser<TInput, TValue2> parser2) { return input => parser2(parser1(input).Rest); } }
Before building any actual parsers, we'll just come up with ways to combine them. There are two obvious ways. Try the first one, and if it fails, apply the second one. This corresponds to alternate productions in a BNF grammar. Or we could apply the first one, and then apply the second one to whatever input is left. This corresponds to sequencing (juxtaposition) in BNF grammars.
Note that these are extension methods applied to the Parser type, which is a delegate. This is something we couldn't have done in C# prior to C# 3.0, because there was no way to add methods to a delegate type. The "infix" syntax we get by making these extension methods instead of plain static methods is very important for the readability of the resulting parsers, where ordering is critical, as the example at the end of this post will show.
Step 3: Support Composing Parsers using C# Query Expressions
public static class ParserCombinatorsMonad { // By providing Select, Where and SelectMany methods on Parser<TInput,TValue> we make the // C# Query Expression syntax available for manipulating Parsers. public static Parser<TInput, TValue> Where<TInput, TValue>( this Parser<TInput, TValue> parser, Func<TValue, bool> pred) { return input => { var res = parser(input); if (res == null || !pred(res.Value)) return null; return res; }; } public static Parser<TInput, TValue2> Select<TInput, TValue, TValue2>( this Parser<TInput, TValue> parser, Func<TValue, TValue2> selector) { return input => { var res = parser(input); if (res == null) return null; return new Result<TInput, TValue2>(selector(res.Value), res.Rest); }; } public static Parser<TInput, TValue2> SelectMany<TInput, TValue, TIntermediate, TValue2>( this Parser<TInput, TValue> parser, Func<TValue, Parser<TInput, TIntermediate>> selector, Func<TValue, TIntermediate, TValue2> projector) { return input => { var res = parser(input); if (res == null) return null; var val = res.Value; var res2 = selector(val)(res.Rest); if (res2 == null) return null; return new Result<TInput, TValue2>(projector(val, res2.Value), res2.Rest); }; } }
This is where the fun really starts! Any type which implements Select, SelectMany and Where methods supports (part of) the "query pattern" which means we can write C#3.0 queries including multiple froms, an optional where clause and a select clause to process objects of this type. This can make it a *lot* easier to combine parsers, and makes the syntax more transparent. That the parsers we are using here support these three operations is captured in the idea that this parser type is a monad - something very much in LINQ's blood - which I hope to write more about in a future post.
Step 4: The Fundamental Parser Combinator Building Blocks
// Contains all the basic parsers that are independent of return type. public abstract class Parsers<TInput> { public Parser<TInput, TValue> Succeed<TValue>(TValue value) { return input => new Result<TInput, TValue>(value, input); } public Parser<TInput, TValue[]> Rep<TValue>(Parser<TInput, TValue> parser) { return Rep1(parser).OR(Succeed(new TValue[0])); } public Parser<TInput, TValue[]> Rep1<TValue>(Parser<TInput, TValue> parser) { return from x in parser from xs in Rep(parser) select (new[] { x }).Concat(xs).ToArray(); } } // Contains a few parsers that parse characters from an input stream public abstract class CharParsers<TInput> : Parsers<TInput> { public abstract Parser<TInput, char> AnyChar { get; } public Parser<TInput, char> Char(char ch) { return from c in AnyChar where c == ch select c; } public Parser<TInput, char> Char(Predicate<char> pred) { return from c in AnyChar where pred(c) select c; } }
So now we need to write some (almost) real parsers. The Succeed parser always succeeds with the provided result without consuming any input. Rep and Rep1 apply a parser repeatedly, they differ only in whether they require the parser to parse at least once. Note that Rep1 uses Query Expressions to capture the idea of parsing using "parser", then parsing using "Rep(parser)" then returning a result built out of the two individual results.
The AnyChar and Char parsers parse actual characters out of the input stream. AnyChar parses any single character, whereas the two overloads of Char only accept certain classes of characters. Note that AnyChar is defined abstract because it can only be implemented once we've decided what kind of input we're going to parse - but we want to be able to build on it without committing to a particular input type.
Step 5: An Example Parser for a MiniML Language
// Term and its derived classes define the AST for terms in the MiniML language. public abstract class Term { } public class LambdaTerm : Term { public readonly string Ident; public readonly Term Term; public LambdaTerm(string i, Term t) { Ident = i; Term = t; } } public class LetTerm : Term { public readonly string Ident; public readonly Term Rhs; public Term Body; public LetTerm(string i, Term r, Term b) { Ident = i; Rhs = r; Body = b; } } public class AppTerm : Term { public readonly Term Func; public readonly Term[] Args; public AppTerm(Term func, Term[] args) { Func = func; Args = args; } } public class VarTerm : Term { public readonly string Ident; public VarTerm(string ident) { Ident = ident; } } // Provides a set of parsers for the MiniML Language defined above. public abstract class MiniMLParsers<TInput> : CharParsers<TInput>{ public MiniMLParsers() { Whitespace = Rep(Char(' ').OR(Char('\t').OR(Char('\n')).OR(Char('\r')))); WsChr = chr => Whitespace.AND(Char(chr)); Id = from w in Whitespace from c in Char(char.IsLetter) from cs in Rep(Char(char.IsLetterOrDigit)) select cs.Aggregate(c.ToString(),(acc,ch) => acc+ch); Ident = from s in Id where s != "let" && s != "in" select s; LetId = from s in Id where s == "let" select s; InId = from s in Id where s == "in" select s; Term1 = (from x in Ident select (Term)new VarTerm(x)) .OR( (from u1 in WsChr('(') from t in Term from u2 in WsChr(')') select t)); Term = (from u1 in WsChr('\\') from x in Ident from u2 in WsChr('.') from t in Term select (Term)new LambdaTerm(x,t)) .OR( (from letid in LetId from x in Ident from u1 in WsChr('=') from t in Term from inid in InId from c in Term select (Term)new LetTerm(x,t,c))) .OR( (from t in Term1 from ts in Rep(Term1) select (Term)new AppTerm(t,ts))); All = from t in Term from u in WsChr(';') select t; } public Parser<TInput, char[]> Whitespace; public Func<char, Parser<TInput, char>> WsChr; public Parser<TInput, string> Id; public Parser<TInput, string> Ident; public Parser<TInput, string> LetId; public Parser<TInput, string> InId; public Parser<TInput, Term> Term; public Parser<TInput, Term> Term1; public Parser<TInput, Term> All; }
Here we've used the parsers built so far to implemented a parser for a (toy) language called MiniML. You can immediately see that the parser is structured much like the BNF it implements - which is the internal-DSL concept I mentioned above. You can also see how C# Query Expressions are used to help with this - in a context that is very different than their standard query usage.
Step 6: Parse Something!
public class MiniMLParserFromString: MiniMLParsers<string> { public override Parser<string, char> AnyChar { get { { return input => input.Length > 0 ? new Result<string,char>(input[0], input.Substring(1)) : null; } } } }
class Program { static void Main(string[] args) { MiniMLParserFromString parser = new MiniMLParserFromString(); Result<string,Term> result = parser.All(@"let true = \x.\y.x in let false = \x.\y.y in let if = \b.\l.\r.(b l) r in if true then false else true;"); } }
We finally have something we can execute! MiniMLParserFromString makes everything concrete by describing how to parse characters from a string input. And then in the Main method we see an example which parsers a little example expression from the MiniML language.
Conclusion
The little parser combinator framework here is very much a toy implementation. It's horribly inefficient, doesn't do error reporting, and doesn't support any form of error recovery. But it's a fun 150 lines of code! If you are interested in the more practical applications of parser combinators - these papers provide a lot of good ideas:
- "Efficient Parser Combinators", Pieter Koopman and Rinus Plasmeijer, 1998
- "Fast, Error Correcting Parser Combinators: A Short Tutorial", S. Doaitse Swierstra and Pablo R. Azero Alcocer, 1999
- "Parsec: Direct Style Monadic Parser Combinators for the Real World", Daan Leijen and Erik Meijer, 2001.
And here's the code:
Comments
Anonymous
August 19, 2007
PingBack from http://msdnrss.thecoderblogs.com/2007/08/19/monadic-parser-combinators-using-c-30/Anonymous
September 16, 2007
Welcome to the thirtieth Community Convergence. This edition has a wide range of articles on numerousAnonymous
November 14, 2007
A few weeks back, Soma blogged about an increased investment by the Microsoft Developer Division in theAnonymous
November 15, 2007
Luke Hoban is now full time as program manager on F#, and has just posted a short introduction aboutAnonymous
November 15, 2007
Luke Hoban is now full time as program manager on F#, and has just posted a short introduction aboutAnonymous
December 08, 2007
Продолжение беседы про прикладную функциональщину. Попытка реализовать парсер пAnonymous
December 13, 2007
If you just write some code without comments and tests nobody will understend it ...Anonymous
December 16, 2007
Dan - Fair point about the lack of comments and unit tests! The original code I sent around internally had a lot more comments, but for the blog post I moved these into the paragraphs between code blocks to get across the general approach being used. As for unit tests, about half the code in the sample provided is actually a test, which provides 100% code coverage on the code provided. But the code is no doubt very terse, and this does not necessarily help the understandability of the code.Anonymous
September 17, 2008
Does someone have more examples with comments?Anonymous
September 22, 2008
Предыдущая серия: http://blogs.gotdotnet.ru/personal/bezzus/PermaLink.aspx?guid=2F2B4B66-809C-441B-BDE9-FA45D71445F3...Anonymous
September 12, 2009
Where in the "scala by example" book the example about combinators? I can't find it, could you specify page number?Anonymous
September 21, 2009
I like the code syntax, but I was confused for a moment by the English in Step 1. "Queue" is a data structure. "Cue" is a stage direction in theater. I think you meant the latter, not the former.Anonymous
November 25, 2009
There is a bug in AND combinator. If first parser fails, parser1(input).Rest will cause null reference exception. Here the correct version: public static Parser<TInput, TValue2> AND<TInput, TValue1, TValue2>(this Parser<TInput, TValue1> parser1, Parser<TInput, TValue2> parser2) { return input => { var p1 = parser1(input); return p1 != null ? parser2(p1.Rest) : null; }; }Anonymous
November 26, 2009
Schernichkin - Thanks, good catch, I've updated the code sample. Chris Falter - Indeed,"cue" is the right word here, I've fixed that up.Anonymous
December 03, 2009
I quite like using operator| for OR, and & for AND. Unfortunately, you can't define & for parsers with different result types, because the syntax would be illegal: public static Parser<Token, Value2> operator &<Value2>(Parser<Token, Value> left, Parser<Token, Value2> right) Oh well, can't have everything.Anonymous
October 05, 2010
The Linq style of combinating really helps! For the sake of readability I would have liked some n and classes moved to seperate files though.Anonymous
October 13, 2010
Just posting a link to FParsec (www.quanttec.com/fparsec), a parser combinator library written in F# / C# that runs in .Net. FParsec is very robust and well-optimized. I've used it for parsing really large files. FParsec also has great error reporting and enough samples to get started. I recommend it not only for parsing programming languages, but for any situation where regular expressions will be too hard to write or too slow. CheersAnonymous
February 10, 2011
You liked to the Daan Leijen's draft paper, not the final copy. Here is the link to the final copy: citeseerx.ist.psu.edu/.../download