January 2012

Volume 27 Number 01

The Working Programmer - Building Combinators

By Ted Neward | January 2012

Ted NewardIn my December column (msdn.microsoft.com/magazine/hh580742), I looked at parser combinators, text parsers that are created by combining small, atomic parsing functions into larger functions, and those in turn into even larger functions, suitable for parsing non-trivial text files and streams. This is an interesting technique, one that builds on some core functional concepts, and it deserves deeper exploration.

Readers of the earlier column will recall that the parser we constructed to handle U.S.-style phone numbers worked, but the implementation was a bit … shall we say … quirky in places. In particular, the syntax for parsing three- and four-digit combinations—well, to be honest, it clunked. It worked, but it was hardly pretty, elegant or in any way scalable.

As a refresher, here’s the Phone Number parser code:

public static Parser<PhoneNumber> phoneParser =
  (from areaCode in areaCodeParser
   from _1 in Parse.WhiteSpace.Many().Text()
   from prefix in threeNumberParser
   from _2 in (Parse.WhiteSpace.Many().Text()).
        Or(Parse.Char('-').Many())
   from line in fourNumberParser
   select new PhoneNumber() { AreaCode=areaCode,
     Prefix=prefix, Line=line });

The PhoneNumber type is pretty guessable. Figure 1 shows the threeNumberParser and fourNumberParser, which, in particular, are what “clunk” with all the grace of a pro football defensive lineman attempting ballet for the first time on a stage greased with duck fat.

Figure 1 Clunky Parsers

public static Parser<string> threeNumParser =
  Parse.Numeric.Then(first =>
    Parse.Numeric.Then(second =>
      Parse.Numeric.Then(third =>
        Parse.Return("" + first.ToString() +
          second.ToString() + third.ToString()))));
public static Parser<string> fourNumParser =
  Parse.Numeric.Then(first =>
    Parse.Numeric.Then(second =>
      Parse.Numeric.Then(third =>
        Parse.Numeric.Then(fourth =>
          Parse.Return("" + first.ToString() +
            second.ToString() + third.ToString() +
            fourth.ToString())))));

This is hardly the kind of inspirational coding practice that I hope to convey within these pages. There’s a more elegant way to construct these, but to describe it, we need to dive a bit deeper into how parser combinators work. And that’s the subject of this month’s column. Not just because we need a more elegant way to construct parsers, mind you, but because the general technique helps describe how they work, and more important, how you might construct something like this in the future.

From Functions to Functions

The important point to realize about parser combinators is that a parser is really “just” a function: The function parses the text and may then transform the characters into something else. What that something else turns out to be is, of course, up to the person implementing the parser. It could be an abstract syntax tree (AST) for validation and verification of the text passed in (and later conversion into executable code or perhaps interpreted directly, as in some languages), or it could be a simple domain object, or even just values plugged into an existing class, like a dictionary of name-value pairs.

Speaking in code, then, a parser looks like the following:

T Parse<T>(string input);

In other words, a parser is a generic function, taking strings and returning an instance of something.

As simple as that is, though, it’s not entirely accurate. Were that the sum total of the story, we’d be back to writing a complete parser per function, which doesn’t really allow for much in the way of reuse. But if we look at parsing as a series of functions—in other words, a parser is made up of a bunch of little parsers, each of which knows how to parse just a piece of the input and return just a piece of the resulting object—it’s clear we need to return not only the resulting object, but also the remaining text that requires parsing. And that means  the “T” from the previous declaration has to be made slightly more complicated by wrapping it in a “Parser Result” type that contains both “T” and a string with the remaining parse text, like so:

public class ParseResult<T>
{
  public readonly T Result;
  public readonly string Rest;
  public ParseResult(T r, string i) {
    this.Result = r; this.Rest = i;
  }
}

And given that C# naturally manages functions as delegate types and instances, declaring a parser now becomes a delegate declaration:

public delegate ParseResult<T> ParseFn<T>(string input);

Now, we can imagine writing a series of small parsers that know how to parse text into some useful other type, such as a ParseFn<int> that takes a string and returns an int (see Figure 2), or a ParseFn<string> that parses up to the first whitespace character, and so on.

Figure 2 Parsing a String and Returning an Int

ParseFn<int> parseInt = delegate(string str)
  {
    // Peel off just numbers
    int numCount = 0;
    foreach (char ch in str)
    {
      if (Char.IsDigit(ch))
        numCount++;
      else
        break;
    }

    // If the string contains no numbers, bail
    if (numCount == 0)
        return null;
    else
    {
      string toBeParsed = str.Substring(0, numCount);
      return new ParseResult<int>(
        Int32.Parse(toBeParsed), str.Substring(numCount));
    }
  };
Assert.AreEqual(12, parseInt("12").Result);

Note that the parser implementation here is actually one that’s pretty repeatable: to write a parser that parses text up to a whitespace character, all you’d need to do is change the IsDigit call to an IsLetter call. This screams for a refactoring to use the Predicate<T> type to create an even more fundamental parser, but that’s an optimization we won’t attempt here.

This implementation is great for parsing little things such as integers and single words, but so far it doesn’t seem like too much of an improvement over the earlier version. This is more powerful, however, because you can combine functions by creating functions that take functions and return functions. These are called higher-order functions; while the theory is beyond the scope of this article, showing how they apply in this particular case isn’t. The starting point is when you create functions that know how to take two parser functions and combine them in a Boolean “AND” and “OR” fashion:

public static class ParseFnExtensions
{
  public static ParseFn<T> OR<T>(this ParseFn<T> parser1,
    ParseFn<T> parser2)
  {
    return input => parser1(input) ?? parser2(input);
  }
  public static ParseFn<T2> AND<T1, T2>(this ParseFn<T1> p1,
    ParseFn<T2> p2)
  {
    return input => p2(p1(input).Rest);
  }
}

Both of these are provided as extension methods on the ParseFn delegate type to allow for an “infix” or “fluent interface” style of coding, to make it more readable in the end, on the theory that “parserA.OR(parserB)” reads better than “OR(parserA, parserB).”

From Functions to LINQ

Before we leave this set of small examples, let’s take one more step and create three methods, as shown in Figure 3, that will essentially give the parser the ability to hook into LINQ, to provide a unique experience when writing code (this is one of Sprache’s features as well). The LINQ libraries and syntax are in close sync with one another, in that the LINQ syntax (“from foo in bar select quux q …”) is closely tied to the expectation that several method signatures are present and available for use. Specifically, if a class provides the Select, SelectMany and Where methods, then LINQ syntax can be used with them.

Figure 3 Where, Select and SelectMany Methods

public static class ParseFnExtensions {
  public static ParseFn<T> Where<T>(
    this ParseFn<T> parser,
    Func<T, bool> pred)
  {
    return input => {
      var res = parser(input);
      if (res == null || !pred(res.Result)) return null;
      return res;
     };
  }
  public static ParseFn<T2> Select<T, T2>(
    this ParseFn<T> parser,
    Func<T, T2> selector)
  {
    return input => {
      var res = parser(input);
      if (res == null) return null;
      return new ParseResult<T2>(selector(
        res.Result),res.Rest);
     };
  }
  public static ParseFn<T2> SelectMany<T, TIntermediate, T2>(
    this ParseFn<T> parser,
    Func<T, ParseFn<TIntermediate>> selector,
    Func<T, TIntermediate, T2> projector)
  {
    return input => {
      var res = parser(input);
      if (res == null) return null;
      var val = res.Result;
      var res2 = selector(val)(res.Rest);
      if (res2 == null) return null;
      return new ParseResult<T2>(projector(
        val, res2.Result),res2.Rest);
    };
  }
}

This gives LINQ the necessary methods to parse LINQ expressions, such as what you saw in the previous article.

I don’t want to go through the exercise of (re)designing a parser combinator library here; both Luke Hoban and Brian McNamara have excellent blog posts on the subject (bit.ly/ctWfU0 and bit.ly/f2geNy, respectively), which, I must point out, serve as the standard against which this column is being written. I want only to demonstrate the mechanism by which these kinds of parsers are constructed in a parser combinator library like Sprache, because that provides the core of the solution to the earlier problem of the three- and four-digit parsers in the phone number parser. In short, we need one parser combinator that reads exactly three digits, and another that reads exactly four digits.

Specifice

Given that the problem is to read precisely three digits and precisely four digits, it stands to reason that we want a function that reads exactly that number of characters from the input stream. The Sprache library doesn’t give us that kind of combinator—there’s a combinator that will read a repeating sequence of whatever-kind-of-character until it runs out of that-kind-of-character, but that’s a “zero-to-many” (hence its name, Many) production rule, not a specific-number-of-characters rule, and thus unhelpful. Looking at its definition can be interesting and insightful, however, as Figure 4 shows.

Figure 4 Definition of the Many Combinator

public static Parser<IEnumerable<T>> Many<T>(
  this Parser<T> parser)
{
  if (parser == null)
    throw new ArgumentNullException("parser");

  return i =>
  {
    var remainder = i;
    var result = new List<T>();
    var r = parser(i);
    while (r is ISuccess<T>)
    {
      var s = r as ISuccess<T>;
      if (remainder == s.Remainder)
          break;

      result.Add(s.Result);
      remainder = s.Remainder;
      r = parser(remainder);
    }

    return new Success<IEnumerable<T>>(result, remainder);
  };
}

For most developers, the hardest part of this method (and, indeed, the entire Sprache library) is that the return value from this function is a function—specifically, a lambda method (always a Parser<> of some form, remember) that takes the string and returns an IEnumerable<T> in its result structure; the meat of the actual parser is buried inside that returned function, which means it will be executed later, not now. This is a long way from merely returning a T!

Once that oddity is dealt with, the rest of the function is pretty clear: imperatively, we step through and call the passed-in Parser<T> to parse each “whatever”; and so long as the parser keeps returning successfully, we keep looping and adding the parsed results to a List<T> that gets returned when everything completes. This serves as the template for my extension to the library, which I’ll call Twice for now, essentially executing a given Parser<T> twice (and yielding an error if either parse fails), as shown in Figure 5.

Figure 5 The Twice Function

public static Parser<IEnumerable<T>> Twice<T>(
  this Parser<T> parser)
{
  if (parser == null)
    throw new ArgumentNullException("parser");

  return i =>
  {
    var remainder = i;
    var result = new List<T>();
    var r = parser(i);
    var c = 0;
    while (c < 2 && r is ISuccess<T>)
    {
      var s = r as ISuccess<T>;
      if (remainder == s.Remainder)
          break;

      result.Add(s.Result);
      remainder = s.Remainder;
      r = parser(remainder);

      c++;
    }

    return new Success<IEnumerable<T>>(result, remainder);
  };
}

In fact, while it would have been a bit easier to write this code by unrolling the loop-of-two into just two sets of imperative statements, remember, Twice isn’t specifically what we’re looking for. We need Thrice and Quadrice, and those are just special-case versions of Twice, with “3” and “4” instead of “2” in the code, which sounds as though we can extract them into a single method that takes the number of times to parse. I choose to call this method “Specifice,” because we’re parsing a specific number of times (see Figure 6).

Figure 6 The Specifice Method

public static Parser<IEnumerable<T>> Twice<T>(
  this Parser<T> parser) {
  return Specifice(parser, 2); }
public static Parser<IEnumerable<T>> Thrice<T>(
  this Parser<T> parser) {
  return Specifice(parser, 3); }
public static Parser<IEnumerable<T>> Quadrice<T>(
  this Parser<T> parser) {
  return Specifice(parser, 4);
}
public static Parser<IEnumerable<T>> Quince<T>(
  this Parser<T> parser) {
  return Specifice(parser, 5);
}
public static Parser<IEnumerable<T>> Specifice<T>(
  this Parser<T> parser, int ct)
{
  if (parser == null)
    throw new ArgumentNullException("parser");

  return i =>
  {
    var remainder = i;
    var result = new List<T>();
    var r = parser(i);
    var c = 0;
    while (c < ct && r is ISuccess<T>)
    {
      var s = r as ISuccess<T>;
      if (remainder == s.Remainder)
          break;

      result.Add(s.Result);
      remainder = s.Remainder;
      r = parser(remainder);

      c++;
    }

    return new Success<IEnumerable<T>>(result, remainder);
  };
}

Thus, we have now extended Sprache to parse exactly “ct” number of parses (characters, digits, whatever kind of Parser<T> we pass in), which opens up Sprache for use in fixed-length parsing scenarios, such as the ubiquitous fixed-length record text file, aka the flat file.

A Functional Approach

Sprache is a parser library for midsize parsing projects, for which regular expressions are too complex and a full-blown parser generator (such as ANTLR or lex/yacc) is overkill. It’s not perfect, in that parser failures generate error messages that can be difficult to understand, but once you’ve gotten through the initial complexities, Sprache can be a useful tool in your developer toolbox.

More than that, though, Sprache demonstrates some of the power and capability offered by a different approach to programming than what we’re used to—in this case, from the functional world. So next time somebody asks you, “What good comes from learning about other languages if you aren’t going to use them on the job?” there’s an easy response: “Even if you aren’t using a language directly, it can teach you techniques and ideas you can use.”

But that’s it for now. Next month, we’ll take a stab at something entirely different. Because there’s only so much concept and theory a language columnist’s audience can take at once.

Happy coding!                                                        


Ted Neward is an architectural consultant with Neudesic LLC. He’s written more than 100 articles, is a C# MVP and INETA speaker and has authored and coauthored a dozen books, including the recently released “Professional F# 2.0” (Wrox). He consults and mentors regularly. Reach him at ted@tedneward.com if you’re interested in having him come work with your team, or read his blog at blogs.tedneward.com.

Thanks to the following technical expert for reviewing this article: Nicholas Blumhardt


About the Author