Поделиться через


A Simple Lexer in C# That Uses Regular Expressions

Lately, I’ve been working on a (personal) project to build a DSL using Vaughan Pratt’s top down operator precedence parsing technique, which I discovered through Douglas Crockford’s paper on the same subject (and through my friend Matt Manela). Because Crockford’s paper leaves the lexer as an exercise for the reader, I set out to write a very simple lexer in C# that would provide tokens to my parser.

Design

I chose to use regular expressions for token definitions rather than a more traditional, character-based scanner; my time developing ColorCode (which is essentially a specialized lexer) left a mental imprint for a regular expression-based lexer and I decided to run with it. The token definitions, then, are just a collection of regular expressions wrapped with an associated token type identifier (e.g., “operator” or “literal”) and a flag indicating whether they should be ignored:

     public class TokenDefinition
    {

        public bool IsIgnored { get; set; }
        public Regex Regex { get; set; }
        public string Type { get; set; }
    }

In Crockford’s paper, the lexer provides an array of tokens to the parser, but I want this lexer to fit C#’s paradigms as best I can. To that end, I think the lexer should return IEnumerable<Token>. This, then, is my lexer’s contract:

     public interface ILexer
    {
        void AddDefinition(TokenDefinition tokenDefinition);
        IEnumerable<Token> Tokenize(string source);
    }

 

Each token will have a type, value, and position (that is, its position within the source code):

     public class Token
    {
        public TokenPosition Position { get; set; }
        public string Type { get; set; }
        public string Value { get; set; }
    }
     public class TokenPosition
    {
        public int Column { get; set; }
        public int Index { get; set; }
        public int Line { get; set; }
    }

That’s the entire contract. It’s design is very simple, as you’d expect.

Implementation

The tokenize method will test the current source code position against each definition’s regular expression until it finds a match. When it finds a match, it will yield the matched token. If it doesn’t find a match, it will throw an exception for an unrecognized symbol. It will also track the current index, line number, and column position within the source code as it tokenizes. It will finally yield an end token when it reaches the end of the source code.

     public IEnumerable<Token> Tokenize(string source)
    {
        int currentIndex = 0;
        int currentLine = 1;
        int currentColumn = 0;

        while (currentIndex < source.Length)
        {
            TokenDefinition matchedDefinition = null;
            int matchLength = 0;

            foreach (var rule in tokenDefinitions)
            {
                var match = rule.Regex.Match(source, currentIndex);

                if (match.Success && (match.Index - currentIndex) == 0)
                {
                    matchedDefinition = rule;
                    matchLength = match.Length;
                    break;
                }
            }

            if (matchedDefinition == null)
            {
                throw new Exception(string.Format("Unrecognized symbol '{0}' at index {1} (line {2}, column {3}).", source[currentIndex], currentIndex, currentLine, currentColumn));
            }
            else
            {
                var value = source.Substring(currentIndex, matchLength);

                if (!matchedDefinition.IsIgnored)
                    yield return new Token(matchedDefinition.Type, value, new TokenPosition(currentIndex, currentLine, currentColumn));

                var endOfLineMatch = endOfLineRegex.Match(value);
                if (endOfLineMatch.Success)
                {
                    currentLine += 1;
                    currentColumn = value.Length - (endOfLineMatch.Index + endOfLineMatch.Length);
                }
                else
                {
                    currentColumn += matchLength;
                }

                currentIndex += matchLength;
            }
        }

        yield return new Token("(end)", null, new TokenPosition(currentIndex, currentLine, currentColumn));
    }

Simple, as promised.

Because it evaluates the token definitions top to bottom, you can use the order for precedence (if you care about the precedence of lexical definitions; I haven’t encountered that need myself yet).

The complete source is available for download.

SimpleLexer.zip