Recursive Descent Parser: A Simple Grammar

To learn how recursive descent parsers work, it is helpful to implement a very simple grammar, so for pedagogical purposes, I’ve defined a grammar for simple arithmetic expressions. The parser will construct a syntax tree from expressions that we can then examine as necessary. Just for fun, after implementing the parser, we will write a small method that will evaluate the formulas.

This blog is inactive.
New blog: EricWhite.com/blog

Blog TOCThis post is one in a series on using LINQ to write a recursive-descent parser for SpreadsheetML formulas.  You can find the complete list of posts here.

In these expressions, operands are floating point numbers, but for simplicity, I’ve eliminated the ability to have exponents. Floating point numbers are the only variety of operand; to demonstrate how to write the parser, it’s not necessary to include the idea of variables or other types of operands. When writing the parser for Excel formulas, we’ll need to deal with both exponents and other types of operands, as well as many more varieties of issues.

There are five binary (sometimes called infix) operators. Operator precedence needs to be honored when parsing formulas:

Operator

Precedence

+

1

-

1

*

2

/

2

^

3

 

There is one prefix operator, ‘-‘, which has higher precedence than the infix operators.

Operands can consist of a significand (sometimes called mantissa) part, followed by a fractional part.

The grammar allows for white space at appropriate places in the expression. The allowance of white space in this simple grammar parallels the allowance of white space in the Excel Extensions to the Office Open XML SpreadsheetML File Format (.xlsx) Specification.

The grammar allows for use of parentheses.

Here are some examples of formulas that should parse properly:

"(1+3)/3"
" (1+3) "
"-123"
"1+2*(-3)"
"1+2*( - 3)"
"12.34"
".34"
"-123+456"
"-(123+456)"
" ( 123 + 456 ) "
"1+2-3*4/5^6"
"-.34"
"-12.34"

One of the important characteristics of a parser is to report when an expression is invalid per the grammar. Here are some expressions that should throw exceptions:

"-(123+)"
"-(*123)"
"*123"
"123a"
"1."
"--1"

Here is the grammar, as I’ve defined it. There are only eleven rules (other than the rules that are comprised only of terminals):

formula = expression
expression = *whitespace nospace-expression *whitespace
nospace-expression = open-parenthesis expression close-parenthesis /
    expression infix-operator expression / numerical-constant / prefix-operator expression
numerical-constant = [neg-sign] significand-part
significand-part = whole-number-part [fractional-part] / fractional-part
whole-number-part = digit-sequence
fractional-part = full-stop digit-sequence
neg-sign = minus
digit-sequence = 1*decimal-digit
prefix-operator = plus / minus
infix-operator = caret / asterisk / forward-slash / plus / minus

// The following symbols are comprised only of terminals.
decimal-digit = %x30-39
whitespace = %x20
plus = "+"
minus = "-"
asterisk = "*"
forward-slash = "/"
caret = "^"
full-stop = "."
open-parenthesis = "("
close-parenthesis = ")"

In the next post in this series, I’ll start discussing some of the C# / LINQ techniques that we can use to make coding this parser super-easy.