Matching Behavior
The .NET Framework regular expression engine is a backtracking regular expression matcher that incorporates a traditional Nondeterministic Finite Automaton (NFA) engine such as that used by Perl, Python, Emacs, and Tcl. This distinguishes it from faster, but more limited, pure regular expression Deterministic Finite Automaton (DFA) engines such as those found in awk, egrep, or lex. This also distinguishes it from standardized, but slower, POSIX NFAs.
Three Types of Regular Expression Engines
This section provides an overview of the benefits and drawbacks of the three types of engines and explains why the .NET Framework engine implements a traditional NFA matcher.
DFA engines run in linear time because they do not require backtracking (and thus they never test the same character twice). They can also guarantee matching the longest possible string. However, since a DFA engine contains only finite state, it cannot match a pattern with backreferences, and because it does not construct an explicit expansion, it cannot capture subexpressions.
Traditional NFA engines run so-called "greedy" match backtracking algorithms, testing all possible expansions of a regular expression in a specific order and accepting the first match. Because a traditional NFA constructs a specific expansion of the regular expression for a successful match, it can capture subexpression matches and matching backreferences. However, because a traditional NFA backtracks, it can visit exactly the same state multiple times if the state is arrived at over different paths. As a result, it can run exponentially slowly in the worst case. Because a traditional NFA accepts the first match it finds, it can also leave other (possibly longer) matches undiscovered.
POSIX NFA engines are like traditional NFA engines, except that they continue to backtrack until they can guarantee that they have found the longest match possible. As a result, a POSIX NFA engine is slower than a traditional NFA engine, and when using a POSIX NFA you cannot favor a shorter match over a longer one by changing the order of the backtracking search.
Traditional NFA engines are favored by programmers because they are more expressive than either DFA or POSIX NFA engines. Although in the worst case they can run slowly, you can steer them to find matches in linear or polynomial time using patterns that reduce ambiguities and limit backtracking.
.NET Framework Engine Capabilities
Taking the benefits of a traditional NFA engine to heart, the .NET Framework regular expression engine includes a complete set of constructs to allow programmers to steer the backtracking engine. These constructs can be used to find matches faster or to favor specific expansions over others.
Other features include the following capabilities:
"Lazy" quantifiers: ??, *?, +?, {n,m}?. These tell the backtracking engine to search the minimum number of repetitions first. In contrast, ordinary greedy quantifiers try to match the maximum number of repetitions first.
Positive lookahead. This allows the backtracking engine to return to the same spot in the text after matching a subexpression. It is useful for searching throughout the text by verifying multiple patterns starting from the same position.
Negative lookahead. This adds the ability to match an expression only if a subexpression fails to match. This is particularly powerful for pruning a search, because often an expression for a case that should be eliminated is much simpler than an expression for cases that must be included. (For example, it is difficult to write an expression for words that do not begin with "non".)
Conditional evaluation. This allows the engine to search using more than one alternate pattern, depending on the results of previous subexpression matches. This allows a more powerful form of backreference that permits, for example, matching a right parenthesis when a left parenthesis was previously captured in a subexpression.
Nonbacktracking subexpressions (also known as greedy subexpressions). This allows the backtracking engine to guarantee that a subexpression matches only the first match found for that subexpression, as if the expression were running independent of its containing expression. Without this construct, backtracking searches from the larger expression can change the behavior of a subexpression.
Right-to-left matching. This is useful when searching from right to left instead of from left to right or in cases where it is more efficient to begin a match at the right-hand part of the pattern rather than at the left.
Positive and negative lookbehind. Similar to lookahead. Because the regular expression engine allows complete right-to-left matching, regular expressions allow unrestricted lookbehinds.