ECMAScript vs. Canonical Matching Behavior

11/03/2006

The behavior of ECMAScript and canonical regular expressions differs in three areas:

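Character classes are specified differently in matching expressions. Canonical regular expressions support Unicode character categories by default, so classes such as `\w` and `\d` match Unicode letters and digits. ECMAScript matching does not support Unicode categories; the same classes match ASCII characters only.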
Self-referencing capturing groups are handled differently. In ECMAScript matching, a capturing group that contains a backreference to itself must be updated with each capture iteration.
Ambiguities between octal escapes and backreferences are treated differently.
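
The character class difference can be illustrated by analogy. The sketch below uses Python's `re` module rather than the regular expression engine this article describes, so it is only an analogy: Python's default mode stands in for canonical (Unicode-aware) matching, and the `re.ASCII` flag stands in for ECMAScript matching, on the assumption that both restrict `\d` to the digits 0 through 9.

```python
import re

# Three Arabic-Indic digits (Unicode category Nd), none of them ASCII.
arabic_indic = "\u0663\u0664\u0665"

# Default str patterns are Unicode-aware: \d covers every decimal digit.
unicode_hit = re.fullmatch(r"\d+", arabic_indic)

# re.ASCII narrows \d to [0-9], so the same pattern no longer matches.
ascii_hit = re.fullmatch(r"\d+", arabic_indic, flags=re.ASCII)

print(unicode_hit is not None)  # True
print(ascii_hit is not None)    # False
```

The same narrowing applies to `\w` and `\s` under `re.ASCII`, which is why a pattern validated against ASCII input can behave differently once it meets non-ASCII text.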

The following table summarizes the differences in octal versus backreference interpretation by canonical and ECMAScript regular expressions.

Canonical regular expression behavior	ECMAScript behavior
If \ is followed by 0 followed by 0 to 2 octal digits, interpret as an octal. For example, `\044` always means '$'.	Same behavior.
If \ is followed by a digit from 1 to 9, followed by no additional decimal digits, interpret as a backreference. For example, `\9` always means backreference 9, even if capture 9 does not exist. If the capture does not exist, the regular expression parser throws a syntax exception.	If a capture with that single-digit number exists, interpret as a backreference to that capture. Otherwise, interpret the digit as a literal.
If \ is followed by a digit from 1 to 9, followed by additional decimal digits, interpret the digits as a decimal value. If that capture exists, interpret the expression as a backreference. Otherwise, interpret the leading octal digits up to octal 377, that is, consider only the low 8 bits of the value; interpret the remaining digits as literals. For example, in the expression `\3000`, if capture 300 exists, interpret as backreference 300; if capture 300 does not exist, interpret as octal 300 followed by 0.	If \ is followed by a digit from 1 to 9, followed by any additional decimal digits, interpret as a backreference by converting as many digits as possible to a decimal value that can refer to a capture. If no digits can be converted, interpret as an octal using the leading octal digits up to octal 377; interpret the remaining digits as literals.
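
The rules in the table can be sketched as a small decision function. This is a minimal model written in Python, not the parser's actual implementation: the names `Reading` and `resolve` are hypothetical, `captures` is an assumed set of existing capture numbers, and `digits` is the whole run of decimal digits that follows the backslash. The sketch reads a multi-digit run as a single decimal value for the canonical case, following the wording of the rule, and raises `ValueError` for any case the table does not cover.

```python
from typing import AbstractSet, NamedTuple, Optional

DECIMAL = "0123456789"
OCTAL = "01234567"


class Reading(NamedTuple):
    kind: str             # "octal", "backref", or "literal"
    value: Optional[int]  # character code or capture number, if any
    literal: str          # trailing digits read as literal characters


def _octal(digits: str) -> Reading:
    # Take at most three leading octal digits and keep the low 8 bits.
    n = 0
    while n < min(3, len(digits)) and digits[n] in OCTAL:
        n += 1
    if n == 0:
        raise ValueError("case not covered by the table")
    return Reading("octal", int(digits[:n], 8) & 0o377, digits[n:])


def resolve(digits: str, captures: AbstractSet[int],
            ecmascript: bool = False) -> Reading:
    """Classify a backslash followed by the decimal digit run `digits`."""
    if not digits or any(c not in DECIMAL for c in digits):
        raise ValueError("expected a run of decimal digits")
    if digits[0] == "0":
        # Row 1: a leading 0 always means an octal escape.
        return _octal(digits)
    if len(digits) == 1:
        # Row 2: a single digit from 1 to 9.
        n = int(digits)
        if n in captures:
            return Reading("backref", n, "")
        if ecmascript:
            return Reading("literal", None, digits)
        raise ValueError(f"reference to undefined capture {n}")
    # Row 3: a digit from 1 to 9 followed by more digits.
    if ecmascript:
        # Longest prefix that names an existing capture wins.
        for end in range(len(digits), 0, -1):
            n = int(digits[:end])
            if n in captures:
                return Reading("backref", n, digits[end:])
    elif int(digits) in captures:
        return Reading("backref", int(digits), "")
    return _octal(digits)


print(resolve("044", set()))                      # octal 36, which is '$'
print(resolve("120", {1, 12}))                    # octal 80, no literal tail
print(resolve("120", {1, 12}, ecmascript=True))   # backref 12, literal "0"
```

With captures 1 and 12 defined, the same escape `\120` resolves to a single octal character under the canonical rules but to backreference 12 followed by a literal `0` under the ECMAScript rules, which is the kind of ambiguity the table describes.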

