ECMAScript vs. Canonical Matching Behavior

The behavior of ECMAScript and canonical regular expressions differs in three areas:

  • Character classes are specified differently in matching expressions. Canonical regular expressions support Unicode character categories by default; ECMAScript regular expressions do not support Unicode.
  • A regular expression capture class with a backreference to itself must be updated with each capture iteration.
  • Ambiguities between octal escapes and backreferences are treated differently.
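
The first difference can be illustrated by analogy. The sketch below uses Python's `re` module, which is a different engine from the one this article describes; its `re.ASCII` flag is used here only as a stand-in for a mode in which shorthand classes such as `\w` and `\d` ignore Unicode categories. The sample strings are made up for illustration.

```python
import re

# Sample text with non-ASCII letters and a non-ASCII decimal digit.
words = "na\u00efve caf\u00e9"   # "naive" and "cafe" with accented letters
digits = "4\u0664"               # ASCII 4 followed by ARABIC-INDIC DIGIT FOUR

# Unicode-aware matching (Python's default for str patterns):
# accented letters count as word characters, and any Unicode decimal digit matches \d.
unicode_words = re.findall(r"\w+", words)
unicode_digits = re.findall(r"\d", digits)

# ASCII-only matching: \w is limited to [a-zA-Z0-9_] and \d to [0-9],
# so the accented letters split the words and the Arabic-Indic digit is skipped.
ascii_words = re.findall(r"\w+", words, flags=re.ASCII)
ascii_digits = re.findall(r"\d", digits, flags=re.ASCII)

print(unicode_words, ascii_words)
print(unicode_digits, ascii_digits)
```

The same pattern text yields different matches depending only on whether the shorthand classes are Unicode-aware, which is the kind of divergence the first bullet describes.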

The following table summarizes the differences in octal versus backreference interpretation by canonical and ECMAScript regular expressions.

| Canonical regular expression behavior | ECMAScript behavior |
| --- | --- |
| If \ is followed by 0 followed by 0 to 2 octal digits, interpret as an octal. For example, \044 always means $. | Same behavior. |
| If \ is followed by a digit from 1 to 9, followed by no additional decimal digits, interpret as a backreference. For example, \9 always means backreference 9, even if capture 9 does not exist. If the capture does not exist, the regular expression parser throws a syntax exception. | If a single decimal digit capture exists, backreference to that digit. Otherwise, interpret as a literal. |
| If \ is followed by a digit from 1 to 9, followed by additional decimal digits, convert the digits to a decimal value. If that capture exists, interpret as a backreference. Otherwise, interpret as an octal using the leading octal digits up to value \377; interpret the remaining digits as literals. For example, for \400, if capture 400 exists, interpret as backreference 400; if capture 400 does not exist, interpret as octal \40 followed by 0. | If \ is followed by a digit from 1 to 9, followed by any additional decimal digits, interpret as a backreference by converting as many digits as possible to a decimal value that can refer to a capture. If no digits can be converted, interpret as an octal using the leading octal digits up to value \377; interpret the remaining digits as literals. |
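
The rules in the table can be modeled as a small decision function. The following Python sketch is only a reading of the table, not the actual parser: the names `Interpretation`, `interpret`, and `_octal_prefix` are hypothetical, captures are assumed to be numbered 1 through `capture_count`, and the fallback to literal digits when no leading octal digit exists (for example `\89` with no captures) is an assumption, since the table does not cover that case.

```python
from typing import NamedTuple


class Interpretation(NamedTuple):
    kind: str   # "octal", "backreference", or "literal"
    value: int  # character code, capture number, or -1 for "literal"
    rest: str   # trailing digits that are treated as literal characters


def _octal_prefix(digits: str) -> Interpretation:
    """Consume up to three leading octal digits while the value stays <= 0o377."""
    value, used = 0, 0
    for ch in digits[:3]:
        if ch not in "01234567" or value * 8 + int(ch) > 0o377:
            break
        value = value * 8 + int(ch)
        used += 1
    if used == 0:
        # Assumption: not specified by the table; treat the digits as literals.
        return Interpretation("literal", -1, digits)
    return Interpretation("octal", value, digits[used:])


def interpret(digits: str, capture_count: int, ecmascript: bool) -> Interpretation:
    """Classify the decimal digits that follow a backslash, per the table."""
    if not digits or any(ch not in "0123456789" for ch in digits):
        raise ValueError("expected one or more decimal digits")

    # Row 1: a leading 0 always starts an octal escape in both modes.
    if digits[0] == "0":
        return _octal_prefix(digits)

    # Row 2: a single digit from 1 to 9.
    if len(digits) == 1:
        number = int(digits)
        if number <= capture_count:
            return Interpretation("backreference", number, "")
        if ecmascript:
            return Interpretation("literal", -1, digits)
        raise ValueError(f"reference to undefined capture {number}")

    # Row 3, canonical: all digits form one number, else fall back to octal.
    if not ecmascript:
        number = int(digits)
        if number <= capture_count:
            return Interpretation("backreference", number, "")
        return _octal_prefix(digits)

    # Row 3, ECMAScript: longest digit prefix that names an existing capture.
    for end in range(len(digits), 0, -1):
        number = int(digits[:end])
        if number <= capture_count:
            return Interpretation("backreference", number, digits[end:])
    return _octal_prefix(digits)
```

For instance, under this model `interpret("400", 0, ecmascript=False)` yields an octal escape with value 32 (octal 40) followed by a literal `0`, matching the `\400` example in the table, while `interpret("12", 1, ecmascript=True)` yields a backreference to capture 1 followed by a literal `2`.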

See Also

Regular Expression Language Elements