ECMAScript vs. Canonical Matching Behavior
The behavior of ECMAScript and canonical regular expressions differs in three areas:
- Character classes are specified differently in matching expressions. Canonical regular expressions support Unicode character categories by default. ECMAScript-compliant expressions do not, so their character classes have a more limited syntax (for example, the Unicode category elements \p and \P are unavailable).
- A capturing group that contains a backreference to itself must be updated with each capture iteration.
- Ambiguities between octal escapes and backreferences are treated differently.
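The character class difference can be illustrated by analogy. The sketch below uses Python's `re` module, which is not the engine this section describes; its `re.ASCII` flag is only assumed to be a reasonable stand-in for the non-Unicode, ECMAScript-style interpretation of `\d` and `\w`. The sample strings are made up for illustration.

```python
import re

# Sample inputs (hypothetical): a decimal digit and a word that both
# contain characters outside the ASCII range.
arabic_indic_three = "\u0663"
accented_word = "caf\u00e9"

# Default str patterns in Python are Unicode-aware, which is comparable
# to the canonical behavior described above.
unicode_digit = re.fullmatch(r"\d", arabic_indic_three)
unicode_word = re.fullmatch(r"\w+", accented_word)

# re.ASCII restricts \d and \w to ASCII, which is comparable to the
# ECMAScript behavior described above.
ascii_digit = re.fullmatch(r"\d", arabic_indic_three, re.ASCII)
ascii_word = re.fullmatch(r"\w+", accented_word, re.ASCII)

print(bool(unicode_digit), bool(ascii_digit))  # True False
print(bool(unicode_word), bool(ascii_word))    # True False
```

The same pattern therefore accepts or rejects the same input depending only on whether the character classes are Unicode-aware.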
The following table summarizes the differences in octal versus backreference interpretation by canonical and ECMAScript regular expressions.
Canonical regular expression behavior | ECMAScript behavior |
---|---|
If \ is followed by 0 followed by 0 to 2 octal digits, interpret as an octal. For example, \044 always means $. | Same behavior. |
If \ is followed by a digit from 1 to 9, followed by no additional decimal digits, interpret as a backreference. For example, \9 always means backreference 9, even if capture 9 does not exist. If the capture does not exist, the regular expression parser throws a syntax exception. | If a single decimal digit capture exists, backreference to that digit. Otherwise, interpret as a literal. |
If \ is followed by a digit from 1 to 9, followed by additional decimal digits, convert the digits to a decimal value. If that capture exists, interpret as a backreference. Otherwise, interpret as an octal using the leading octal digits up to value \377; interpret the remaining digits as literals. For example, for \400, if capture 400 exists, interpret as backreference 400; if capture 400 does not exist, interpret as octal \40 followed by 0. | If \ is followed by a digit from 1 to 9, followed by any additional decimal digits, interpret as a backreference by converting as many digits as possible to a decimal value that can refer to a capture. If no digits can be converted, interpret as an octal using the leading octal digits up to value \377; interpret the remaining digits as literals. |
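The rules in the table can be traced with a small simulation. The following Python sketch is not a regular expression engine and not any library's API. It is a hypothetical helper, `interpret`, that takes the digits following `\` and the number of capturing groups, then returns how each part would be read under the rules as stated above. It assumes captures are numbered 1 through `capture_count`, and it raises an error for escapes such as `\89` with no matching capture, a case the table does not cover.

```python
OCTAL_DIGITS = "01234567"


def _literals(digits):
    """Each leftover digit is read as a literal character."""
    return [("literal", ch) for ch in digits]


def _as_octal(digits):
    # Consume up to three leading octal digits while the value stays
    # at or below octal 377; the remaining digits become literals.
    value, used = 0, 0
    for ch in digits[:3]:
        if ch not in OCTAL_DIGITS or value * 8 + int(ch) > 0o377:
            break
        value = value * 8 + int(ch)
        used += 1
    if used == 0:
        # Assumption: the table does not say what happens here.
        raise ValueError("no leading octal digits in \\" + digits)
    return [("octal", chr(value))] + _literals(digits[used:])


def interpret(digits, capture_count, ecmascript=False):
    """Classify the digits after a backslash (hypothetical helper)."""

    def exists(n):
        return 1 <= n <= capture_count

    # Row 1: \0 plus 0 to 2 octal digits is an octal in both modes.
    if digits[0] == "0":
        return _as_octal(digits)

    # Row 2: a single digit from 1 to 9.
    if len(digits) == 1:
        n = int(digits)
        if exists(n):
            return [("backref", n)]
        if ecmascript:
            return _literals(digits)
        raise ValueError("reference to undefined capture " + digits)

    # Row 3: a digit from 1 to 9 followed by more decimal digits.
    if ecmascript:
        # Use the longest prefix that names an existing capture.
        for end in range(len(digits), 0, -1):
            n = int(digits[:end])
            if exists(n):
                return [("backref", n)] + _literals(digits[end:])
        return _as_octal(digits)
    n = int(digits)
    if exists(n):
        return [("backref", n)]
    return _as_octal(digits)


print(interpret("044", 0))                   # [('octal', '$')]
print(interpret("400", 0))                   # [('octal', ' '), ('literal', '0')]
print(interpret("12", 1))                    # [('octal', '\n')]
print(interpret("12", 1, ecmascript=True))   # [('backref', 1), ('literal', '2')]
```

The last two calls show where the two modes diverge: with a single capturing group, `\12` is read as octal 12 under the canonical rules, but as backreference 1 followed by a literal 2 under the ECMAScript rules.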