Backreferences
A backreference identifies the substring captured by a matched group in a regular expression. Each backreference is identified by a number or a name, and is referred to by the notation "\number" or "\k<name>". For example, if the input string contains multiple occurrences of an arbitrary substring, you can match the first occurrence with a capture group, then use a backreference to match subsequent occurrences of the substring. For more information, see Backreference Constructs and Grouping Constructs.
Backreferences provide a convenient way to find repeating groups of characters. They can be thought of as a shorthand instruction to match the same string again. For instance, to find repeating adjacent characters such as the two Ls in the word "tall", you would use the regular expression (?<char>\w)\k<char>, which uses the metacharacter \w to find any single word character. The grouping construct (?<char> ) encloses the metacharacter to force the regular expression engine to remember a subexpression match (which in this case will be any single word character) and save it under the name "char". The backreference construct \k<char> causes the engine to compare the current character to the previously matched character stored under "char". The entire regular expression successfully finds a match wherever a single word character is the same as the preceding character.
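The doubled-character pattern above can be tried out with a minimal sketch. It assumes Python's re module rather than the .NET engine this section describes, so the named group is written (?P<char>...) and the named backreference (?P=char); the helper name find_doubled is hypothetical.

```python
import re

# Assumption: Python's re module, which spells the named group (?P<char>...)
# and the named backreference (?P=char) instead of (?<char>...) and \k<char>.
DOUBLED_CHAR = re.compile(r"(?P<char>\w)(?P=char)")


def find_doubled(text):
    """Return every pair of identical adjacent word characters in text."""
    return [m.group(0) for m in DOUBLED_CHAR.finditer(text)]


print(find_doubled("tall"))        # ['ll']
print(find_doubled("bookkeeper"))  # ['oo', 'kk', 'ee']
```

Because \w also covers digits and the underscore, the same pattern reports doubled digits such as "11".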
To find repeating whole words, you can modify the grouping subexpression to search for any group of word characters preceded by a white-space character instead of simply searching for any single character. You can substitute the subexpression \w+, which matches one or more word characters, for the metacharacter \w and use the metacharacter \s to match the white-space character preceding the character group. This yields the regular expression (?<char>\s\w+)\k<char>, which finds any repeating whole words such as " the the" but also matches other repetitions of the specified string, as in " the theory", where the second " the" is only the beginning of " theory".
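A short sketch shows both the intended match and the false positive. It assumes Python's re module, so the group is written (?P<word>...) and the backreference (?P=word); the group name "word" and the helper first_repeat are hypothetical.

```python
import re

# Assumption: Python syntax for the named group and its backreference.
REPEATED = re.compile(r"(?P<word>\s\w+)(?P=word)")


def first_repeat(text):
    """Return the first repeated ' word' sequence in text, or None."""
    m = REPEATED.search(text)
    return m.group(0) if m else None


print(first_repeat("Paris in the the spring"))  # ' the the'
# False positive: the second " the" is only the start of " theory".
print(first_repeat("study the theory"))         # ' the the'
```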
To verify that the second match is on a word boundary, add the metacharacter \b after the repeat match. The resulting regular expression, (?<char>\s\w+)\k<char>\b, finds only repeating whole words that are preceded by white space.
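The effect of the trailing \b can be checked with the following sketch, again assuming Python's re module and its (?P<word>...) and (?P=word) spellings; the helper name first_repeated_word is hypothetical.

```python
import re

# Assumption: Python syntax; \b behaves as a word boundary here as well.
REPEATED_WORD = re.compile(r"(?P<word>\s\w+)(?P=word)\b")


def first_repeated_word(text):
    """Return the first repeated whole word (with its leading space), or None."""
    m = REPEATED_WORD.search(text)
    return m.group(0) if m else None


print(first_repeated_word("Paris in the the spring"))  # ' the the'
# The repeat inside " theory" no longer ends on a word boundary.
print(first_repeated_word("study the theory"))         # None
```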
Parsing Backreferences
The expressions \1 through \9 always refer to backreferences, not octal codes. Multidigit expressions \11 and up are considered backreferences if there is a backreference corresponding to that number; otherwise, they are interpreted as octal codes (unless the starting digits are 8 or 9, in which case they are treated as literal "8" and "9"). If a regular expression contains a backreference to an undefined group number, it is considered a parsing error. If the ambiguity is a problem, you can use the \k<n> notation, which is unambiguous and cannot be confused with octal character codes; similarly, hexadecimal codes such as \xdd are unambiguous and cannot be confused with backreferences.
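As a rough illustration only, the sketch below uses Python's re module, whose parsing rules overlap with but are not identical to the ones described here. It shows \11 being read as a backreference once eleven groups exist and \x41 being read as a hexadecimal code. Unlike the octal fallback described above, Python reports \11 as an error when fewer than eleven groups are defined. The helper compiles is hypothetical.

```python
import re


def compiles(pattern):
    """Return True if Python's re module accepts the pattern."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


ELEVEN_GROUPS = r"(a)(b)(c)(d)(e)(f)(g)(h)(i)(j)(k)"

# With eleven groups defined, \11 refers back to group 11, which captured "k".
print(re.fullmatch(ELEVEN_GROUPS + r"\11", "abcdefghijkk") is not None)  # True

# \x41 is a hexadecimal code for "A", not a backreference.
print(re.fullmatch(r"(A)\x41", "AA") is not None)  # True

# Python-specific behavior: with only one group, \11 is rejected outright
# instead of being reinterpreted as an octal code.
print(compiles(r"(a)\11"))  # False
```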
Backreference behavior is slightly different when the ECMAScript option flag is enabled. For more information, see ECMAScript vs. Canonical Matching Behavior.
Matching Backreferences
A backreference refers to the most recent definition of a group (the definition most immediately to the left, when matching left to right). Specifically, when a group makes multiple captures, a backreference refers to the most recent capture. For example, (?<1>a)(?<1>\1b)* matches aababb, with the capturing pattern (a)(ab)(abb). Looping quantifiers do not clear group definitions.
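Python's re module does not allow a group number to be defined twice, so the example above cannot be reproduced there directly. The sketch below instead uses a hypothetical pattern, (a|b)+\1, to illustrate the narrower point that a backreference to a repeated group sees its most recent capture.

```python
import re

# Hypothetical pattern: the group (a|b) captures once per loop iteration,
# and \1 is compared against whatever the last iteration captured.
LAST_CAPTURE = re.compile(r"(a|b)+\1")


def matches_whole(text):
    """Return True if the entire text matches (a|b)+\\1."""
    return LAST_CAPTURE.fullmatch(text) is not None


# Iterations capture "a" then "b"; \1 must therefore match the final "b".
print(matches_whole("abb"))  # True
# Would match only if \1 referred to the first capture "a"; it does not.
print(matches_whole("aba"))  # False
```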
If a group has not captured any substring, a backreference to that group is undefined and never matches. For example, the expression \1() never matches anything, but the expression ()\1 matches the empty string.
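The difference between an empty capture and no capture can be sketched as follows, assuming Python's re module. Python rejects \1() when the pattern is compiled, so the hypothetical pattern (?:(a)|b)\1 is used instead to reach a backreference whose group took no part in the match.

```python
import re

# The group captures the empty string, so \1 matches the empty string.
EMPTY_THEN_REF = re.compile(r"()\1")

# Hypothetical pattern: when the "b" branch is taken, group 1 captures
# nothing at all, and the backreference to it cannot match.
UNSET_THEN_REF = re.compile(r"(?:(a)|b)\1")

print(EMPTY_THEN_REF.fullmatch("") is not None)    # True
print(UNSET_THEN_REF.fullmatch("aa") is not None)  # True
print(UNSET_THEN_REF.search("b") is not None)      # False
```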