Regular expression example: Scanning for HREFs

The following example searches an input string and displays all the href="…" values and their locations in the string.

Warning

When you use System.Text.RegularExpressions to process untrusted input, pass a timeout. A malicious user can supply input that triggers excessive backtracking in the regular expression engine, resulting in a denial-of-service attack. ASP.NET Core framework APIs that use RegularExpressions pass a timeout.
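
As a minimal sketch of the advice above, the following helper passes an explicit timeout to Regex.IsMatch and reports a timeout as a failed check instead of letting the exception propagate. The TryIsMatch name, the TimeoutSketch class, and the 500-millisecond value are assumptions made for illustration; they are not part of .NET or of the example that follows.

```csharp
using System;
using System.Text.RegularExpressions;

public static class TimeoutSketch
{
    // Hypothetical helper: returns true if the match attempt finished within
    // the timeout (the result is in 'matched'), or false if it timed out.
    public static bool TryIsMatch(string input, string pattern, TimeSpan timeout, out bool matched)
    {
        try
        {
            matched = Regex.IsMatch(input, pattern, RegexOptions.None, timeout);
            return true;
        }
        catch (RegexMatchTimeoutException)
        {
            matched = false;
            return false;
        }
    }

    public static void Main()
    {
        // The timeout value is an arbitrary choice for this sketch.
        TimeSpan timeout = TimeSpan.FromMilliseconds(500);
        string input = "<a href=\"https://example.com\">";

        if (TryIsMatch(input, @"href\s*=", timeout, out bool matched))
            Console.WriteLine($"Completed; matched = {matched}");
        else
            Console.WriteLine("The matching operation timed out.");
    }
}
```

How a caller should react to a timeout (reject the input, log it, retry with a stricter pattern) depends on the application; the sketch only shows where the decision point is.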

The Regex object

Because the DumpHRefs method can be called multiple times from user code, it uses the static (Shared in Visual Basic) Regex.Match(String, String, RegexOptions, TimeSpan) method. This enables the regular expression engine to cache the regular expression and avoids the overhead of instantiating a new Regex object each time the method is called. The TimeSpan argument sets the match timeout. A Match object is then used to iterate through all matches in the string.

private static void DumpHRefs(string inputString)
{
    string hrefPattern = @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>[^>\s]+))";

    try
    {
        Match regexMatch = Regex.Match(inputString, hrefPattern,
                                       RegexOptions.IgnoreCase | RegexOptions.Compiled,
                                       TimeSpan.FromSeconds(1));
        while (regexMatch.Success)
        {
            Console.WriteLine($"Found href {regexMatch.Groups[1]} at {regexMatch.Groups[1].Index}");
            regexMatch = regexMatch.NextMatch();
        }
    }
    catch (RegexMatchTimeoutException)
    {
        Console.WriteLine("The matching operation timed out.");
    }
}

Private Sub DumpHRefs(inputString As String)
    Dim hrefPattern As String = "href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>[^>\s]+))"

    Try
        Dim regexMatch = Regex.Match(inputString, hrefPattern,
                                     RegexOptions.IgnoreCase Or RegexOptions.Compiled,
                                     TimeSpan.FromSeconds(1))
        Do While regexMatch.Success
            Console.WriteLine($"Found href {regexMatch.Groups(1)} at {regexMatch.Groups(1).Index}")
            regexMatch = regexMatch.NextMatch()
        Loop
    Catch e As RegexMatchTimeoutException
        Console.WriteLine("The matching operation timed out.")
    End Try
End Sub

The following example then illustrates a call to the DumpHRefs method.

public static void Main()
{
    string inputString = "My favorite web sites include:</P>" +
                         "<A HREF=\"https://docs.microsoft.com/dotnet/\">" +
                         ".NET Documentation</A></P>" +
                         "<A HREF=\"http://www.microsoft.com\">" +
                         "Microsoft Corporation Home Page</A></P>" +
                         "<A HREF=\"https://devblogs.microsoft.com/dotnet/\">" +
                         ".NET Blog</A></P>";
    DumpHRefs(inputString);
}
// The example displays the following output:
//       Found href https://docs.microsoft.com/dotnet/ at 43
//       Found href http://www.microsoft.com at 114
//       Found href https://devblogs.microsoft.com/dotnet/ at 188

Public Sub Main()
    Dim inputString As String = "My favorite web sites include:</P>" &
                                "<A HREF=""https://docs.microsoft.com/dotnet/"">" &
                                ".NET Documentation</A></P>" &
                                "<A HREF=""http://www.microsoft.com"">" &
                                "Microsoft Corporation Home Page</A></P>" &
                                "<A HREF=""https://devblogs.microsoft.com/dotnet/"">" &
                                ".NET Blog</A></P>"
    DumpHRefs(inputString)
End Sub
' The example displays the following output:
'       Found href https://docs.microsoft.com/dotnet/ at 43
'       Found href http://www.microsoft.com at 114
'       Found href https://devblogs.microsoft.com/dotnet/ at 188

The regular expression pattern href\s*=\s*(?:["'](?<1>[^"']*)["']|(?<1>[^>\s]+)) is interpreted as shown in the following table.

Pattern               Description
href                  Match the literal string "href". The match is case-insensitive.
\s*                   Match zero or more white-space characters.
=                     Match the equals sign.
\s*                   Match zero or more white-space characters.
(?:                   Start a non-capturing group.
["'](?<1>[^"']*)["']  Match a quotation mark or apostrophe, followed by a capturing group that matches zero or more characters other than a quotation mark or apostrophe, followed by a quotation mark or apostrophe. The capture is stored in the group named 1.
|                     Alternation: match either the preceding expression or the following expression.
(?<1>[^>\s]+)         A capturing group that uses a negated character set to match one or more characters other than a greater-than sign or a white-space character. The capture is stored in the group named 1.
)                     End the non-capturing group.
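
To see the two alternatives in the table at work, the following sketch runs the same pattern against a double-quoted value, a single-quoted value, an unquoted value with white space around the equals sign, and a tag with no href at all. The PatternSketch class, the TryExtractHref helper, and the sample tags are assumptions made for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

public static class PatternSketch
{
    // The pattern from the example above, unchanged.
    private const string HrefPattern =
        @"href\s*=\s*(?:[""'](?<1>[^""']*)[""']|(?<1>[^>\s]+))";

    // Hypothetical helper: puts the text captured in group 1 into 'value'
    // and returns true, or returns false when the pattern does not match.
    public static bool TryExtractHref(string tag, out string value)
    {
        Match m = Regex.Match(tag, HrefPattern, RegexOptions.IgnoreCase,
                              TimeSpan.FromSeconds(1));
        value = m.Success ? m.Groups[1].Value : string.Empty;
        return m.Success;
    }

    public static void Main()
    {
        string[] samples =
        {
            "<a href=\"page1.html\">",  // quoted alternative, double quotes
            "<a href='page2.html'>",    // quoted alternative, single quotes
            "<a HREF = page3.html>",    // unquoted alternative, spaces around =
            "<a name=top>"              // no href attribute
        };

        foreach (string sample in samples)
        {
            if (TryExtractHref(sample, out string value))
                Console.WriteLine($"{sample} -> {value}");
            else
                Console.WriteLine($"{sample} -> (no match)");
        }
    }
}
```

Whichever alternative succeeds, the caller reads the value from the same group, which is why both alternatives capture into group 1.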

Match result class

The results of a search are stored in the Match class, which provides access to all the substrings extracted by the search. It also remembers the string being searched and the regular expression being used, so you can call the Match.NextMatch method to perform another search starting where the last one ended.
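
The following sketch isolates that behavior: it walks a string with Match.NextMatch and records the Index and Value of each match. The MatchWalkSketch class, the Walk helper, and the sample input are assumptions made for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

public static class MatchWalkSketch
{
    // Hypothetical helper: collects the position and text of every match
    // by chaining Match.NextMatch calls.
    public static List<(int Index, string Value)> Walk(string input, string pattern)
    {
        var results = new List<(int Index, string Value)>();
        Match m = Regex.Match(input, pattern, RegexOptions.None,
                              TimeSpan.FromSeconds(1));
        while (m.Success)
        {
            results.Add((m.Index, m.Value));
            // The Match object carries the input and the pattern, so no
            // arguments are needed to continue the search.
            m = m.NextMatch();
        }
        return results;
    }

    public static void Main()
    {
        foreach (var (index, value) in Walk("a1 b22 c333", @"\d+"))
            Console.WriteLine($"'{value}' at {index}");
    }
}
// The sketch displays the following output:
//       '1' at 1
//       '22' at 4
//       '333' at 8
```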

Explicitly named captures

In traditional regular expressions, capturing parentheses are automatically numbered sequentially. This leads to two problems. First, if a regular expression is modified by inserting or removing a set of parentheses, all code that refers to the numbered captures must be rewritten to reflect the new numbering. Second, because different sets of parentheses often are used to provide two alternative expressions for an acceptable match, it might be difficult to determine which of the two expressions actually returned a result.

To address these problems, the Regex class supports the syntax (?<name>…) for capturing a match into a specified slot (the slot can be named using a string or an integer; integers can be recalled more quickly). Thus, alternative matches for the same string can all be directed to the same place. If more than one capture is dropped into the same slot, the last one is reported as the successful match. (However, a complete list of multiple matches for a single slot is available. See the Group.Captures collection for details.)
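
A minimal sketch of both points follows. The first helper uses the href pattern with the slot named url instead of 1, so either alternative can be read from one place. The second uses a small pattern in which one group captures several times, and reads every capture from Group.Captures. The NamedCaptureSketch class, the helper names, the group names url and word, and the sample inputs are assumptions made for illustration.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

public static class NamedCaptureSketch
{
    // Same shape as the example pattern; both alternatives write to "url".
    private const string HrefPattern =
        @"href\s*=\s*(?:[""'](?<url>[^""']*)[""']|(?<url>[^>\s]+))";

    // Hypothetical helper: reads the value from the single named slot.
    public static string HrefValue(string tag)
    {
        Match m = Regex.Match(tag, HrefPattern, RegexOptions.IgnoreCase,
                              TimeSpan.FromSeconds(1));
        return m.Groups["url"].Value;
    }

    // Hypothetical helper: the "word" group captures once per loop iteration.
    public static Match MatchWords(string input)
    {
        return Regex.Match(input, @"(?:(?<word>\w+)\s*)+", RegexOptions.None,
                           TimeSpan.FromSeconds(1));
    }

    public static void Main()
    {
        Console.WriteLine(HrefValue("<a href='a.html'>"));  // quoted alternative
        Console.WriteLine(HrefValue("<a href=b.html>"));    // unquoted alternative

        Group word = MatchWords("one two three").Groups["word"];
        // Group.Value reports the last capture dropped into the slot.
        Console.WriteLine($"Last capture: {word.Value}");
        // Group.Captures keeps every capture, in order.
        string all = string.Join(", ", word.Captures.Cast<Capture>().Select(c => c.Value));
        Console.WriteLine($"All captures: {all}");
    }
}
// The sketch displays the following output:
//       a.html
//       b.html
//       Last capture: three
//       All captures: one, two, three
```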

See also