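.NET regular expression source generators

A regular expression, or regex, is a string that lets a developer express a pattern to search for, which makes it a common way to search text and extract results as a subset of the searched string. In .NET, the System.Text.RegularExpressions namespace is used to define Regex instances and static methods and to match against user-defined patterns. This article shows how to use source generation to produce Regex instances that are optimized for performance.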

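Note

Where possible, use source-generated regular expressions instead of compiling regular expressions with the RegexOptions.Compiled option. Source generation can help your app start faster, run faster, and be more trimmable. To learn when source generation is possible, see the section on when to use it later in this article.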

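Compiled regular expressions

When you write new Regex("somepattern"), a few things happen. The specified pattern is parsed, both to ensure that it's valid and to transform it into an internal tree that represents the parsed regex. The tree is then optimized in various ways, transforming the pattern into a functionally equivalent variation that can be executed more efficiently. The tree is written into a form that can be interpreted as a series of opcodes and operands, which give the regex interpreter engine instructions on how to match. When a match is performed, the interpreter simply walks through those instructions and processes them against the input text. When you instantiate a new Regex instance or call one of the static methods on Regex, the interpreter is the default engine.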
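As a minimal sketch of the default path described above, the following hypothetical helper class constructs a Regex without RegexOptions.Compiled and also calls a static Regex method. The class and member names are illustrative and don't come from the original article. Both calls are served by the interpreter.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper type; the names are illustrative only.
public static class InterpreterSketch
{
    // No RegexOptions.Compiled: the pattern is parsed, optimized, and then
    // executed by the interpreter engine.
    private static readonly Regex s_abcOrDef =
        new Regex("abc|def", RegexOptions.IgnoreCase);

    public static bool MatchesWithInstance(string text) =>
        s_abcOrDef.IsMatch(text);

    // Static Regex methods also use the interpreter unless
    // RegexOptions.Compiled is passed.
    public static bool MatchesWithStaticMethod(string text) =>
        Regex.IsMatch(text, "abc|def", RegexOptions.IgnoreCase);
}
```

Storing the instance in a static field avoids repeating the parse and optimization work on every call.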

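When you specify RegexOptions.Compiled, all of the same construction-time work is performed. The resulting instructions are then transformed further by the reflection-emit-based compiler into IL instructions that are written to a few DynamicMethod objects. When a match is performed, those DynamicMethod methods are invoked. This IL essentially does exactly what the interpreter would do, except that it's specialized for the exact pattern being processed. For example, if the pattern contains [ac], the interpreter sees an opcode that says "match the input character at the current position against the set specified in this set description." The compiled IL instead contains code that effectively says "match the input character at the current position against 'a' or 'c'." This special-casing, and the ability to perform optimizations based on knowledge of the pattern, are some of the main reasons that RegexOptions.Compiled yields much faster matching throughput than the interpreter does.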
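The difference between the two approaches for [ac] can be pictured with a hand-written sketch. This is only an illustration of the idea. It isn't code from the regex engine, and the type and method names are made up for this example.

```csharp
// Hand-written illustration only; this is not code taken from the regex engine.
public static class SetMatchSketch
{
    // Interpreter-style: one generic routine that consults a set description
    // supplied at run time (modeled here as a string listing the set's characters).
    public static bool MatchesSet(char ch, string setChars) =>
        setChars.Contains(ch);

    // Compiled-style: code specialized for the pattern [ac], with the set's
    // members baked in as constants.
    public static bool MatchesAOrC(char ch) =>
        ch == 'a' || ch == 'c';
}
```

The specialized version has no set description to look up, which is the kind of saving that pattern-specific code makes possible.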

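RegexOptions.Compiled has several downsides. The most impactful is that it's costly to construct. Not only are all of the same costs paid as for the interpreter, but the resulting RegexNode tree and the generated opcodes and operands must also be compiled into IL, which adds non-trivial expense. The generated IL also needs to be JIT-compiled on first use, which adds even more expense at startup. RegexOptions.Compiled represents a fundamental tradeoff between overhead on first use and overhead on every subsequent use. The use of System.Reflection.Emit also inhibits the use of RegexOptions.Compiled in certain environments: some operating systems don't permit dynamically generated code to be executed, and on such systems Compiled becomes a no-op.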

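Source generation

.NET 7 introduced a new RegexGenerator source generator. A source generator is a component that plugs into the compiler and augments the compilation unit with additional source code. The .NET SDK includes a source generator that recognizes the GeneratedRegexAttribute attribute on a partial method that returns Regex. Starting in .NET 9, the attribute can also be applied to partial properties. The source generator provides an implementation of that method or property that contains all the logic for the Regex. For example, you might previously have written code like this: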

private static readonly Regex s_abcOrDefGeneratedRegex =
    new(pattern: "abc|def",
        options: RegexOptions.Compiled | RegexOptions.IgnoreCase);

private static void EvaluateText(string text)
{
    if (s_abcOrDefGeneratedRegex.IsMatch(text))
    {
        // Take action with matching text
    }
}

To use the source generator, rewrite the previous code as follows:

[GeneratedRegex("abc|def", RegexOptions.IgnoreCase, "en-US")]
private static partial Regex AbcOrDefGeneratedRegex();

private static void EvaluateText(string text)
{
    if (AbcOrDefGeneratedRegex().IsMatch(text))
    {
        // Take action with matching text
    }
}

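Starting in .NET 9, you can also apply GeneratedRegexAttribute to a partial property instead of a partial method. Support for partial properties in C# 13 makes this possible. The following example shows the equivalent property: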

[GeneratedRegex("abc|def", RegexOptions.IgnoreCase, "en-US")]
private static partial Regex AbcOrDefGeneratedRegexProperty { get; }

private static void EvaluateText(string text)
{
    if (AbcOrDefGeneratedRegexProperty.IsMatch(text))
    {
        // Take action with matching text
    }
}

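Tip

The source generator ignores the RegexOptions.Compiled flag, so it's no longer needed in the source-generated version.

The generated implementation of AbcOrDefGeneratedRegex() similarly caches a singleton Regex instance, so consuming code doesn't need any additional caching.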
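The caching behavior can be observed with a small sketch. The containing type and the ReturnsSameInstance and IsAbcOrDef helpers are hypothetical names added for illustration. The sketch assumes a project that targets .NET 7 or later so that the source generator runs.

```csharp
using System.Text.RegularExpressions;

// Hypothetical containing type; GeneratedRegex requires a partial type and
// a project that targets .NET 7 or later.
public static partial class CachedRegexSketch
{
    [GeneratedRegex("abc|def", RegexOptions.IgnoreCase, "en-US")]
    private static partial Regex AbcOrDefGeneratedRegex();

    // Calls the generated method twice and compares the returned references.
    public static bool ReturnsSameInstance() =>
        object.ReferenceEquals(AbcOrDefGeneratedRegex(), AbcOrDefGeneratedRegex());

    // No static field is needed here; the generated method is called directly.
    public static bool IsAbcOrDef(string text) =>
        AbcOrDefGeneratedRegex().IsMatch(text);
}
```

Because the generated method hands back the cached instance, calling it at each use site is fine.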

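The following screenshot shows the cached instance that the source generator emits, which is internal to the generated Regex subclass:

[Screenshot: cached regex static field]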

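The source generator isn't just doing new Regex(...). Instead, it emits a custom Regex-derived implementation as C# code, with logic similar to what RegexOptions.Compiled emits in IL. You get all the throughput benefits of RegexOptions.Compiled (more, in fact) and the startup benefits of Regex.CompileToAssembly, but without the complexity of CompileToAssembly. The emitted source is part of your project, which means it's also easy to view and debug.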

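Tip

In Visual Studio, right-click the partial method or property declaration and select Go To Definition. Alternatively, select the project node in Solution Explorer, and then expand Dependencies > Analyzers > System.Text.RegularExpressions.Generator > System.Text.RegularExpressions.Generator.RegexGenerator > RegexGenerator.g.cs to see the C# code that the regex generator produced.

You can set breakpoints in it, step through it, and use it as a learning tool to understand exactly how the regex engine processes your pattern against your input. The generator even emits triple-slash (XML) comments to help make the expression understandable at a glance and to show where it's used.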

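Inside the source-generated files

In .NET 7, both the source generator and RegexCompiler were almost entirely rewritten, which fundamentally changed the structure of the generated code. The new approach has been extended to handle all constructs (with one caveat), and RegexCompiler and the source generator still map mostly 1:1 to each other. Consider the source generator output for one of the primary functions of the abc|def expression: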

private bool TryMatchAtCurrentPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    int matchStart = pos;
    ReadOnlySpan<char> slice = inputSpan.Slice(pos);

    // Match with 2 alternative expressions, atomically.
    {
        if (slice.IsEmpty)
        {
            return false; // The input didn't match.
        }

        switch (slice[0])
        {
            case 'A' or 'a':
                if ((uint)slice.Length < 3 ||
                    !slice.Slice(1).StartsWith("bc", StringComparison.OrdinalIgnoreCase)) // Match the string "bc" (ordinal case-insensitive)
                {
                    return false; // The input didn't match.
                }

                pos += 3;
                slice = inputSpan.Slice(pos);
                break;

            case 'D' or 'd':
                if ((uint)slice.Length < 3 ||
                    !slice.Slice(1).StartsWith("ef", StringComparison.OrdinalIgnoreCase)) // Match the string "ef" (ordinal case-insensitive)
                {
                    return false; // The input didn't match.
                }

                pos += 3;
                slice = inputSpan.Slice(pos);
                break;

            default:
                return false; // The input didn't match.
        }
    }

    // The input matched.
    base.runtextpos = pos;
    base.Capture(0, matchStart, pos);
    return true;
}

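The goal is for the source-generated code to be understandable: it has an easy-to-follow structure and comments that explain what each step does, and in general it follows the guiding principle that the generator should emit code as if a human had written it. Even when backtracking is involved, the structure of the backtracking becomes part of the structure of the code, rather than relying on a stack to indicate where to jump next. For example, here's the code for the same generated matching function when the expression is [ab]*[bc] (the case-insensitive sets in the generated comments indicate that RegexOptions.IgnoreCase is in effect):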

private bool TryMatchAtCurrentPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    int matchStart = pos;
    int charloop_starting_pos = 0, charloop_ending_pos = 0;
    ReadOnlySpan<char> slice = inputSpan.Slice(pos);

    // Match a character in the set [ABab] greedily any number of times.
    //{
        charloop_starting_pos = pos;

        int iteration = slice.IndexOfAnyExcept(Utilities.s_ascii_600000006000000);
        if (iteration < 0)
        {
            iteration = slice.Length;
        }

        slice = slice.Slice(iteration);
        pos += iteration;

        charloop_ending_pos = pos;
        goto CharLoopEnd;

        CharLoopBacktrack:

        if (Utilities.s_hasTimeout)
        {
            base.CheckTimeout();
        }

        if (charloop_starting_pos >= charloop_ending_pos ||
            (charloop_ending_pos = inputSpan.Slice(charloop_starting_pos, charloop_ending_pos - charloop_starting_pos).LastIndexOfAny(Utilities.s_ascii_C0000000C000000)) < 0)
        {
            return false; // The input didn't match.
        }
        charloop_ending_pos += charloop_starting_pos;
        pos = charloop_ending_pos;
        slice = inputSpan.Slice(pos);

        CharLoopEnd:
    //}

    // Advance the next matching position.
    if (base.runtextpos < pos)
    {
        base.runtextpos = pos;
    }

    // Match a character in the set [BCbc].
    if (slice.IsEmpty || ((uint)((slice[0] | 0x20) - 'b') > (uint)('c' - 'b')))
    {
        goto CharLoopBacktrack;
    }

    // The input matched.
    pos++;
    base.runtextpos = pos;
    base.Capture(0, matchStart, pos);
    return true;
}

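You can see the structure of the backtracking in the code: a CharLoopBacktrack label is emitted for the location to backtrack to, and a goto jumps to that location when a later part of the regex fails.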
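The same control flow can be written by hand in a simplified form. The following sketch is not generator output, and its type and method names are hypothetical. It only matches at the start of the input, and it gives back one character at a time, whereas the generated code uses LastIndexOfAny to jump straight to the last position that can match. The set checks reuse the range-check idiom from the generated code.

```csharp
using System;

// Hand-written, simplified illustration of the control flow in the generated
// code. It is not generator output; the names are hypothetical.
public static class BacktrackingSketch
{
    // Tries to match [ab]*[bc] (ignoring case) at the start of the input.
    // Returns the length of the match, or -1 if there is no match.
    public static int MatchLength(ReadOnlySpan<char> input)
    {
        // Greedy loop: consume as many characters from [ABab] as possible.
        int loopEnd = 0;
        while (loopEnd < input.Length && IsInSetAB(input[loopEnd]))
        {
            loopEnd++;
        }

        // Try the rest of the pattern; on failure, give back one character
        // from the loop and retry. This plays the role of CharLoopBacktrack.
        for (int pos = loopEnd; pos >= 0; pos--)
        {
            if (pos < input.Length && IsInSetBC(input[pos]))
            {
                return pos + 1; // The input matched.
            }
        }

        return -1; // The input didn't match.
    }

    // (ch | 0x20) folds ASCII uppercase letters to lowercase, and the unsigned
    // comparison checks both ends of the range with a single test.
    private static bool IsInSetAB(char ch) =>
        (uint)((ch | 0x20) - 'a') <= (uint)('b' - 'a');

    private static bool IsInSetBC(char ch) =>
        (uint)((ch | 0x20) - 'b') <= (uint)('c' - 'b');
}
```

For the input "aab", the greedy loop consumes all three characters, the check for [bc] fails at the end of the input, and the match succeeds only after the loop gives back the final 'b'.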

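If you look at the code that implements RegexCompiler and the source generator, they look extremely similar: similarly named methods, a similar call structure, and even similar comments throughout the implementation. For the most part, they produce identical code, albeit one in IL and one in C#. Of course, the C# compiler is then responsible for translating the C# into IL, so the resulting IL in the two cases likely isn't identical. The source generator relies on the C# compiler to further optimize various C# constructs, so in a few specific cases it produces more optimized matching code than RegexCompiler does.

For example, in the abc|def example earlier, the source generator emits a switch statement with one branch for 'a' and another branch for 'd'. Because the C# compiler is very good at optimizing switch statements and has multiple strategies for doing so efficiently, the source generator has a special optimization that RegexCompiler doesn't. For alternations, the source generator looks at all of the branches, and if it can prove that every branch begins with a different starting character, it emits a switch statement over that first character and avoids emitting any backtracking code for that alternation. The following generated code, for a case-insensitive alternation over the days of the week, shows both outcomes: the outer alternation is tried branch by branch, and the nested alternation in branch 5 is a switch.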

private bool TryMatchAtCurrentPosition(ReadOnlySpan<char> inputSpan)
{
    int pos = base.runtextpos;
    int matchStart = pos;
    char ch;
    ReadOnlySpan<char> slice = inputSpan.Slice(pos);

    // Match with 6 alternative expressions, atomically.
    {
        int alternation_starting_pos = pos;

        // Branch 0
        {
            if ((uint)slice.Length < 6 ||
                !slice.StartsWith("monday", StringComparison.OrdinalIgnoreCase)) // Match the string "monday" (ordinal case-insensitive)
            {
                goto AlternationBranch;
            }

            pos += 6;
            slice = inputSpan.Slice(pos);
            goto AlternationMatch;

            AlternationBranch:
            pos = alternation_starting_pos;
            slice = inputSpan.Slice(pos);
        }

        // Branch 1
        {
            if ((uint)slice.Length < 7 ||
                !slice.StartsWith("tuesday", StringComparison.OrdinalIgnoreCase)) // Match the string "tuesday" (ordinal case-insensitive)
            {
                goto AlternationBranch1;
            }

            pos += 7;
            slice = inputSpan.Slice(pos);
            goto AlternationMatch;

            AlternationBranch1:
            pos = alternation_starting_pos;
            slice = inputSpan.Slice(pos);
        }

        // Branch 2
        {
            if ((uint)slice.Length < 9 ||
                !slice.StartsWith("wednesday", StringComparison.OrdinalIgnoreCase)) // Match the string "wednesday" (ordinal case-insensitive)
            {
                goto AlternationBranch2;
            }

            pos += 9;
            slice = inputSpan.Slice(pos);
            goto AlternationMatch;

            AlternationBranch2:
            pos = alternation_starting_pos;
            slice = inputSpan.Slice(pos);
        }

        // Branch 3
        {
            if ((uint)slice.Length < 8 ||
                !slice.StartsWith("thursday", StringComparison.OrdinalIgnoreCase)) // Match the string "thursday" (ordinal case-insensitive)
            {
                goto AlternationBranch3;
            }

            pos += 8;
            slice = inputSpan.Slice(pos);
            goto AlternationMatch;

            AlternationBranch3:
            pos = alternation_starting_pos;
            slice = inputSpan.Slice(pos);
        }

        // Branch 4
        {
            if ((uint)slice.Length < 6 ||
                !slice.StartsWith("fr", StringComparison.OrdinalIgnoreCase) || // Match the string "fr" (ordinal case-insensitive)
                ((((ch = slice[2]) | 0x20) != 'i') & (ch != 'İ')) || // Match a character in the set [Ii\u0130].
                !slice.Slice(3).StartsWith("day", StringComparison.OrdinalIgnoreCase)) // Match the string "day" (ordinal case-insensitive)
            {
                goto AlternationBranch4;
            }

            pos += 6;
            slice = inputSpan.Slice(pos);
            goto AlternationMatch;

            AlternationBranch4:
            pos = alternation_starting_pos;
            slice = inputSpan.Slice(pos);
        }

        // Branch 5
        {
            // Match a character in the set [Ss].
            if (slice.IsEmpty || ((slice[0] | 0x20) != 's'))
            {
                return false; // The input didn't match.
            }

            // Match with 2 alternative expressions, atomically.
            {
                if ((uint)slice.Length < 2)
                {
                    return false; // The input didn't match.
                }

                switch (slice[1])
                {
                    case 'A' or 'a':
                        if ((uint)slice.Length < 8 ||
                            !slice.Slice(2).StartsWith("turday", StringComparison.OrdinalIgnoreCase)) // Match the string "turday" (ordinal case-insensitive)
                        {
                            return false; // The input didn't match.
                        }

                        pos += 8;
                        slice = inputSpan.Slice(pos);
                        break;

                    case 'U' or 'u':
                        if ((uint)slice.Length < 6 ||
                            !slice.Slice(2).StartsWith("nday", StringComparison.OrdinalIgnoreCase)) // Match the string "nday" (ordinal case-insensitive)
                        {
                            return false; // The input didn't match.
                        }

                        pos += 6;
                        slice = inputSpan.Slice(pos);
                        break;

                    default:
                        return false; // The input didn't match.
                }
            }

        }

        AlternationMatch:;
    }

    // The input matched.
    base.runtextpos = pos;
    base.Capture(0, matchStart, pos);
    return true;
}

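At the same time, the source generator has other issues to contend with that simply don't exist when emitting IL directly. If you look back at the [ab]*[bc] example, you can see some braces that are somewhat strangely commented out. That's not a mistake. The source generator recognizes that, if those braces weren't commented out, the backtracking structure would rely on jumping from outside a scope to a label defined inside that scope. Such a label isn't visible to a goto outside the scope, so the code would fail to compile. The source generator therefore needs to avoid having a scope in the way. In some cases, it simply comments out the scope, as was done here. In other cases where that's not possible, it sometimes avoids constructs that require scopes (such as a multi-statement if block) when they would be problematic.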
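The scoping rule can be demonstrated with a small hand-written method. This is an illustration only, not generator output, and the type and method names are hypothetical. The braces are left commented out in the same way as in the generated code.

```csharp
// Hand-written illustration of the scoping rule; not generator output.
public static class GotoScopeSketch
{
    // Counts down from n to zero and returns the number of steps taken.
    public static int CountSteps(int n)
    {
        int steps = 0;

        //{  // If these braces were real, the label Again would be scoped to
             // the block, and the goto after the block could not target it.
            goto Check;

            Again:
            steps++;
            n--;
        //}

        Check:
        if (n > 0)
        {
            // Jumping out of this if block to a label in the enclosing
            // method-level scope is allowed.
            goto Again;
        }

        return steps;
    }
}
```

Uncommenting the braces would turn the jump to Again into a jump into a nested block, which the C# compiler rejects.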

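The source generator handles everything that RegexCompiler handles, with one exception. RegexOptions.IgnoreCase is now handled by using a casing table to generate sets at construction time, and IgnoreCase backreference matching needs to consult that casing table. The table is internal to System.Text.RegularExpressions.dll, and, at least for now, code outside that assembly (including code emitted by the source generator) doesn't have access to it. That makes IgnoreCase backreferences a challenge for the source generator, and they aren't supported. This is the one construct that RegexCompiler supports and the source generator doesn't. If you try to use a pattern that has one of these (which is rare), the source generator doesn't emit a custom implementation and instead falls back to caching a regular Regex instance.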

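In addition, neither RegexCompiler nor the source generator supports the new RegexOptions.NonBacktracking option. If you specify RegexOptions.Compiled | RegexOptions.NonBacktracking, the Compiled flag is simply ignored. If you specify NonBacktracking to the source generator, it similarly falls back to caching a regular Regex instance.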
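Both fallback cases can be sketched as follows. The containing type, the method names, and the patterns are hypothetical choices for illustration. The attribute is still accepted in both cases and the methods match as expected; what changes is that the generated implementation wraps a regular Regex instance rather than custom matching code.

```csharp
using System.Text.RegularExpressions;

// Hypothetical containing type; the member names are illustrative only.
public static partial class FallbackSketch
{
    // A backreference (\1) combined with IgnoreCase: no custom implementation
    // is generated, and a regular Regex instance is cached instead.
    [GeneratedRegex(@"(\w)\1", RegexOptions.IgnoreCase)]
    private static partial Regex RepeatedCharIgnoreCase();

    // NonBacktracking: the generator likewise falls back to caching a
    // regular Regex instance.
    [GeneratedRegex("abc|def", RegexOptions.NonBacktracking)]
    private static partial Regex AbcOrDefNonBacktracking();

    public static bool HasRepeatedChar(string text) =>
        RepeatedCharIgnoreCase().IsMatch(text);

    public static bool HasAbcOrDef(string text) =>
        AbcOrDefNonBacktracking().IsMatch(text);
}
```

Callers don't need to change anything when a pattern hits one of these fallbacks, because the partial method signature stays the same.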

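When to use it

The general guidance is that if you can use the source generator, use it. If you're using Regex today in C# with arguments known at compile time, and especially if you're already using RegexOptions.Compiled (because the regex has been identified as a hot spot that would benefit from faster throughput), you should prefer the source generator. The source generator gives your regex the following benefits: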

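All the throughput benefits of RegexOptions.Compiled.
The startup benefits of not having to do all the regex parsing, analysis, and compilation at run time.
The option of using ahead-of-time compilation with the code generated for the regex.
Better debuggability and understanding of the regex.
The possibility of reducing the size of your trimmed app by trimming out large swaths of code associated with RegexCompiler (and potentially even reflection emit itself).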

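When it's used with an option such as RegexOptions.NonBacktracking, for which the source generator can't generate a custom implementation, it still emits caching and XML comments that describe the implementation, which makes it valuable. The main downside of the source generator is that it emits additional code into your assembly, so the assembly size might increase. The more regexes your app has, and the larger they are, the more code is emitted for them. In some situations, just as RegexOptions.Compiled might be unnecessary, so too might the source generator. For example, if you have a regex that's needed only rarely and for which throughput doesn't matter, it can be more beneficial to rely on the interpreter for that sporadic usage.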

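Important

.NET 7 includes an analyzer that identifies uses of Regex that could be converted to the source generator, and a fixer that does the conversion for you.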

Last updated on 2026-02-24
