description: Learn how to use regular expression source generators to optimize the performance of matching algorithms in .NET.
ms.topic: conceptual
-
ms.date: 10/12/2022
+
ms.date: 10/13/2022
author: IEvangelist
ms.author: dapine
---
@@ -13,16 +13,12 @@ A regular expression, or regex, is a string that enables a developer to express
## Compiled regular expressions
-
When you write `new Regex("somepattern")`, a few things happen. The specified pattern is parsed, both to ensure the validity of the pattern and to transform it into an internal `RegexNode` tree that represents the parsed regex. The tree is then optimized in various ways, transforming the pattern into a functionally equivalent variation that can be more efficiently executed. The tree is written into a form that can be interpreted as a series of opcodes and operands that provide instructions to the `RegexInterpreter` engine on how to match. When a match is performed, the interpreter simply walks through those instructions, processing them against the input text. When instantiating a new `Regex` instance or calling one of the static methods on `Regex`, the interpreter is the default engine employed.
+
When you write `new Regex("somepattern")`, a few things happen. The specified pattern is parsed, both to ensure the validity of the pattern and to transform it into an internal tree that represents the parsed regex. The tree is then optimized in various ways, transforming the pattern into a functionally equivalent variation that can be more efficiently executed. The tree is written into a form that can be interpreted as a series of opcodes and operands that provide instructions to the regex interpreter engine on how to match. When a match is performed, the interpreter simply walks through those instructions, processing them against the input text. When instantiating a new `Regex` instance or calling one of the static methods on `Regex`, the interpreter is the default engine employed.
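As a minimal sketch of the default path described above, the following snippet constructs a `Regex` without any options and also calls a static method. The input string is an assumption chosen only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Constructing an instance parses and optimizes the pattern once.
// No options are specified, so the interpreter engine performs the match.
var regex = new Regex("somepattern");
bool instanceMatch = regex.IsMatch("text containing somepattern");

// The static methods on Regex also use the interpreter by default.
bool staticMatch = Regex.IsMatch("text containing somepattern", "somepattern");

Console.WriteLine($"{instanceMatch} {staticMatch}"); // prints "True True"
```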
-
When you specify <xref:System.Text.RegularExpressions.RegexOptions.Compiled?displayProperty=nameWithType>, before .NET 7, all of the same construction-time work would be performed. The resulting instructions would be transformed further by the reflection-emit-based compiler into IL instructions that would be written to a few <xref:System.Reflection.Emit.DynamicMethod>s. When a match was performed, those `DynamicMethod`s would be invoked. This IL would essentially do exactly what the interpreter would do, except specialized for the exact pattern being processed. For example, if the pattern contained `[ac]`, the interpreter would see an opcode that said "match the input character at the current position against the set specified in this set description" whereas the compiled IL would contain code that effectively said, "match the input character at the current position against `'a'` or `'c'`". This special casing and the ability to perform optimizations based on knowledge of the pattern are some of the main reasons that specifying `RegexOptions.Compiled` yields much higher matching throughput than the interpreter.
+
When you specify <xref:System.Text.RegularExpressions.RegexOptions.Compiled?displayProperty=nameWithType>, all of the same construction-time work is performed. The resulting instructions are then transformed by the reflection-emit-based compiler into IL instructions that are written to a few <xref:System.Reflection.Emit.DynamicMethod>s. When a match is performed, those `DynamicMethod`s are invoked. This IL essentially does exactly what the interpreter would do, except that it's specialized for the exact pattern being processed. For example, if the pattern contains `[ac]`, the interpreter sees an opcode that says "match the input character at the current position against the set specified in this set description" whereas the compiled IL contains code that effectively says, "match the input character at the current position against `'a'` or `'c'`". This special casing and the ability to perform optimizations based on knowledge of the pattern are some of the main reasons that specifying `RegexOptions.Compiled` yields much higher matching throughput than the interpreter.
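To make the `[ac]` example concrete, here is a small sketch that builds the same pattern with and without `RegexOptions.Compiled`. The two engines are expected to agree on every result; only the way the match is executed differs. The variable names and inputs are illustrative assumptions.

```csharp
using System;
using System.Text.RegularExpressions;

// Same pattern, two engines: interpreted (default) and compiled to IL.
var interpreted = new Regex("[ac]");
var compiled = new Regex("[ac]", RegexOptions.Compiled);

foreach (string input in new[] { "a", "b", "c" })
{
    // Both engines implement the same semantics, so the answers agree.
    Console.WriteLine($"{input}: {interpreted.IsMatch(input)} {compiled.IsMatch(input)}");
}
// prints:
// a: True True
// b: False False
// c: True True
```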
There are several downsides to `RegexOptions.Compiled`. The most impactful is that it incurs much more construction cost than using the interpreter. Not only are all of the same costs paid as for the interpreter, but the resulting `RegexNode` tree and generated opcodes/operands must then be compiled into IL, which adds non-trivial expense. The generated IL also needs to be JIT-compiled on first use, leading to even more expense at startup. `RegexOptions.Compiled` represents a fundamental tradeoff between overhead on first use and overhead on every subsequent use. The use of <xref:System.Reflection.Emit?displayProperty=nameWithType> also inhibits the use of `RegexOptions.Compiled` in certain environments; some operating systems don't permit dynamically generated code to be executed, and on such systems, `Compiled` becomes a no-op.
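One rough way to observe this tradeoff is to time construction plus the first match for each engine. The sketch below uses a hypothetical pattern and input, and a single `Stopwatch` measurement is noisy, so treat the numbers as indicative only; no particular result is claimed here.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

// Hypothetical pattern and input, chosen only for illustration.
const string Pattern = @"\b\w+@\w+\.\w+\b";
const string Input = "contact: someone@example.com";

// Measures construction plus the first match, which is where
// RegexOptions.Compiled pays its IL-generation and JIT costs.
TimeSpan TimeFirstUse(RegexOptions options)
{
    var stopwatch = Stopwatch.StartNew();
    var regex = new Regex(Pattern, options);
    regex.IsMatch(Input);
    stopwatch.Stop();
    return stopwatch.Elapsed;
}

Console.WriteLine($"Interpreter first use: {TimeFirstUse(RegexOptions.None).TotalMilliseconds} ms");
Console.WriteLine($"Compiled first use:    {TimeFirstUse(RegexOptions.Compiled).TotalMilliseconds} ms");
```

For a fairer comparison of steady-state throughput, a benchmarking harness that repeats the match many times would be more appropriate than a single timing.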
-
To help with these issues, .NET provides a method <xref:System.Text.RegularExpressions.Regex.CompileToAssembly%2A?displayProperty=nameWithType>. This method enables the same IL that would have been generated for `RegexOptions.Compiled` to instead be written to a generated assembly on disk, and that assembly can then be referenced as a library from your app. This has the benefit of avoiding the startup overheads involved in parsing, optimizing, and outputting the IL for the expression, as that can all be done ahead of time rather than each time the app is invoked. Further, that assembly could be ahead-of-time compiled with technology like ngen or crossgen, avoiding most of the associated JIT costs as well.
-
`Regex.CompileToAssembly` itself has problems, however. First, it's not user-friendly. Because a utility was required to call `CompileToAssembly` to produce an assembly your app would reference, there's relatively little use for this otherwise valuable feature. On .NET [Core], `CompileToAssembly` has never been supported, as it requires the ability to save reflection-emit code to assemblies on disk, which also isn't supported. This is where source generation becomes valuable.
## Source generation
.NET 7 introduces a new `RegexGenerator` source generator. When the C# compiler was rewritten as the ["Roslyn" C# compiler](../../csharp/roslyn-sdk/index.md), it exposed object models for the entire compilation pipeline, as well as analyzers. More recently, Roslyn enabled source generators. Just like an analyzer, a source generator is a component that plugs into the compiler and is handed all of the same information as an analyzer, but in addition to being able to emit diagnostics, it can also augment the compilation unit with additional source code. The .NET 7 SDK includes a new source generator that recognizes the new <xref:System.Text.RegularExpressions.GeneratedRegexAttribute> on a partial method that returns `Regex`. The source generator provides an implementation of that method that implements all the logic for the `Regex`. For example, you might have written code like this:
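A hypothetical illustration of that kind of code, together with a source-generated counterpart, might look like the following sketch. The type, field, and method names and the pattern are assumptions chosen for illustration; they aren't taken from this diff. It assumes a project that targets .NET 7 or later, where the generator ships with the SDK.

```csharp
using System;
using System.Text.RegularExpressions;

Console.WriteLine(Patterns.AbcOrDefCompiled.IsMatch("xyzDEF"));    // prints "True"
Console.WriteLine(Patterns.AbcOrDefGenerated().IsMatch("xyzDEF")); // prints "True"

// Hypothetical container type. It must be partial so that the
// source generator can supply the body of the partial method.
internal static partial class Patterns
{
    // Traditional form: parsed, optimized, and compiled to IL at run time.
    internal static readonly Regex AbcOrDefCompiled =
        new("abc|def", RegexOptions.Compiled | RegexOptions.IgnoreCase);

    // Source-generated form: the implementation is emitted as C# at build time.
    [GeneratedRegex("abc|def", RegexOptions.IgnoreCase)]
    internal static partial Regex AbcOrDefGenerated();
}
```

Both members expose an ordinary `Regex`, so call sites don't change apart from invoking a method instead of reading a field.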
@@ -67,7 +61,7 @@ But as can be seen, it's not just doing `new Regex(...)`. Rather, the source gen
:::image type="content" source="media/regular-expression-source-generators/debuggable-source.png" lightbox="media/regular-expression-source-generators/debuggable-source.png" alt-text="Debugging through source-generated Regex code":::
> [!TIP]
-
> In Visual Studio, select the project node in **Solution Explorer**, then expand **Dependencies** > **Analyzers** > **System.Text.RegularExpressions.Generator** > **System.Text.RegularExpressions.Generator.RegexGenerator** > _RegexGenerator.g.cs_ to see the generated C# code from this regex generator.
+
> In Visual Studio, right-click on your partial method declaration and select **Go To Definition**. Alternatively, select the project node in **Solution Explorer**, then expand **Dependencies** > **Analyzers** > **System.Text.RegularExpressions.Generator** > **System.Text.RegularExpressions.Generator.RegexGenerator** > _RegexGenerator.g.cs_ to see the generated C# code from this regex generator.
You can set breakpoints in the generated code, step through it, and use it as a learning tool to understand exactly how the regex engine processes your pattern against your input. The generator even emits [triple-slash (XML) comments](../../csharp/language-reference/xmldoc/index.md) to help make the expression understandable at a glance and wherever it's used.
@@ -398,3 +392,4 @@ When used with an option like `RegexOptions.NonBacktracking` for which the sourc
- [Compilation and Reuse in Regular Expressions](compilation-and-reuse-in-regular-expressions.md)