Commit a9310ba
Address peer feedback (dotnet#31758)
1 parent 59b86bf commit a9310ba

2 files changed: +9 -16 lines changed

docs/standard/base-types/regular-expression-source-generators.md

Lines changed: 6 additions & 11 deletions
@@ -2,7 +2,7 @@
 title: ".NET regular expression source generators"
 description: Learn how to use regular expression source generators to optimize the performance of matching algorithms in .NET.
 ms.topic: conceptual
-ms.date: 10/12/2022
+ms.date: 10/13/2022
 author: IEvangelist
 ms.author: dapine
 ---
@@ -13,16 +13,12 @@ A regular expression, or regex, is a string that enables a developer to express

 ## Compiled regular expressions

-When you write `new Regex("somepattern")`, a few things happen. The specified pattern is parsed, both to ensure the validity of the pattern and to transform it into an internal `RegexNode` tree that represents the parsed regex. The tree is then optimized in various ways, transforming the pattern into a functionally equivalent variation that can be more efficiently executed. The tree is written into a form that can be interpreted as a series of opcodes and operands that provide instructions to the `RegexInterpreter` engine on how to match. When a match is performed, the interpreter simply walks through those instructions, processing them against the input text. When instantiating a new `Regex` instance or calling one of the static methods on `Regex`, the interpreter is the default engine employed.
+When you write `new Regex("somepattern")`, a few things happen. The specified pattern is parsed, both to ensure the validity of the pattern and to transform it into an internal tree that represents the parsed regex. The tree is then optimized in various ways, transforming the pattern into a functionally equivalent variation that can be more efficiently executed. The tree is written into a form that can be interpreted as a series of opcodes and operands that provide instructions to the regex interpreter engine on how to match. When a match is performed, the interpreter simply walks through those instructions, processing them against the input text. When instantiating a new `Regex` instance or calling one of the static methods on `Regex`, the interpreter is the default engine employed.

-When you specify <xref:System.Text.RegularExpressions.RegexOptions.Compiled?displayProperty=nameWithType>, before .NET 7, all of the same construction-time work would be performed. The resulting instructions would be transformed further by the reflection-emit-based compiler into IL instructions that would be written to a few <xref:System.Reflection.Emit.DynamicMethod>s. When a match was performed, those `DynamicMethod`s would be invoked. This IL would essentially do exactly what the interpreter would do, except specialized for the exact pattern being processed. For example, if the pattern contained `[ac]`, the interpreter would see an opcode that said "match the input character at the current position against the set specified in this set description" whereas the compiled IL would contain code that effectively said, "match the input character at the current position against `'a'` or `'c'`". This special casing and the ability to perform optimizations based on knowledge of the pattern are some of the main reasons for specifying `RegexOptions.Compiled` yields much faster-matching throughput than does the interpreter.
+When you specify <xref:System.Text.RegularExpressions.RegexOptions.Compiled?displayProperty=nameWithType>, all of the same construction-time work would be performed. The resulting instructions would be transformed further by the reflection-emit-based compiler into IL instructions that would be written to a few <xref:System.Reflection.Emit.DynamicMethod>s. When a match was performed, those `DynamicMethod`s would be invoked. This IL would essentially do exactly what the interpreter would do, except specialized for the exact pattern being processed. For example, if the pattern contained `[ac]`, the interpreter would see an opcode that said "match the input character at the current position against the set specified in this set description" whereas the compiled IL would contain code that effectively said, "match the input character at the current position against `'a'` or `'c'`". This special casing and the ability to perform optimizations based on knowledge of the pattern are some of the main reasons for specifying `RegexOptions.Compiled` yields much faster-matching throughput than does the interpreter.

 There are several downsides to `RegexOptions.Compiled`. The most impactful is that it incurs much more construction cost than using the interpreter. Not only are all of the same costs paid as for the interpreter, but it then needs to compile that resulting `RegexNode` tree and generated opcodes/operands into IL, which adds non-trivial expense. The generated IL further needs to be JIT-compiled on first use leading to even more expense at startup. `RegexOptions.Compiled` represents a fundamental tradeoff between overheads on the first use and overheads on every subsequent use. The use of <xref:System.Reflection.Emit?displayProperty=nameWithType> also inhibits the use of `RegexOptions.Compiled` in certain environments; some operating systems don't permit dynamically generated code to be executed, and on such systems, `Compiled` will become a no-op.

-To help with these issues, .NET provides a method <xref:System.Text.RegularExpressions.Regex.CompileToAssembly%2A?displayProperty=nameWithType>. This method enables the same IL that would have been generated for `RegexOptions.Compiled` to instead be written to a generated assembly on disk, and that assembly can then be referenced as a library from your app. This has the benefit of avoiding the startup overheads involved in parsing, optimizing, and outputting the IL for the expression, as that can all be done ahead of time rather than each time the app is invoked. Further, that assembly could be ahead-of-time compiled with technology like ngen or crossgen, avoiding most of the associated JIT costs as well.
-
-`Regex.CompileToAssembly` itself has problems, however. First, it's not user-friendly. Because a utility was required to call `CompileToAssembly` to produce an assembly your app would reference, there's relatively little use for this otherwise valuable feature. On .NET [Core], `CompileToAssembly` has never been supported, as it requires the ability to save reflection-emit code to assemblies on disk, which also isn't supported. This is where source generation becomes valuable.
-
 ## Source generation

 .NET 7 introduces a new `RegexGenerator` source generator. When the C# compiler was rewritten as the ["Roslyn" C# compiler](../../csharp/roslyn-sdk/index.md), it exposed object models for the entire compilation pipeline, as well as analyzers. More recently, Roslyn enabled source generators. Just like an analyzer, a source generator is a component that plugs into the compiler and is handed all of the same information as an analyzer, but in addition to being able to emit diagnostics, it can also augment the compilation unit with additional source code. The .NET 7 SDK includes a new source generator that recognizes the new <xref:System.Text.RegularExpressions.GeneratedRegexAttribute> on a partial method that returns `Regex`. The source generator provides an implementation of that method that implements all the logic for the `Regex`. For example, you might have written code like this:
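
Reviewer note: the `[ac]` example in the hunk above contrasts a generic, data-driven set check with code specialized for one pattern. The following is a minimal conceptual sketch of that contrast, not the engine's actual implementation. `MatchSetGeneric` and `MatchAcSpecialized` are hypothetical helpers invented for illustration; only `Regex` and `RegexOptions.Compiled` are real APIs.

```csharp
using System;
using System.Text.RegularExpressions;

// Both engines should agree on results; only the execution strategy differs.
var interpreted = new Regex("[ac]");                      // default interpreter
var compiled = new Regex("[ac]", RegexOptions.Compiled);  // IL is emitted at construction time

foreach (char c in "abc")
{
    string s = c.ToString();
    bool generic = MatchSetGeneric(c, "ac");
    bool specialized = MatchAcSpecialized(c);
    bool viaInterpreter = interpreted.IsMatch(s);
    bool viaCompiled = compiled.IsMatch(s);

    Console.WriteLine(
        $"{c}: generic={generic}, specialized={specialized}, " +
        $"interpreted={viaInterpreter}, compiled={viaCompiled}");
}

// Hypothetical stand-in for the interpreter: one routine handles every set,
// driven by a description of the set that is passed in as data.
static bool MatchSetGeneric(char c, string setDescription) => setDescription.Contains(c);

// Hypothetical stand-in for compiled code: the set [ac] is baked into the logic,
// so no set description has to be consulted at match time.
static bool MatchAcSpecialized(char c) => c == 'a' || c == 'c';
```

All four checks should agree for every character. The specialized form trades generality for a cheaper per-character test, which is the trade the paragraph attributes to `RegexOptions.Compiled`.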
@@ -44,14 +40,12 @@ private static void EvaluateText(string text)
 You can now rewrite the previous code as follows:

 ```csharp
-private static readonly Regex s_abcOrDefGeneratedRegex = AbcOrDefGeneratedRegex();
-
 [GeneratedRegex("abc|def", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-US")]
 private static partial Regex AbcOrDefGeneratedRegex();

 private static void EvaluateText(string text)
 {
-    if (s_abcOrDefGeneratedRegex.IsMatch(text))
+    if (AbcOrDefGeneratedRegex().IsMatch(text))
     {
         // Take action with matching text
     }
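
Reviewer note: the hunk above drops the static field and calls the generated method at each use site. The following sketch compares the two call styles side by side. It assumes the .NET 7 SDK (so that the regex source generator runs) and assumes that the generated method hands back a reusable instance, which the diff itself does not state. The `Demo` type and its `MatchesViaField` and `MatchesDirect` members are hypothetical names used only for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

foreach (string text in new[] { "ABC", "xyz" })
{
    bool viaField = Demo.MatchesViaField(text);
    bool direct = Demo.MatchesDirect(text);
    Console.WriteLine($"{text}: field={viaField}, direct={direct}");
}

// Hypothetical container type; its name and members are illustrative only.
static partial class Demo
{
    // Same attribute arguments as in the hunk above.
    [GeneratedRegex("abc|def", RegexOptions.IgnoreCase | RegexOptions.Compiled, "en-US")]
    private static partial Regex AbcOrDefGeneratedRegex();

    // Style before the change: the result is stored in a static field.
    private static readonly Regex s_abcOrDefGeneratedRegex = AbcOrDefGeneratedRegex();

    internal static bool MatchesViaField(string text) =>
        s_abcOrDefGeneratedRegex.IsMatch(text);

    // Style after the change: the generated method is called where it is needed.
    internal static bool MatchesDirect(string text) =>
        AbcOrDefGeneratedRegex().IsMatch(text);
}
```

Both styles should report the same match results, so the field mainly adds a second name for the same regex. Whether the direct call is also free of per-call construction cost depends on the generator's implementation.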
@@ -67,7 +61,7 @@ But as can be seen, it's not just doing `new Regex(...)`. Rather, the source gen
 :::image type="content" source="media/regular-expression-source-generators/debuggable-source.png" lightbox="media/regular-expression-source-generators/debuggable-source.png" alt-text="Debugging through source-generated Regex code":::

 > [!TIP]
-> In Visual Studio, select the project node in **Solution Explorer**, then expand **Dependencies** > **Analyzers** > **System.Text.RegularExpressions.Generator** > **System.Text.RegularExpressions.Generator.RegexGenerator** > _RegexGenerator.g.cs_ to see the generated C# code from this regex generator.
+> In Visual Studio, right-click on your partial method declaration and select **Go To Definition**. Or, alternatively, select the project node in **Solution Explorer**, then expand **Dependencies** > **Analyzers** > **System.Text.RegularExpressions.Generator** > **System.Text.RegularExpressions.Generator.RegexGenerator** > _RegexGenerator.g.cs_ to see the generated C# code from this regex generator.

 You can set breakpoints in it, you can step through it, and you can use it as a learning tool to understand exactly how the regex engine is processing your pattern with your input. The generator even generates [triple-slash (XML) comments](../../csharp/language-reference/xmldoc/index.md) to help make the expression understandable at a glance and where it's used.

@@ -398,3 +392,4 @@ When used with an option like `RegexOptions.NonBacktracking` for which the sourc
 - [Compilation and Reuse in Regular Expressions](compilation-and-reuse-in-regular-expressions.md)
 - [Source Generators](../../csharp/roslyn-sdk/source-generators-overview.md)
 - [Tutorial: Debug a .NET console application using Visual Studio](../../core/tutorials/debugging-with-visual-studio.md)
+- [.NET Blog: Regular Expression improvements in .NET 7](https://devblogs.microsoft.com/dotnet/regular-expression-improvements-in-dotnet-7)

docs/standard/base-types/snippets/regular-expression-source-generators/Program.cs

Lines changed: 3 additions & 5 deletions
@@ -3,8 +3,6 @@

 static partial class Program
 {
-    private static readonly Regex s_abcOrDefGeneratedRegex = AbcOrDefGeneratedRegex();
-
     [GeneratedRegex(
         pattern: "abc|def",
         options: RegexOptions.IgnoreCase | RegexOptions.Compiled,
@@ -13,16 +11,16 @@ static partial class Program

     private static void EvaluateText(string text)
     {
-        if (s_abcOrDefGeneratedRegex.IsMatch(text))
+        if (AbcOrDefGeneratedRegex().IsMatch(text))
         {
             Console.WriteLine($"""
-                ✅ "{text}" matches "{s_abcOrDefGeneratedRegex}" pattern.
+                ✅ "{text}" matches "{AbcOrDefGeneratedRegex()}" pattern.
                 """);
         }
         else
         {
             Console.WriteLine($"""
-                ❌ "{text}" doesn't match "{s_abcOrDefGeneratedRegex}" pattern.
+                ❌ "{text}" doesn't match "{AbcOrDefGeneratedRegex()}" pattern.
                 """);
         }
     }
