Migrate RegExp to regonaut engine #678

Draft
auvred wants to merge 4 commits into dop251:master from auvred:use-regonaut-for-regexps

auvred commented Sep 13, 2025

This PR replaces both https://pkg.go.dev/regexp and https://github.com/dlclark/regexp2 usages with https://github.com/auvred/regonaut (an ES2025-compatible RegExp engine).
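As background for readers, the sketch below illustrates why a JavaScript runtime written in Go cannot rely on the standard regexp package alone: its RE2 syntax has no backreferences or lookahead, both of which ECMAScript patterns may use. The helper name `re2Supports` and the sample patterns are made up for this illustration and are not part of goja or regonaut.

```go
package main

import (
	"fmt"
	"regexp"
)

// re2Supports reports whether Go's standard regexp package (RE2 syntax)
// can compile the given pattern. Hypothetical helper for illustration.
func re2Supports(pattern string) bool {
	_, err := regexp.Compile(pattern)
	return err == nil
}

func main() {
	patterns := []string{
		`^[a-z]+$`,   // plain character class: accepted by RE2
		`(a)\1`,      // backreference: rejected by RE2
		`foo(?=bar)`, // lookahead: rejected by RE2
	}
	for _, p := range patterns {
		fmt.Printf("%-14q compiles with regexp: %v\n", p, re2Supports(p))
	}
}
```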

I've tried to keep the diff minimal for easier review. For now, this PR only replaces the RegExp engine and enables some TC39 tests; no new features are introduced yet.

If this PR is merged, I plan to submit a few follow-up PRs to add support for:

  • Unicode sets (v flag)
  • Named groups
  • Match Indices (d flag)

if limitValue != _undefined {
limit = int(toUint32(limitValue))

var lim int64
Author


The regexpproto_stdSplitter implementation is copied from the regexpproto_stdSplitterGeneric algorithm.

type Mode uint

const (
IgnoreRegExpErrors Mode = 1 << iota // Ignore RegExp compatibility errors (allow backtracking)
Author


IgnoreRegExpErrors was introduced in the initial commit (2577360). However, even back then it was only used in tests and nowhere else.

I'm not entirely sure about removing it completely, though. It has been part of the public API for 9 years. Nevertheless, I don't think anyone is actually relying on it in production code.

Owner


I don't think it would be a problem.

Owner

dop251 commented Sep 14, 2025

This looks very promising. When I looked into named groups, I realised regexp2 was not the right fit: it's a port of the .NET library, which has very limited ECMAScript compatibility, and making it fully compatible would require some fundamental changes. I even considered forking it at some point, but then I realised I didn't have enough time to make those changes and maintain the code.

I've had a quick look and my main concern so far is performance. Even running the very limited set of benchmarks from goja shows a significant degradation in both time and memory:

                                │ regexp_before.txt │           regexp_after.txt            │
                                │      sec/op       │    sec/op     vs base                 │
RegexpSplitWithBackRef-16               3.243µ ± 1%   24.268µ ± 1%  +648.45% (p=0.000 n=10)
RegexpMatch-16                          44.84µ ± 1%   280.40µ ± 1%  +525.40% (p=0.000 n=10)
RegexpMatchCache-16                     1.417m ± 1%    4.315m ± 1%  +204.40% (p=0.000 n=10)
RegexpMatchAll-16                       1.968m ± 1%    5.059m ± 1%  +157.05% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16            751.8n ± 1%   2196.0n ± 2%  +192.10% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16           1.104µ ± 1%    2.424µ ± 2%  +119.52% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16          1.383µ ± 1%    2.115µ ± 1%   +52.93% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16         1.141µ ± 2%    2.365µ ± 1%  +107.27% (p=0.000 n=10)
geomean                                 12.32µ         37.55µ       +204.77%

                                │ regexp_before.txt │           regexp_after.txt            │
                                │       B/op        │     B/op      vs base                 │
RegexpMatchCache-16                    1.336Mi ± 0%   8.958Mi ± 0%  +570.49% (p=0.000 n=10)
RegexpMatchAll-16                      1.991Mi ± 0%   9.611Mi ± 0%  +382.67% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16             759.0 ± 0%    1384.0 ± 0%   +82.35% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16          1.266Ki ± 0%   1.398Ki ± 0%   +10.49% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16           793.0 ± 0%    1304.0 ± 0%   +64.44% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16        1.273Ki ± 0%   1.320Ki ± 0%    +3.68% (p=0.000 n=10)
geomean                                11.71Ki        25.68Ki       +119.28%

                                │ regexp_before.txt │           regexp_after.txt           │
                                │     allocs/op     │  allocs/op   vs base                 │
RegexpMatchCache-16                     22.34k ± 0%   33.37k ± 0%   +49.39% (p=0.000 n=10)
RegexpMatchAll-16                       31.86k ± 0%   42.91k ± 0%   +34.66% (p=0.000 n=10)
RegexpSingleExec/Re-ASCII-16             11.00 ± 0%    22.00 ± 0%  +100.00% (p=0.000 n=10)
RegexpSingleExec/Re2-ASCII-16            18.00 ± 0%    24.00 ± 0%   +33.33% (p=0.000 n=10)
RegexpSingleExec/Re-Unicode-16           14.00 ± 0%    21.00 ± 0%   +50.00% (p=0.000 n=10)
RegexpSingleExec/Re2-Unicode-16          20.00 ± 0%    23.00 ± 0%   +15.00% (p=0.000 n=10)
geomean                                  184.5         267.4        +44.89%

Do you plan to do any performance improvements? The thing is most people are not even aware of the ECMAScript regular expression quirks, but they would notice if their ^[a-z]+$ suddenly ran slower...
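A minimal sketch of how the cost of such a simple pattern can be measured. It uses Go's standard regexp package only, not goja or regonaut, so the numbers it prints say nothing about either engine; the names `anchoredLower`, `benchMatch`, and `measure` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
	"testing"
)

// anchoredLower is the simple pattern mentioned above, compiled once with
// Go's standard regexp package (an assumption made for this sketch).
var anchoredLower = regexp.MustCompile(`^[a-z]+$`)

// benchMatch measures repeated matching against a short ASCII input.
func benchMatch(b *testing.B) {
	b.ReportAllocs()
	for i := 0; i < b.N; i++ {
		if !anchoredLower.MatchString("helloworld") {
			b.Fatal("expected a match")
		}
	}
}

// measure runs the benchmark outside of `go test` via testing.Benchmark.
func measure() testing.BenchmarkResult {
	return testing.Benchmark(benchMatch)
}

func main() {
	r := measure()
	fmt.Printf("%d ns/op, %d B/op, %d allocs/op\n",
		r.NsPerOp(), r.AllocedBytesPerOp(), r.AllocsPerOp())
}
```

In a real comparison the same benchmark would be written as a `BenchmarkXxx` function, run several times on each branch with `go test -bench`, and the two result files compared; the tables above look like output from that kind of comparison.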

Author

auvred commented Sep 16, 2025

> Do you plan to do any performance improvements? The thing is most people are not even aware of the ECMAScript regular expression quirks, but they would notice if their ^[a-z]+$ suddenly ran slower...

I've been planning to implement a second engine based on finite automata (alongside the existing backtracking engine) to improve performance. Unfortunately, implementing this new engine is quite time-consuming, and I'm currently short on time. I'll make this PR a draft for now and will come back to it once the finite automata engine is finished.
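To illustrate the general idea behind a finite automata engine (this is not regonaut's design, which is not shown in this thread), the sketch below simulates a small hand-built NFA for the made-up pattern `^(a|aa)+b$`. Instead of trying one alternative at a time and backtracking on failure, it tracks the set of all reachable states and consumes each input byte exactly once, so the running time stays linear in the input length. All names are hypothetical.

```go
package main

import "fmt"

// edge is one labelled NFA transition.
type edge struct {
	from, to int
	on       byte
}

// Hand-built NFA for ^(a|aa)+b$ (example pattern chosen for this sketch).
// State 0: start. State 1: at least one iteration of (a|aa) completed.
// State 2: in the middle of the "aa" alternative. State 3: accepting.
var edges = []edge{
	{0, 1, 'a'}, {0, 2, 'a'},
	{1, 1, 'a'}, {1, 2, 'a'},
	{2, 1, 'a'},
	{1, 3, 'b'},
}

const acceptState = 3

// matchNFA advances a bit set of active states one input byte at a time.
func matchNFA(input string) bool {
	current := uint8(1) << 0 // only the start state is active
	for i := 0; i < len(input); i++ {
		var next uint8
		for _, e := range edges {
			if current&(1<<e.from) != 0 && e.on == input[i] {
				next |= 1 << e.to
			}
		}
		if next == 0 {
			return false // no state survives: the input cannot match
		}
		current = next
	}
	return current&(1<<acceptState) != 0
}

func main() {
	for _, s := range []string{"ab", "aaab", "aaaa", "b"} {
		fmt.Printf("%q matches: %v\n", s, matchNFA(s))
	}
}
```

The trade-off is that features such as backreferences do not fit this model, which is one reason an engine might keep a backtracking implementation alongside it.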

auvred marked this pull request as draft on September 16, 2025, 11:13
Owner

dop251 commented Sep 24, 2025

Thanks. I'll add a couple of comments on the PR, as they would still apply...

utf16Reader() utf16Reader
utf16RuneReader() io.RuneReader
utf16Runes() []rune
toUnicode() unicodeString
Owner

dop251 Sep 24, 2025


The current design assumes that an ASCII string is always an asciiString, never unicodeString. There are a couple of optimisations based on this assumption (like the equality operator for example). Even though this is only used for regexp, having this method on the interface would tempt someone to misuse it at some point.

A better way would be either to have a utf16() []uint16 method instead, or to use devirtualizeString.

match, result := r.execRegexp(target)
if match {
return r.execResultToArray(target, result)
targetUtf16 := target.toUnicode()
Owner


It seems a little wasteful to convert every single string to unicode. A significant proportion of strings are ASCII, in some environments they are all ASCII...
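One way the conversion could be avoided, sketched under assumptions: the engine reads its input through a small code-unit view, and ASCII strings are wrapped as they are while only non-ASCII strings are encoded to UTF-16. The `codeUnits` interface and the `asciiUnits`, `wideUnits`, `isASCII`, and `viewOf` names are all hypothetical and do not come from goja or regonaut.

```go
package main

import (
	"fmt"
	"unicode/utf16"
	"unicode/utf8"
)

// codeUnits is a hypothetical read-only view that a matcher could scan.
type codeUnits interface {
	Len() int
	At(i int) uint16
}

// asciiUnits wraps an ASCII string; each byte is read as one code unit.
type asciiUnits string

func (s asciiUnits) Len() int        { return len(s) }
func (s asciiUnits) At(i int) uint16 { return uint16(s[i]) }

// wideUnits holds UTF-16 code units for non-ASCII input.
type wideUnits []uint16

func (s wideUnits) Len() int        { return len(s) }
func (s wideUnits) At(i int) uint16 { return s[i] }

// isASCII reports whether every byte of s is below utf8.RuneSelf (0x80).
func isASCII(s string) bool {
	for i := 0; i < len(s); i++ {
		if s[i] >= utf8.RuneSelf {
			return false
		}
	}
	return true
}

// viewOf encodes to UTF-16 only when the input is not pure ASCII.
func viewOf(s string) codeUnits {
	if isASCII(s) {
		return asciiUnits(s)
	}
	return wideUnits(utf16.Encode([]rune(s)))
}

func main() {
	for _, s := range []string{"hello", "héllo", "😀"} {
		v := viewOf(s)
		fmt.Printf("%q: %T with %d code units\n", s, v, v.Len())
	}
}
```

An interface call per code unit has its own cost, so a performance-sensitive engine might instead provide two specialised matching loops; the sketch only shows where the ASCII check would sit.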


zbysir commented Feb 14, 2026

@auvred Awesome! I've also encountered some regex compatibility issues, and it seems your library can solve this problem well. However, dop251 also raised performance concerns, so merging might be a bit challenging.

How about modifying the code to have both modes coexist: use Go build tags, keeping the existing code as the default, and allow using your code via go build -tags regonaut.

This way, we can proceed with the merge smoothly and offer your library as an experimental option for those who need it.
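A sketch of what the build-tag split could look like. The file names and the `regexpEngineName` constant are hypothetical; only the `//go:build` constraint syntax and the `-tags` flag are standard Go tooling. The block shows the default variant, which is compiled when the `regonaut` tag is absent.

```go
//go:build !regonaut

// File: regexp_engine_default.go (hypothetical name).
// Compiled by a plain `go build`, because the "regonaut" tag is not set.
// A sibling file, for example regexp_engine_regonaut.go, would start with
// `//go:build regonaut`, declare the same identifiers, and be selected by
// `go build -tags regonaut`.
package main

import "fmt"

// regexpEngineName is a hypothetical identifier; each variant file would
// define it exactly once so that the rest of the package compiles either way.
const regexpEngineName = "regexp+regexp2"

func main() {
	fmt.Println("regexp engine:", regexpEngineName)
}
```

Since exactly one of the two files is compiled for any given build, both must expose the same set of functions and types to the rest of the package.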

Author

auvred commented Feb 14, 2026

> How about modifying the code to have both modes coexist: use Go build tags, keeping the existing code as the default, and allow using your code via go build -tags regonaut.
>
> This way, we can proceed with the merge smoothly and offer your library as an experimental option for those who need it.

This sounds nice! If dop251 is OK with it, I can try modifying this PR to use the build tags approach.

P.S. I haven't forgotten about this PR, it's still in my TODO list, but I've been swamped with other projects. Introducing a new regexp engine is a pretty complex task and it requires a lot of time, which I currently don't have :( Also, I don't want to have an LLM implement it instead of me, because every single line of regonaut was written manually with strict adherence to the ECMAScript spec, and I don't want to violate that principle.


zbysir commented Feb 14, 2026

> > How about modifying the code to have both modes coexist: use Go build tags, keeping the existing code as the default, and allow using your code via go build -tags regonaut.
> > This way, we can proceed with the merge smoothly and offer your library as an experimental option for those who need it.
>
> This sounds nice! If dop251 is OK with it, I can try modifying this PR to use the build tags approach.
>
> P.S. I haven't forgotten about this PR, it's still in my TODO list, but I've been swamped with other projects. Introducing a new regexp engine is a pretty complex task and it requires a lot of time, which I currently don't have :( Also, I don't want to have an LLM implement it instead of me, because every single line of regonaut was written manually with strict adherence to the ECMAScript spec, and I don't want to violate that principle.

Absolutely! No need to rush this thing. Thanks a bunch for your contribution! 😊

3 participants