Improve atoms extraction from alternations #245

Merged
vthib merged 12 commits into master from
improve-atoms-extraction-from-alternations
Nov 20, 2025
@vthib (Owner) commented Nov 20, 2025

This MR brings a much needed improvement in the atoms extraction
logic so that several rules used in the wild that had performance
issues can be properly handled.

The main idea of the change is to be able to join alternates to
surrounding bytes while allowing the alternates to contain jumps.

The idea is to split alternates into a run open on the left, and
a run open on the right. For example, for this regex:

abc.def.ghi

  • the "abc" part is open on the left
  • the "def" part is closed in the middle
  • the "ghi" part is open on the right
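The classification above can be sketched in a few lines. This is only an illustration under simplifying assumptions: the pattern is treated as plain literal bytes separated by `.` jumps, and `Run` and `split_runs` are hypothetical names, not the types used in boreal.

```rust
/// A run of consecutive literal bytes inside a simplified pattern.
/// (Hypothetical type, for illustration only.)
#[derive(Debug, PartialEq)]
struct Run {
    bytes: Vec<u8>,
    /// No jump precedes this run, so it can be joined with bytes on its left.
    open_left: bool,
    /// No jump follows this run, so it can be joined with bytes on its right.
    open_right: bool,
}

/// Splits a pattern made of literal bytes and `.` jumps into its runs.
/// Assumption: `.` is the only non-literal element in the pattern.
fn split_runs(pattern: &str) -> Vec<Run> {
    let parts: Vec<&str> = pattern.split('.').collect();
    let last = parts.len() - 1;
    parts
        .iter()
        .enumerate()
        // Consecutive jumps produce empty parts, which are not runs.
        .filter(|(_, part)| !part.is_empty())
        .map(|(i, part)| Run {
            bytes: part.as_bytes().to_vec(),
            open_left: i == 0,
            open_right: i == last,
        })
        .collect()
}

fn main() {
    for run in split_runs("abc.def.ghi") {
        println!(
            "{:?}: open_left={}, open_right={}",
            String::from_utf8_lossy(&run.bytes),
            run.open_left,
            run.open_right
        );
    }
}
```

In this model only the first run can be joined with bytes preceding the pattern, and only the last run with bytes following it; the middle run is enclosed by jumps on both sides.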

Let's see this in an alternation:

  • a(1.2.3|4.5.6)z

This can be split as follows:

  • <prefix>(<pre1>.<mid1>.<post1>|<pre2>.<mid2>.<post2>)<suffix>

There are three possible atoms from this:

  • <prefix><pre> (i.e. ["a1", "a4"] in the example)
  • <mid> (i.e. ["2", "5"] in the example)
  • <post><suffix> (i.e. ["3z", "6z"] in the example)
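As a rough sketch of how the left and right atom sets in this example come about, the following joins the surrounding bytes to the first and last run of each alternate. It assumes each alternate is made only of literal bytes and `.` jumps; `candidate_atoms` is a hypothetical helper, not a function from boreal.

```rust
/// Computes the two joinable atom sets for
/// `<prefix>(<alt>|<alt>|...)<suffix>` in a simplified model where each
/// alternate is literal bytes separated by `.` jumps.
/// (Hypothetical helper, for illustration only.)
fn candidate_atoms(
    prefix: &str,
    alternates: &[&str],
    suffix: &str,
) -> (Vec<String>, Vec<String>) {
    let mut left = Vec::new();
    let mut right = Vec::new();
    for alt in alternates {
        // The run open on the left is everything before the first jump.
        let pre = alt.split('.').next().unwrap_or("");
        // The run open on the right is everything after the last jump.
        let post = alt.rsplit('.').next().unwrap_or("");
        left.push(format!("{}{}", prefix, pre));
        right.push(format!("{}{}", post, suffix));
    }
    (left, right)
}

fn main() {
    let (left, right) = candidate_atoms("a", &["1.2.3", "4.5.6"], "z");
    println!("left atoms:  {:?}", left);
    println!("right atoms: {:?}", right);
}
```

The middle runs are deliberately left out of this sketch, since they cannot be joined with anything outside the alternation.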

This is simple in theory, but not that trivial to handle, especially
as the regex must also be split into a reverse and a forward validator
from the position of the extracted literals.

One illustration of this is that the "<mid>" extraction
is actually impossible: since nothing links the reverse and
forward parts around the literal, false positives could be generated.
One such example is the regex "1.2.3|4.2.5" with the extracted
literal "2".
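To make the false positive concrete, here is a small hand-rolled check for that regex. It is a sketch under assumptions: `.` is treated as "any single byte", and `full_match` and `independent_validators` are hypothetical names that model the idea rather than boreal's validators.

```rust
/// Matches one alternate of the simplified regex against a window of the
/// same length, where `.` stands for any single byte.
fn branch_matches(branch: &[u8], window: &[u8]) -> bool {
    branch.len() == window.len()
        && branch.iter().zip(window).all(|(p, b)| *p == b'.' || p == b)
}

/// Ground truth: does "1.2.3|4.2.5" match with its "2" at index `pos`?
fn full_match(input: &[u8], pos: usize) -> bool {
    if pos < 2 || pos + 3 > input.len() {
        return false;
    }
    let window = &input[pos - 2..pos + 3];
    branch_matches(b"1.2.3", window) || branch_matches(b"4.2.5", window)
}

/// Naive validation around the literal "2": the reverse part ("1." or "4.")
/// and the forward part (".3" or ".5") are checked independently, with
/// nothing tying them to the same alternate.
fn independent_validators(input: &[u8], pos: usize) -> bool {
    if pos < 2 || pos + 3 > input.len() {
        return false;
    }
    let reverse_ok = matches!(input[pos - 2], b'1' | b'4');
    let forward_ok = matches!(input[pos + 2], b'3' | b'5');
    input[pos] == b'2' && reverse_ok && forward_ok
}

fn main() {
    // "1x2x5" mixes the start of one alternate with the end of the other.
    let input = b"1x2x5";
    println!("independent validators accept: {}", independent_validators(input, 2));
    println!("regex actually matches:        {}", full_match(input, 2));
}
```

On the input `1x2x5` the independent validators accept while the regex itself does not match, which is exactly the kind of false positive described above.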

To properly handle the generation of the reverse and forward validator,
the position of the literals inside the alternation must be reused
properly to split the alternation at the right places for the validators
to function. This requires a rework of this logic which makes it a
bit more complex, but this stays fairly manageable.

vthib added 12 commits November 20, 2025 23:53
Doing so is never worth it and complicates the literals
extraction combinations, so just remove this feature.

Instead of splitting each byte into its own part, join
them so that a part is a run of literals. This will be helpful
in future commits.

Instead of merging each completed run inside the extractor
into a list of literals + quality, separate the concerns better:
the extractor simply extracts the different runs out of the HIR,
then separate logic computes the best run.

Penalize bad bytes harder, but remove space from the list.
This should improve the quality of extracted atoms.

If two literals have the same atoms quality, prefer the longer
one: validating the literal is the first step of validating a match,
and picking a longer literal means a higher chance of detecting false
positives during this step.

When iterating over subslices of components that are used to
generate literals, the loop should go to the next component
only when it is no longer involved in the atom computation.
This was not correctly computed before.

Instead of comparing the end_position of a hir part to the last
position of the visitor, use None to mark a part that is at the
end. This is more work to maintain during the visit, but will
greatly simplify the pre post extractor for the upcoming changes.

Instead of extracting both the pre part and the post part, only
do one of those things in a visit. This keeps the code
simpler and will be useful in the next commits where we want to
extract only the pre or post but not both.

codecov bot commented Nov 20, 2025

Codecov Report

❌ Patch coverage is 99.80431% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 98.34%. Comparing base (45fb70e) to head (c7a73c5).
⚠️ Report is 12 commits behind head on master.

Files with missing lines          Patch %   Lines
boreal/src/matcher/literals.rs    99.80%    1 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##           master     #245    +/-   ##
========================================
  Coverage   98.34%   98.34%            
========================================
  Files          95       95            
  Lines       26762    27027   +265     
========================================
+ Hits        26319    26580   +261     
- Misses        443      447     +4     

@vthib vthib merged commit b3de022 into master Nov 20, 2025
22 checks passed
@vthib vthib deleted the improve-atoms-extraction-from-alternations branch November 20, 2025 23:46