Improve atoms extraction from alternations #245
Merged
Doing so is never worth it and complicates the literals extraction combinations, so just remove this feature.
Instead of splitting each byte into its own part, join them so that a part is a run of literals. This will be helpful in future commits.
Instead of merging each completed run inside the extractor into a list of literals + quality, separate the concerns better: the extractor simply extracts the different runs out of the HIR, then a separate step computes the best run.
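The separation described above can be sketched roughly as follows. This is a simplified illustration, not the code from this MR: the HIR is modelled as a flat list where an integer is a literal byte and `None` stands for any non-literal part, and the names `extract_runs` and `pick_best_run` are hypothetical.

```python
from typing import Callable, List, Optional


def extract_runs(parts: List[Optional[int]]) -> List[bytes]:
    """Group consecutive literal bytes into runs.

    Assumption: `None` marks a non-literal part (for example a jump),
    which ends the current run.
    """
    runs: List[bytes] = []
    current = bytearray()
    for part in parts:
        if part is None:
            # A non-literal part closes the run being built, if any.
            if current:
                runs.append(bytes(current))
                current = bytearray()
        else:
            current.append(part)
    if current:
        runs.append(bytes(current))
    return runs


def pick_best_run(
    runs: List[bytes], quality: Callable[[bytes], int]
) -> Optional[bytes]:
    """Separate step: choose a run using a caller-provided quality function."""
    return max(runs, key=quality, default=None)
```

The extractor only collects runs; how a run is scored is left entirely to the `quality` callable, so the scoring can change without touching the extraction.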
Penalize bad bytes more heavily, but remove the space character from the list of bad bytes. This should improve the quality of extracted atoms.
If two literals have the same atoms quality, prefer the longer one: validating the literal is the first step of validating a match, and picking a longer literal means a higher chance of detecting false positives during this step.
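As a minimal sketch of this tie-breaking rule, assuming each candidate is a `(literal, quality)` pair where a higher quality is better (the function name `pick_literal` and the quality values are made up for illustration):

```python
from typing import List, Tuple


def pick_literal(candidates: List[Tuple[bytes, int]]) -> bytes:
    """Pick the best literal: highest quality first, then longest literal."""
    # Comparing (quality, length) tuples makes length matter only on ties.
    best = max(candidates, key=lambda c: (c[1], len(c[0])))
    return best[0]
```

For example, `pick_literal([(b"abcd", 10), (b"abcdefgh", 10)])` returns `b"abcdefgh"`, while a shorter literal with a strictly higher quality still wins.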
When iterating over subslices of components that are used to generate literals, the loop should move on to the next component only when the current one is no longer involved in the atom computation. This was not correctly computed before.
Instead of comparing the end_position of a hir part to the last position of the visitor, use None to mark a part that is at the end. This is more work to maintain during the visit, but will greatly simplify the pre/post extractor for the upcoming changes.
Instead of extracting both the pre part and the post part, only do one of those in a single visit. This keeps the code simpler and will be useful in the next commits, where we want to extract only the pre or the post but not both.
This MR brings a much-needed improvement to the atoms extraction logic, so that several rules used in the wild that had performance issues can be handled properly.

The main idea of the change is to be able to join alternates to surrounding bytes while allowing the alternates to contain jumps. To do so, alternates are split into a run open on the left and a run open on the right. For example, for this regex:

abc.def.ghi

- the "abc" part is open on the left
- the "def" part is closed in the middle
- the "ghi" part is open on the right

Let's see this in an alternation:

- a(1.2.3|4.5.6)z

This can be split as follows:

- <prefix>(<pre1>.<mid1>.<post1>|<pre2>.<mid2>.<post2>)<suffix>

There are three possible atoms from this:

- <prefix><pre> (i.e. ["a1", "a4"] in the example)
- <mid> (i.e. ["2", "5"] in the example)
- <post><suffix> (i.e. ["3z", "6z"] in the example)

This is simple in theory, but not that trivial to handle, especially as the regex must also be split into a reverse and a forward validator from the position of the extracted literals.

One illustration of this is that the "<mid>" extraction is actually impossible: since nothing links the reverse part and the forward part around the literal, false positives could be generated. One such example would be the regex "1.2.3|4.2.5" and the extracted literal "2".

To properly handle the generation of the reverse and forward validators, the position of the literals inside the alternation must be reused to split the alternation at the right places for the validators to function. This requires a rework of this logic, which makes it a bit more complex, but it stays fairly manageable.
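The split can be illustrated with a small sketch. It is not the implementation from this MR: branches are handled as plain strings where "." is the only jump, and `alternation_atoms` and `independent_validation` are hypothetical helpers. The second helper models the reverse and forward validators as two unrelated regexes, which is an assumption made only to show where the "<mid>" false positive comes from.

```python
import re
from typing import List, Tuple


def alternation_atoms(
    prefix: str, branches: List[str], suffix: str
) -> Tuple[List[str], List[str]]:
    """Return the <prefix><pre> atoms and the <post><suffix> atoms.

    Assumption: each branch is a string in which "." is a jump.
    """
    pre_atoms = []
    post_atoms = []
    for branch in branches:
        runs = branch.split(".")
        pre_atoms.append(prefix + runs[0])    # first run joins the prefix
        post_atoms.append(runs[-1] + suffix)  # last run joins the suffix
    return pre_atoms, post_atoms


def independent_validation(data: str) -> bool:
    """Validate around the literal "2" for the regex 1.2.3|4.2.5.

    The two sides are checked with no link between them, which is the
    situation that makes the <mid> extraction unsound.
    """
    reverse = re.compile(r"(?:1.|4.)$")  # what may precede "2"
    forward = re.compile(r"(?:.3|.5)")   # what may follow "2"
    pos = data.index("2")
    before, after = data[:pos], data[pos + 1:]
    return bool(reverse.search(before)) and bool(forward.match(after))


full_regex = re.compile(r"1.2.3|4.2.5")
```

With the example from the description, `alternation_atoms("a", ["1.2.3", "4.5.6"], "z")` gives `(["a1", "a4"], ["3z", "6z"])`. For the input `"1x2x5"`, `independent_validation` accepts it (the left side comes from the first branch, the right side from the second) although `full_regex` does not match it.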
Codecov Report❌ Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## master #245 +/- ##
========================================
Coverage 98.34% 98.34%
========================================
Files 95 95
Lines 26762 27027 +265
========================================
+ Hits 26319 26580 +261
- Misses 443 447 +4