Description
1af15c1 added an re2 bench, and at the time regex-filtered was significantly slower and way more memory hungry. This led to both fixes (4f1c7df, f500c57) and investigations (rust-lang/regex#1206) leading to more fixes (29b9195) and fixes to the fixes (#29).
At this point regex-filtered looks like it'd be quite close to memory parity with re2 (certainly a lot closer than 3x), but I've not run the comparison head to head in a while.
This includes regex simplification, which I think is warranted: re2 uses ASCII interpretations of perl-style character classes whereas regex uses a much costlier unicode interpretation, so without regex downgrade / translation we're comparing apples to marine engines.
That said, the bounded repetition conversion still needs to be implemented for the re2 bench to make the comparison fair, as re2 does suffer from that issue as well.
Finally, I should probably go and report the perl-style unicode issue to burntsushi again. While I think full unicode interpretation is a good default:
- manually converting character classes is not trivial and adds its own overhead (maybe minimal compared to regex construction itself but still)
- other engines use ASCII (only) character classes which kinda makes comparison to them unfair (JS, re2, golang)
`-u` is not really a solution, notably because it makes all matching byte-wise.
It could also make sense to create another regex extension crate which does this simplification (optionally for bounded repetition as well, I guess).
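For illustration, here is a minimal sketch of what the character class downgrade could look like as a plain string rewrite. `ascii_classes` is a hypothetical name, not an existing API of regex or regex-filtered, and the sketch makes several assumptions: it only handles `\d` and `\w`, leaves `\s` and the negated forms (`\D`, `\W`, `\S`) untouched, and tracks bracket classes naively (a literal `]` placed first in a class would confuse it). A real implementation would more likely work on the parsed pattern than on the raw string.

```rust
/// Hypothetical helper (not part of any existing crate): rewrite the
/// perl-style classes `\d` and `\w` into explicit ASCII ranges so the
/// engine does not expand them to their full unicode definitions.
fn ascii_classes(pattern: &str) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    // Naive nesting depth of `[...]` classes; inside a class the
    // replacement must not add its own brackets.
    let mut depth = 0usize;
    while let Some(c) = chars.next() {
        match c {
            '\\' => match chars.next() {
                Some('d') => out.push_str(if depth > 0 { "0-9" } else { "[0-9]" }),
                Some('w') => {
                    out.push_str(if depth > 0 { "A-Za-z0-9_" } else { "[A-Za-z0-9_]" })
                }
                // Any other escape (including `\\`) is copied verbatim,
                // so an escaped backslash followed by `d` is left alone.
                Some(other) => {
                    out.push('\\');
                    out.push(other);
                }
                // Trailing lone backslash: keep it and let the engine complain.
                None => out.push('\\'),
            },
            '[' => {
                depth += 1;
                out.push(c);
            }
            ']' => {
                depth = depth.saturating_sub(1);
                out.push(c);
            }
            _ => out.push(c),
        }
    }
    out
}
```

With this, a pattern such as `\d+\.\w*` would be rewritten to `[0-9]+\.[A-Za-z0-9_]*` before being handed to the regex builder, which is roughly the overhead the first bullet above is worried about.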