This directory contains a Rust runner program for benchmarking Rust's regex
crate. The regex crate is authored by the same person who
created this barometer, and the barometer's early motivation was for tracking
and improving the performance of the regex crate.
This runner program makes the following decisions:
- Multiple patterns are supported by using the lower level
meta::RegexAPI fromregex-automatainstead of the APIs from the actualregexcrate. (See the section below for why we do this.) - UTF-8 mode is always disabled. This just means that matches that split a codepoint may be reported and generally doesn't have much impact on performance.
- We increase the size limit quite a bit so that the regex crate builds some of the larger regexes. This does not result in any performance improvement. It just increases a limit inside the regex compiler and permits bigger things to be constructed. We specifically do not increase the DFA size limit, which could result in faster search times by virtue of giving the DFA more space to store transitions, and thus less of a chance of clearing its cache or falling back to a slower regex engine.
The regex crate is split into two APIs. The top-level Regex is for searching
&str and the bytes::Regex API is for searching &[u8]. The only
difference between &str and &[u8] is that &str is guaranteed to be valid
UTF-8. In general, &str is Rust's standard string type (with String being
the corresponding "owned" or heap allocating string buffer).
Since &str is guaranteed to be valid UTF-8 and slicing it at offsets in the
middle of a codepoint will result in a panic, the top-level Regex value
rejects any regex pattern that could match invalid UTF-8 (which includes
matching within a codepoint).
For &[u8] though, which can contain arbitrary bytes, there is no need for
such a restriction. And indeed, that's why we use the bytes::Regex API in
this benchmark. It is the maximally flexible API, and there is generally no
performance difference between the &str and &[u8] APIs. Indeed, they both
are just thin wrappers around an internal API that is just defined on &[u8].
This runner program actually uses the regex-automata crate instead of the
regex crate. The reasons for doing this are a little ham-fisted, but roughly
as follows:
- The
regexcrate is quite literally a API wrapper aroundregex-automata'smeta::RegexAPI. Theregexcrate doesn't do any added processing. This design was specifically used so that folks could drop down to the lower level API without much fuss. - The
regexcrate API doesn't have good support for multi-pattern searching, but themeta::RegexAPI inregex-automatasupports it natively. Theregexcrate does have aRegexSetAPI, but at the time of writing, it's limited to determining whether zero or more regexes match anywhere in a haystack. It doesn't report match offsets or capture groups.meta::Regexdoes. Since some benchmarks want multi-pattern support (to square off against Hyperscan), it is really confusing to use, say, both arust/regexand arust/regex/metaengine. - In fairness to (2) above, I did actually try to maintain two distinct
engines in the benchmarks. That is,
rust/regexandrust/regex/meta. Everywhererust/regexwas used,rust/regex/metawas also used. Butrust/regex/metawas also used in the multi-pattern benchmarks, whichrust/regexcould not participate in because of API limitations. The problem with this approach is that it was really confusing to have both of these engines in the results. It's essentially inside baseball. Moreover, it made comparing Rust's regex crate with other engines more difficult than necessary, and in some cases, impossible without awkward changes to the tooling that would permit engine names to alias in certain cases. - Speed is actually the same in practice between
rust/regexandrust/regex/metafor all benchmarks in which both can participate. I actually checked this, but it also should be true since aregex::Regexis quite literally a thin wrapper aroundmeta::Regex.
What I did to test (4) above was have both rust/regex and rust/regex/meta
engines. I ran many benchmarks on both (i.e., all single pattern benchmarks). I
then compared the results. If my thesis was correct, I should observe largely
similar timings. We can check at a high level via rebar rank for search
benchmarks:
$ rebar rank record/curated/2023-04-29/*.csv --intersection -M compile -e '^rust/regex(/meta)?$'
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/regex 1.8.1 1.01 32
rust/regex/meta 0.3.0 1.01 32
And for compilation benchmarks:
$ rebar rank record/curated/2023-04-29/*.csv --intersection -m compile -e '^rust/regex(/meta)?$'
Engine Version Geometric mean of speed ratios Benchmark count
------ ------- ------------------------------ ---------------
rust/regex/meta 0.3.0 1.01 12
rust/regex 1.8.1 1.01 12
I also looked at individual benchmark results to ensure the geometric mean
wasn't giving a false impression. It wasn't. You can also look at this data
yourself too by using the data in the above commands. (It's committed to the
repository.) Just change rebar rank in the above commands to rebar cmp.
Overall, I do personally think that not actually using the regex crate in
benchmarks that purport to measure the regex crate is quite suspicious, but
I did feel like this path was the least confusing. If, in the future, the
regex crate gains a more expressive API (I'd love if RegexSet could at
least become that in a backwards compatible manner, but it's tricky) that
supports multi-pattern searches, then I'll move the runner program over to the
real regex crate APIs.
I'm open to other ideas if folks have them.