Commit 34432af

Make regex-filtered re2 bench easier to run
Also make both re2 and regex benches run using simplified / down-compiled regexes.

At this point, regex-filtered trounces FilteredRE2 on CPU *for the specific job of ua-parsing*, and its memory use has gotten quite reasonable (overhead compared to re2 is down to just 20%):

```sh
> /usr/bin/time -l target/bench_re2 \
    target/devices.regexes regex-filtered/samples/useragents.txt 100 -q
633 regexes 1630 atoms in 0.0200757s
prefilter built in 0.00527812s
75158 user agents in 0.0353107s
       49.67 real        48.88 user         0.35 sys
            43958272  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                3499  page reclaims
                 289  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   2  voluntary context switches
               28247  involuntary context switches
        600560455436  instructions retired
        157226396942  cycles elapsed
            35309696  peak memory footprint

> /usr/bin/time -l target/release/examples/bench_regex \
    target/devices.regexes regex-filtered/samples/useragents.txt -r 100 -q
633 regexes in 0.051890332s
75158 user agents in 0.007460291s
       38.93 real        38.52 user         0.22 sys
            43958272  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                2798  page reclaims
                  98  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
               17680  involuntary context switches
        372376225085  instructions retired
        123339367792  cycles elapsed
            42370560  peak memory footprint
```

note: bench.rs was renamed to stop conflicting with the one in ua-parser, and make the two bench programs easier to differentiate.

Also one day I need to look into the difference between maximum rss and peak memory footprint on macos. It seems weird that RSS matches between the two programs, and RSS and peak match for rust, but re2's peak is 25% lower.
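The ratios quoted in the message can be re-derived from the two `/usr/bin/time -l` reports. The sketch below is not part of the commit: the dictionary keys and the `relative_change` helper are names invented here for illustration, and only the numbers come from the output above. It suggests that the 20% memory overhead refers to peak memory footprint, and that re2's peak footprint sits roughly 20% below its maximum RSS (equivalently, its RSS is about 25% above its peak).

```python
# Figures copied from the two `/usr/bin/time -l` runs in the commit message.
# The key names are illustrative labels, not anything the tools emit.
RE2 = {
    "user_s": 48.88,
    "instructions": 600_560_455_436,
    "max_rss": 43_958_272,
    "peak_footprint": 35_309_696,
}
REGEX = {
    "user_s": 38.52,
    "instructions": 372_376_225_085,
    "max_rss": 43_958_272,
    "peak_footprint": 42_370_560,
}


def relative_change(new: float, old: float) -> float:
    """Signed change of `new` relative to `old` (0.2 means 20% more)."""
    return (new - old) / old


# regex-filtered's memory overhead over re2, measured on peak footprint.
memory_overhead = relative_change(REGEX["peak_footprint"], RE2["peak_footprint"])
# CPU comparison: user time and retired instructions.
cpu_change = relative_change(REGEX["user_s"], RE2["user_s"])
instruction_change = relative_change(REGEX["instructions"], RE2["instructions"])
# How far re2's peak footprint sits below its own maximum RSS.
re2_peak_vs_rss = relative_change(RE2["peak_footprint"], RE2["max_rss"])

if __name__ == "__main__":
    print(f"peak footprint, regex vs re2: {memory_overhead:+.1%}")
    print(f"user time, regex vs re2:      {cpu_change:+.1%}")
    print(f"instructions, regex vs re2:   {instruction_change:+.1%}")
    print(f"re2 peak vs re2 max RSS:      {re2_peak_vs_rss:+.1%}")
```

Both runs use 100 repetitions of the same sample file, so the comparison is like for like, but these are single measurements rather than averaged results.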
1 parent 18fab27 commit 34432af

6 files changed, +147 -12 lines changed

.gitignore

Lines changed: 2 additions & 2 deletions
@@ -1,7 +1,7 @@
 /target
 .DS_Store
 *.dSYM/
-regex-filtered/re2/flake.lock
-regex-filtered/re2/bench
+flake.lock
 .tox/
+*/uv.lock
 __pycache__

Makefile

Lines changed: 24 additions & 0 deletions
@@ -0,0 +1,24 @@
+CXXFLAGS += -std=c++20 -Wall -Werror -g -fPIC -O3
+LDFLAGS += -lre2
+
+.PHONY: bench
+
+bench: target/bench_re2 target/devices.regexes target/release/examples/bench_regex
+	/usr/bin/time -l target/bench_re2 \
+	    target/devices.regexes regex-filtered/samples/useragents.txt 100 -q
+	/usr/bin/time -l target/release/examples/bench_regex \
+	    target/devices.regexes regex-filtered/samples/useragents.txt -r 100 -q
+
+target/bench_re2: regex-filtered/re2/bench.cpp
+	# build re2 bench, requires re2 to be LD-able, can `nix develop` for setup
+	@mkdir -p target
+	$(CXX) $(CXXFLAGS) $^ -o $@ $(LDFLAGS)
+
+target/release/examples/bench_regex: regex-filtered/examples/bench_regex.rs regex-filtered/src/*
+	# build regex bench
+	cargo build --release --example bench_regex -q
+
+target/devices.regexes: scripts/devices ua-parser/uap-core/regexes.yaml
+	# compiles regexe.yaml to a list of just the device regex (with embedded flags)
+	@mkdir -p target
+	uv run --script $^ > $@
File renamed without changes.

regex-filtered/re2/Makefile

Lines changed: 0 additions & 10 deletions
This file was deleted.

scripts/devices

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
+#!/usr/bin/env python
+# /// script
+# requires-python = ">=3.10"
+# dependencies = [
+# "pyyaml",
+# ]
+# ///
+r"""Compiles regexes.yaml to just the device regexps, with rewriting:
+
+- Rust's `regex` implements perl-style character classes with full
+  unicode semantics making them much more expensive than re2's
+  ascii-only semantics, so compile down the most frequent ones down to
+  ascii classes.
+
+  regexes.yaml uses \d, \w, \s, \S, \b, and the first one is the most
+  common by two orders of magnitude (but convert \w as well because I
+  dun so already, converting \s might be a good idea too)
+
+- Both `regex` and `re2` suffer tremendously from large bounded
+  repetitions as they need to create a *ton* of states to keep track
+  of the limit. This mostly affects memory consumption (and the issue
+  compounds when captures are added to the mix), but there is a minor
+  CPU hit as well.
+
+  In regexes.yaml, large bounded repetitions were introduced only to
+  limit the risks of catastrophic backtracking in backtracking
+  engines. Which neither re2 nor regex are.
+
+  So compile large bounded repetition (where heuristically "large" is
+  3 digits in the upper bound) back to simple unbounded repetitions.
+  Note that this is only done for a lower bound of `0` or `1`, but
+  that's the case of all large bounded repetitions in regexes.yaml.
+"""
+import string
+import sys
+
+from yaml import SafeLoader, load
+
+def main() -> None:
+    with open(sys.argv[1]) as f:
+        regexes = load(f, Loader=SafeLoader)
+    for dev in regexes['device_parsers']:
+        print(
+            f'(?{f})' if (f := dev.get('regex_flag')) else '',
+            rewrite(dev['regex']),
+            sep='',
+        )
+
+def rewrite(re: str) -> str:
+    from_ = 0
+    out = []
+    it = enumerate(re)
+    escape = False
+    inclass = False
+
+    cont = True
+    while cont and (e := next(it, None)):
+        idx, c = e
+        match c:
+            case '\\' if not escape:
+                escape = True
+                continue
+            case '[' if not escape:
+                inclass = True
+            case ']' if not escape:
+                inclass = False
+            case 'd' if escape:
+                out.append(re[from_:idx-1])
+                from_ = idx+1
+                if inclass:
+                    out.append('0-9')
+                else:
+                    out.append('[0-9]')
+            case 'w' if escape:
+                out.append(re[from_:idx-1])
+                from_ = idx+1
+                if inclass:
+                    out.append('A-Za-z0-9_')
+                else:
+                    out.append('[A-Za-z0-9_]')
+            case '{' if not escape and not inclass:
+                if not idx:
+                    return re
+
+                try:
+                    _, start = next(it)
+                except StopIteration:
+                    continue
+                if start not in '01':
+                    continue
+
+                try:
+                    _, comma = next(it)
+                except StopIteration:
+                    continue
+                else:
+                    if comma != ',':
+                        continue
+
+                digits = 0
+                for ri, rc in it:
+                    match rc:
+                        case c if c in string.digits:
+                            digits += 1
+                        case '}' if digits > 2:
+                            out.append(re[from_:idx])
+                            from_ = ri + 1
+                            out.append('*' if start == '0' else '+')
+                        case _:
+                            break
+            case _:
+                pass
+        escape = False
+
+    if from_ == 0:
+        return re
+    out.append(re[from_:])
+    return ''.join(out)
+
+
+if __name__ == "__main__":
+    main()
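
The two rewrites described in the script's docstring can be illustrated in isolation. The following is a minimal sketch, not the commit's implementation: `matches`, `LARGE_BOUND` and `simplify_repetitions` are hypothetical names, and the regex-based substitution deliberately ignores escaping and character classes, which the real `rewrite` state machine tracks. Python's `re` is used as a stand-in for Unicode-aware `\d`, since on `str` patterns it also matches any Unicode decimal digit.

```python
import re

# ARABIC-INDIC DIGIT THREE: a decimal digit outside the ASCII range.
ARABIC_THREE = "\u0663"

# Assumption for illustration: "large" means an upper bound of 3+ digits and
# a lower bound of 0 or 1, mirroring the heuristic in the docstring.
LARGE_BOUND = re.compile(r"\{([01]),[0-9]{3,}\}")


def matches(pattern: str, text: str) -> bool:
    """True when `pattern` matches the whole of `text`."""
    return re.fullmatch(pattern, text) is not None


def simplify_repetitions(pattern: str) -> str:
    """Turn {0,NNN} into * and {1,NNN} into +, leaving other bounds alone.

    Simplified on purpose: unlike the script, this would also rewrite a
    brace sequence that is escaped or that sits inside a character class.
    """
    return LARGE_BOUND.sub(
        lambda m: "*" if m.group(1) == "0" else "+", pattern
    )


if __name__ == "__main__":
    # Unicode-aware \d accepts the non-ASCII digit, [0-9] does not.
    print(matches(r"\d", ARABIC_THREE), matches(r"[0-9]", ARABIC_THREE))
    # Only the large bounds with a lower bound of 0 or 1 are rewritten.
    print(simplify_repetitions(r"a{0,100}b{1,200}c{2,300}d{0,50}"))
```

Replacing a bounded repetition with an unbounded one does change what the pattern accepts for inputs longer than the bound. The docstring's argument is that those bounds only existed to protect backtracking engines, so the trade is considered acceptable for this benchmark.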
