Use an atom length of 2 for the regex filtering #31

masklinn · 2025-05-11T09:53:26Z

As it turns out, a significant number of regexes have distinguishing atoms of length 2 rather than 3, so prefiltering significantly under-performs with the default settings. For example, when parsing sample 9997 (of the `sort -u` of the sample file), the default settings prefilter from 633 regexes down to 61, of which the matching regex is number 50, leading to a lot of `Regex::is_match` calls.

Looking at the "extra" regexes, while they do have pretty long atoms, those tend to be optional; the only required atoms are very short. Reducing the atom length to 2 brings the prefiltered set down to 20, of which the regex we're looking for is the 14th. This cuts the post-prefiltering matching from 6µs to 2µs (on top of roughly 2µs of prefiltering, which barely changes: it goes from 2.2µs to 2.3µs).
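To make the mechanism concrete, here is a minimal, std-only sketch of why the minimum atom length matters. It is not the crate's implementation: `Entry`, `sample_entries` and `prefilter` are hypothetical names, the atoms are invented rather than extracted from the real regexes, and real prefilters also handle alternations and optional parts. The point it illustrates is only that a regex whose required atoms are all shorter than the minimum length has nothing to be filtered on, so it always survives prefiltering.

```rust
/// Toy stand-in for one regex in the prefilter: only the literal
/// substrings ("atoms") that every match must contain are kept.
/// Names and atoms are hypothetical, not taken from the real data set.
struct Entry {
    name: &'static str,
    required_atoms: &'static [&'static str],
}

/// Hypothetical device-like entries: three whose only required atom has
/// length 2, and one with a long required atom.
fn sample_entries() -> Vec<Entry> {
    vec![
        Entry { name: "xy_device", required_atoms: &["XY"] },
        Entry { name: "qz_device", required_atoms: &["QZ"] },
        Entry { name: "ab_device", required_atoms: &["AB"] },
        Entry { name: "android_generic", required_atoms: &["Android"] },
    ]
}

/// Returns the names of the entries that survive prefiltering and would
/// still need a full regex match. Atoms shorter than `min_atom_len` are
/// ignored, so an entry left with no usable atom can never be ruled out.
fn prefilter(entries: &[Entry], haystack: &str, min_atom_len: usize) -> Vec<&'static str> {
    entries
        .iter()
        .filter(|e| {
            e.required_atoms
                .iter()
                .copied()
                .filter(|atom| atom.len() >= min_atom_len)
                .all(|atom| haystack.contains(atom))
        })
        .map(|e| e.name)
        .collect()
}
```

With the made-up user agent `Mozilla/5.0 (Linux; Android 9; XY-100)`, a minimum length of 3 leaves all four entries as candidates, while a minimum length of 2 rules out `qz_device` and `ab_device`.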

This leads to a 15% perf increase on the benchmark, at no visible memory cost (the differences in maximum RSS and peak footprint are lost in the noise). Before:

Lines: 751580
Total time: 8.139572291s
10µs / line
        8.25 real         8.21 user         0.03 sys
            57655296  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                3732  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                  85  involuntary context switches
         74982477832  instructions retired
         26557964231  cycles elapsed
            54461952  peak memory footprint

after:

Lines: 751580
Total time: 6.797529459s
9µs / line
        6.91 real         6.86 user         0.04 sys
            57802752  maximum resident set size
                   0  average shared memory size
                   0  average unshared data size
                   0  average unshared stack size
                3741  page reclaims
                   0  page faults
                   0  swaps
                   0  block input operations
                   0  block output operations
                   0  messages sent
                   0  messages received
                   0  signals received
                   0  voluntary context switches
                 154  involuntary context switches
         65652792138  instructions retired
         22207284899  cycles elapsed
            54478080  peak memory footprint
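For anyone wanting to re-derive the headline figures, the arithmetic on the two runs above can be sketched as follows. The constants are copied from the benchmark output; the function names are made up for this example. By this calculation the per-line cost goes from roughly 10.8µs to roughly 9.0µs (the benchmark's own `10µs` and `9µs` appear to be truncated), which is about 16% less wall-clock time.

```rust
/// Wall-clock figures copied from the two benchmark runs above.
const LINES: f64 = 751_580.0;
const BEFORE_SECS: f64 = 8.139572291;
const AFTER_SECS: f64 = 6.797529459;

/// Average time per line, in microseconds.
fn micros_per_line(total_secs: f64) -> f64 {
    total_secs / LINES * 1e6
}

/// Fraction of the original run time that was saved.
fn time_saved(before: f64, after: f64) -> f64 {
    (before - after) / before
}

/// Relative gain in lines processed per second.
fn throughput_gain(before: f64, after: f64) -> f64 {
    before / after - 1.0
}
```

The same `time_saved` helper can be applied to the `instructions retired` and `cycles elapsed` counters to compare them with the wall-clock change.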

Trying it out confirms ua-parser#31, and the better introspectability of FilteredRE2 explains why: it turns out the data set has a pretty small number of atoms of length 2 with high discriminatory power. Lowering the length to 2 increases the number of atoms from 1630 to just 1865 (+235, +14.4%), which explains why memory use is unaffected or even goes down (some regexes which match none of the samples are likely not even tried anymore), while performance increases *dramatically* (48s -> 27s for re2, 38s -> 24s for regex). This makes sense, as devices are also where ua-parser#31 got extreme bang for its buck.

It's a bit sad seeing re2 catch up so much with our hard work, but it makes sense if we assume `regex` has more optimised regex matching at the cost of memory: with better discrimination we drastically decrease the amount of regex matching, which benefits the package with the slower regex matching. Although, to be fair, the re2 bench could also be slower due to the use of an `re2::Set` instead of an aho-corasick automaton. In fact that's pretty likely.

However, the effect seems non-existent to slightly negative for UA and OS:

- At 3-atoms, UAs have 849 atoms for 362 regexes, and both re2 and regex run in about 10s (9.70~9.90 real). Interestingly, the RSS and memory footprint of regex are a lot lower there (25MB to 32~33 footprint).
- At 2-atoms, UAs have 874 atoms for 362 regexes, and both re2 and regex run a bit slower, around 10.50 for re2 and 10.40 for regex. Memory use is the same.
- OS is basically in between: going from 3-atoms to 2-atoms, the number of atoms increases by a hair, from 353 to 359 (for 201 regexes). The re2 performance remains stable (8.15~8.40) while regex gets a hair slower (from 7.10~7.20 to 7.60~7.70).

Note that this is all over 100 runs parsing 75158 user agents.

But that hints that maybe different configurations for the ua and device parsers would make sense...

Fixes ua-parser#30
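The closing idea of per-parser configuration could be sketched roughly like this. It is only an illustration: the timings are the `regex` figures quoted in the commit message above, while `Measurement`, `quoted_regex_timings` and `pick_atom_len` are hypothetical names, and the selection rule (switch to 2 only when it is clearly faster) is an assumption rather than anything the project implements.

```rust
/// Run-time range (low, high) in seconds for one configuration.
type SecsRange = (f64, f64);

/// `regex` timings quoted in the commit message, per parser, at atom
/// lengths 3 and 2. A single quoted value is stored as (v, v).
struct Measurement {
    parser: &'static str,
    at_len_3: SecsRange,
    at_len_2: SecsRange,
}

fn quoted_regex_timings() -> [Measurement; 3] {
    [
        Measurement { parser: "device", at_len_3: (38.0, 38.0), at_len_2: (24.0, 24.0) },
        Measurement { parser: "user_agent", at_len_3: (9.70, 9.90), at_len_2: (10.40, 10.40) },
        Measurement { parser: "os", at_len_3: (7.10, 7.20), at_len_2: (7.60, 7.70) },
    ]
}

/// Assumed rule: pick 2 only when the slowest length-2 run beats the
/// fastest length-3 run; otherwise keep 3.
fn pick_atom_len(m: &Measurement) -> usize {
    if m.at_len_2.1 < m.at_len_3.0 { 2 } else { 3 }
}
```

Under that rule only the device parser would move to an atom length of 2, which matches the observation that UA and OS see no benefit.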

Given ua-parser/uap-rust#29 and ua-parser/uap-rust#31, the wording of the comparison needs to be updated to account for:

- The `regex` memory use being much improved.
- The `regex` runtime on devices being slightly improved, with the Python interface to `re2` not supporting custom atom lengths.

Closes ua-parser#264

masklinn mentioned this pull request May 11, 2025

Investigate updating the re2 matcher to use an atom length of 2 ua-parser/uap-python#263

Closed

masklinn enabled auto-merge (rebase) May 11, 2025 09:56

masklinn merged commit 18fab27 into ua-parser:main May 11, 2025
16 checks passed

masklinn deleted the reduce-atom-length branch May 11, 2025 09:57

This was referenced May 11, 2025

Update resolvers/regex doc section (once a new release of ua-parser-rs has been cut) ua-parser/uap-python#264

Closed

Benchmark suite(s) #34

Open

masklinn mentioned this pull request Jun 9, 2025

Update wording of resolvers guide ua-parser/uap-python#269

Merged
