feat: add pcre2 as optional feature by wheynelau · Pull Request #1959 · huggingface/tokenizers

wheynelau · 2026-03-02T17:37:41Z

Motivation: While profiling performance, I noticed onig showing up in the profiles and tried swapping it for pcre2. Happy to get some feedback; I'm not deeply familiar with the tradeoffs.

I have validated that all tests pass, and the benchmarks show that pcre2 is faster for the GPT2 and Llama3 models:

Benchmark	main (onig)	pcre2
bpe-encode/BPE GPT2 encode	1705.7±19.00ms 3.6 MB/sec	1422.8±10.89ms 4.3 MB/sec
llama3-encode/llama3-encode	1912.2±21.94ms 3.2 MB/sec	1601.6±5.81ms 3.9 MB/sec
bpe-encode/BPE GPT2 encode, no cache	2.5±0.04s 2.5 MB/sec	2.1±0.02s 2.9 MB/sec
llama3-encode/llama3-offsets	257.9±7.03ms 24.0 MB/sec	240.8±3.05ms 25.7 MB/sec
llama3-encode/llama3-batch	340.9±5.99ms 18.2 MB/sec	319.3±3.05ms 19.4 MB/sec
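To make the size of the difference easier to read, the relative time saved can be computed from the mean times in the table. The following is a minimal sketch, not part of the PR: the `Row` struct and `time_saved_pct` helper are hypothetical names introduced here, the "no cache" row is converted from seconds to milliseconds, and the ± intervals are ignored.

```rust
/// One benchmark row: name plus mean wall time per backend, in milliseconds.
/// (Hypothetical helper type, not taken from the tokenizers code base.)
struct Row {
    name: &'static str,
    onig_ms: f64,
    pcre2_ms: f64,
}

/// Mean times copied from the table above; the "no cache" row is
/// converted from seconds to milliseconds.
const ROWS: [Row; 5] = [
    Row { name: "BPE GPT2 encode", onig_ms: 1705.7, pcre2_ms: 1422.8 },
    Row { name: "llama3-encode", onig_ms: 1912.2, pcre2_ms: 1601.6 },
    Row { name: "BPE GPT2 encode, no cache", onig_ms: 2500.0, pcre2_ms: 2100.0 },
    Row { name: "llama3-offsets", onig_ms: 257.9, pcre2_ms: 240.8 },
    Row { name: "llama3-batch", onig_ms: 340.9, pcre2_ms: 319.3 },
];

/// Percentage of wall time saved by pcre2 relative to the onig baseline.
fn time_saved_pct(row: &Row) -> f64 {
    (row.onig_ms - row.pcre2_ms) / row.onig_ms * 100.0
}

fn main() {
    for row in ROWS.iter() {
        println!("{:<28} {:>5.1}% less time", row.name, time_saved_pct(row));
    }
}
```

On these means, the three encode benchmarks take roughly 16% less time with pcre2, while the offsets and batch benchmarks improve by about 6 to 7%.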

Commands used:

cargo bench --no-default-features --features onig,progressbar,esaxx_fast -- --save-baseline main
cargo bench --no-default-features --features pcre2,progressbar,esaxx_fast -- --save-baseline pcre2

Based on perf, these were the shares of CPU samples:

func	onig	pcre2
NormalizedString::split	5.58%	1.44%

feat: add pcre2 as feature

b7b337c

wheynelau wants to merge 1 commit into huggingface:main from wheynelau:perf-pcre2
