
perf(byte_level): port GPT-2 split regex to logos FSM (−22% on GPT-2 encode) #2031

Open
ArthurZucker wants to merge 5 commits into main from logos-bytelevel

Conversation

@ArthurZucker
Collaborator

Summary

Replaces the SysRegex (onig / fancy-regex) backing the GPT-2 split pattern in ByteLevel::pre_tokenize with a compile-time DFA generated by logos. Gated behind a new logos-pretok Cargo feature (default on); the legacy SysRegex path stays compiled under cfg(not(feature = "logos-pretok")), so downstream users can flip back with a single feature change during the bake-in period.

The regex being replaced (src/pre_tokenizers/byte_level.rs:43):

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

This pattern runs on every encode for GPT-2 and every other byte-level tokenizer, and it dispatches via FFI to onig (a C library) because of the (?!\S) negative lookahead and the \p{L}/\p{N} classes. Replacing it with a static DFA removes both the per-call regex-VM overhead and the FFI boundary.
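For intuition, here is a hand-rolled, pure-Rust sketch of the non-lookahead, non-contraction alternatives of this pattern (` ?\p{L}+`, ` ?\p{N}+`, ` ?[^\s\p{L}\p{N}]+`, `\s+`). It is illustrative only: the PR compiles the real pattern into a logos DFA, and the `\s+(?!\S)` branch is replayed in post-processing as described under Equivalence.

```rust
/// Illustrative only: split `text` the way the non-contraction,
/// non-lookahead alternatives of the GPT-2 pattern would, using
/// `char` class methods as an approximation of `\p{L}` / `\p{N}`.
fn gpt2_split(text: &str) -> Vec<(usize, usize)> {
    #[derive(PartialEq, Clone, Copy)]
    enum Class { Letter, Number, Space, Other }
    fn class(c: char) -> Class {
        if c.is_whitespace() { Class::Space }
        else if c.is_alphabetic() { Class::Letter }
        else if c.is_numeric() { Class::Number }
        else { Class::Other }
    }
    let mut spans = Vec::new();
    let mut iter = text.char_indices().peekable();
    while let Some((start, c)) = iter.next() {
        let mut cls = class(c);
        let mut end = start + c.len_utf8();
        // A single ASCII space may be glued onto the following content
        // span (the ` ?` prefix in the pattern).
        if c == ' ' {
            if let Some(&(_, next)) = iter.peek() {
                if class(next) != Class::Space {
                    cls = class(next);
                    let (i, n) = iter.next().unwrap();
                    end = i + n.len_utf8();
                }
            }
        }
        // Extend the run while the class stays the same.
        while let Some(&(i, n)) = iter.peek() {
            if class(n) != cls { break; }
            iter.next();
            end = i + n.len_utf8();
        }
        spans.push((start, end));
    }
    spans
}
```

On `"Hello world"` this yields `[(0, 5), (5, 11)]`, i.e. `Hello` and ` world`, the leading-space grouping GPT-2 relies on.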

Equivalence

Logos can't express lookahead, so the \s+(?!\S) semantics are replayed in a post-processing pass inside LogosByteLevel::find_matches:

  • A whitespace run of char-length ≥ 2 followed by a content token gets shrunk by one char, and that char is handed to the next span as its optional leading space (the ` ?` prefix).
  • Contractions are the subtle case: logos uses longest-match, so 't (2 chars) beats [^\s\p{L}\p{N}]+ (1 char) at the same position, whereas the legacy leftmost-first regex, after \s+(?!\S) backtracks, prefers the earlier [^...]+ alternative and consumes the space plus ' as a 2-char Other span. The post-processing pass detects Whitespace(≥2) → Contraction, splits the contraction into Other(') + Letters(rest), extends Other to claim the freed whitespace char, and merges Letters(rest) into the following contiguous Letters span when appropriate. See the inline comment in byte_level.rs and the big_txt_corpus equivalence test for how this was found.
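A minimal sketch of the whitespace half of that replay, under assumed names (`Kind` and `replay_lookahead` here are illustrative, not the actual byte_level.rs internals; spans are assumed to come straight from the lexer, so no two whitespace spans are adjacent):

```rust
use std::ops::Range;

/// Illustrative span classes; names do not match the crate's internals.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Whitespace, Letters, Numbers, Other, Contraction }

/// Replay `\s+(?!\S)`: a whitespace span of char-length >= 2 that is
/// immediately followed by a content span gives up its last char, which
/// becomes the optional leading space (` ?`) of that next span. A
/// trailing whitespace run (no follower) is left whole, like `\s+`.
fn replay_lookahead(text: &str, spans: &mut [(Kind, Range<usize>)]) {
    for i in 0..spans.len().saturating_sub(1) {
        let (kind, range) = spans[i].clone();
        if kind != Kind::Whitespace || spans[i + 1].0 == Kind::Whitespace {
            continue;
        }
        let ws = &text[range.clone()];
        if ws.chars().count() < 2 {
            continue;
        }
        // Hand the last whitespace char to the following span.
        let last = ws.chars().last().unwrap();
        let new_end = range.end - last.len_utf8();
        spans[i].1.end = new_end;
        spans[i + 1].1.start = new_end;
    }
}
```

The contraction repair described in the second bullet would run as a separate pass over the same span list.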

New integration test (tokenizers/tests/logos_bytelevel_equivalence.rs) drives both Pattern impls side-by-side:

  • Adversarial table: every contraction literal, trailing ws at EOF, multi-char ws runs that trigger the backtrack, Unicode letters/numbers/combining marks.
  • Corpus sweep: every line of data/big.txt (≈130k lines). Both produce byte-for-byte identical Vec<((usize, usize), bool)>.
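Schematically, the driver reduces to an assertion like the one below (with a local stand-in trait so the snippet is self-contained; the real test goes through the crate's `Pattern` trait):

```rust
/// Local stand-in for the crate's `Pattern` trait, which yields
/// `(byte_span, is_match)` pairs covering the whole input.
trait SplitPattern {
    fn find_matches(&self, inside: &str) -> Vec<((usize, usize), bool)>;
}

/// Assert the legacy and logos-backed implementations produce
/// byte-for-byte identical output on one input line.
fn assert_equivalent(legacy: &dyn SplitPattern, fast: &dyn SplitPattern, line: &str) {
    assert_eq!(
        legacy.find_matches(line),
        fast.find_matches(line),
        "split mismatch on {line:?}"
    );
}
```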

Benchmarks

cargo bench --bench bpe_benchmark -- 'bpe-encode' (GPT-2, the primary ByteLevel signal):

| Benchmark | main (bcdd25b9) | logos | Δ |
| --- | --- | --- | --- |
| BPE GPT2 encode | 1.608 s | 1.244 s | −22.6% |
| BPE GPT2 encode batch | 262.2 ms | 244.7 ms | −6.7% |
| BPE GPT2 encode, no cache | 1.958 s | 1.520 s | −22.4% |
| BPE GPT2 encode batch, no cache | 338.6 ms | 266.9 ms | −21.2% |

cargo bench --bench llama3_benchmark is flat, as expected: llama-3's serialized tokenizer uses a custom Split pre-tokenizer with a user-configured regex (dynamic, routed through SysRegex), not the hard-coded ByteLevel pattern. This PR does not touch that path.

Verification

cargo test --lib                                     # 200 passed (logos path, default)
cargo test --lib --no-default-features --features onig  # 200 passed (legacy path)
cargo test --test logos_bytelevel_equivalence        # 2 passed (big.txt sweep clean)
cargo bench --bench bpe_benchmark -- bpe-encode      # numbers above

Scope / caveats

  • ByteLevel only. The Whitespace pre-tokenizer's \w+|[^\w\s]+ was also considered, but it's already on the fast pure-Rust regex crate; no meaningful win to chase.
  • The logos-pretok feature is temporary. The intent is to remove it and the legacy path a release cycle after this lands, once no regressions surface. LogosByteLevel is exposed as #[doc(hidden)] pub purely so the equivalence integration test can drive it directly; it is not intended as a stable API.
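The rollback gate is plain cfg dispatch; a minimal sketch with made-up names (`split_backend` is not the crate's actual item):

```rust
// Illustrative cfg gate mirroring the rollback mechanism described
// above; the function name and bodies are invented for the sketch.
#[cfg(feature = "logos-pretok")]
fn split_backend() -> &'static str {
    "logos" // compile-time DFA path
}

#[cfg(not(feature = "logos-pretok"))]
fn split_backend() -> &'static str {
    "sys-regex" // legacy onig / fancy-regex path
}
```

Downstream crates flip paths with a feature change only, e.g. `cargo test --lib --no-default-features --features onig` as in the Verification section.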

🤖 Generated with Claude Code

ArthurZucker and others added 4 commits April 23, 2026 16:29
Adds logos 0.14 as a dependency and introduces a logos-pretok Cargo
feature (default on). Temporary flag so the next commit's ByteLevel
port can be toggled off for rollback. No behavior change yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the SysRegex (onig/fancy-regex) impl of the ByteLevel pre-
tokenizer split pattern with a compile-time DFA generated by `logos`.
Gated behind the `logos-pretok` feature (default on); the SysRegex
path stays compiled under `cfg(not(feature = "logos-pretok"))` for
one-release-cycle rollback.

Logos cannot express the `\s+(?!\S)` lookahead branch directly, so
the regex is expressed as plain `\s+` and the lookahead semantics
(backtrack one char when a multi-char whitespace run is followed by
a non-whitespace content token) are replayed as a post-processing
pass inside `LogosByteLevel::find_matches`. Existing byte_level unit
tests, including `handling_of_multiple_whitespaces` which exercises
this exact backtrack, pass on the new path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Logos uses longest-match so a contraction literal like `'t` (2 chars)
beats the single-char `[^\s\p{L}\p{N}]+` alternative at the same
position. The legacy leftmost-first regex, after `\s+(?!\S)` backtracks
by one char, matches the freed space + quote as a 2-char `Other` span
via its earlier-in-alternation `[^...]+` branch, leaving the
remaining letters for `\p{L}+`.

When the post-processing pass sees `Whitespace(≥2)` immediately
followed by `Contraction`, split the contraction into `Other(')` +
`Letters(rest)`, extend `Other` backward to absorb the freed ws char,
and merge `Letters(rest)` with a following contiguous `Letters` span
(skipping the merge when the next Letters span starts with a space,
which indicates a fresh ` ?\p{L}+` match rather than a continuation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drives both the legacy `SysRegex` (GPT-2 pattern verbatim) and the
logos-backed `LogosByteLevel` through the `Pattern` trait on identical
input and asserts the output is identical.

Two tests:
  - `adversarial_table`: hand-picked edge cases: every contraction,
    trailing whitespace at EOF, multi-char ws runs that trigger the
    `\s+(?!\S)` backtrack, Unicode letters/numbers, combining marks.
  - `big_txt_corpus`: sweeps `data/big.txt` (≈130k lines) line by line
    (skipped gracefully if the file isn't present). Catches everything
    the adversarial table misses, in particular the `  'tis` case
    that exposed the contraction-after-ws bug.

Gated behind `cfg(feature = "logos-pretok")` so it runs under the
default feature set. To exercise the SysRegex path alone:
  cargo test --no-default-features --features onig

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Extends the logos dispatch into the `Split` pre-tokenizer. When
`Split::new` sees the exact tiktoken cl100k_base regex string (used by
Llama-3, GPT-3.5/4-class tokenizers), it flags the pre-tokenizer to
dispatch through a compile-time DFA (`LogosCl100k`) instead of the
`SysRegex` path at encode time. Any other pattern string keeps the
existing `SysRegex` behavior unchanged.

The port is trickier than ByteLevel because:
  - Logos uses longest-match, legacy uses leftmost-first. At `'store`
    the legacy contraction literal matches `'s` (2 chars) but logos
    matches `'store` (6 chars via the `[^\r\n\p{L}\p{N}]?\p{L}+`
    Letters alternative).
    Fixed with a `split_letters_contractions` pre-pass that detects
    `Letters` spans starting with `'<contraction suffix>` (case-
    insensitive: s, t, m, d, re, ve, ll) and splits them into
    `Contraction` + `Letters(rest)`.
  - `\s+(?!\S)` lookahead: the same as ByteLevel's, replayed in
    `replay_lookahead`. But now the followers are broader: `Letters`
    (any non-newline/letter/number prefix), `Numbers` (no prefix;
    legacy splits ws into `ws(N-1)+ws(1)`), `Other`, `Contraction`.
    Each case is handled separately in the post-processing.
  - `\s*[\r\n]+` NewlineRun ties with `\s+` when the ws run ends in a
    newline. Gave NewlineRun `priority = 3` to match legacy's leftmost-
    first preference.

Integration test drives both `Pattern` impls side-by-side on the
adversarial table + full `data/big.txt` sweep (~130k lines). Byte-for-
byte identical output on both.

Benchmarks (llama3_benchmark, main bcdd25b vs this commit):
  llama3-encode        1.664 s  → 1.263 s  (-24.1%)
  llama3-batch         271.5 ms → 243.9 ms (-10.2%)
  concurrent-long-1t   21.9 ms  → 15.9 ms  (-27.4%)
  concurrent-long-2t   28.2 ms  → 22.1 ms  (-21.6%)
  concurrent-long-4t   30.7 ms  → 24.7 ms  (-19.6%)
  concurrent-long-8t   34.4 ms  → 25.6 ms  (-25.6%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ArthurZucker
Collaborator Author

Extension: logos fast-path for tiktoken cl100k_base (Llama-3, GPT-3.5/4-class)

Pushed ce093ab7, which extends the same logos dispatch idea into Split::pre_tokenize for the llama-3 / tiktoken cl100k_base pattern. When Split::new sees that exact pattern string, it flags the pre-tokenizer to go through a second compile-time DFA (LogosCl100k). Everything else keeps the existing SysRegex behavior.

Why it wasn't enough to just do ByteLevel

llama-3's serialized tokenizer is:

"pre_tokenizer": { "type": "Sequence", "pretokenizers": [
    { "type": "Split",     "pattern": { "Regex": "(?i:'s|...)..." }, ... },
    { "type": "ByteLevel", "use_regex": false, ... }
]}

So it bypasses the ByteLevel regex path entirely. All the real work happens in Split + SysRegex (onig/fancy-regex via FFI).

Trickier than ByteLevel

Two new things the post-processing has to handle:

  • Contraction priority across the whole input. Legacy's leftmost-first puts (?i:'s|...) before [^\r\n\p{L}\p{N}]?\p{L}+, so at 'store the contraction 's wins. Logos's longest-match prefers 'store (6 chars via Letters with ' prefix). A split_letters_contractions pre-pass detects Letters spans starting with '<contraction suffix> (case-insensitive s/t/m/d/re/ve/ll) and splits them.
  • Broader follower types for the \s+(?!\S) backtrack. Unlike ByteLevel's ' ?\p{L}+' (space-only prefix), tiktoken's Letters accepts any non-newline/non-letter/non-number prefix. And Numbers (\p{N}{1,3}) has no prefix at all: when ws(N≥2) is followed by Numbers, legacy splits the ws into ws(N-1) + ws(1). Each follower type gets its own branch in replay_lookahead.

Also: \s*[\r\n]+ NewlineRun ties in length with \s+ when the ws run ends in a newline; explicit priority = 3 puts NewlineRun ahead, matching legacy's leftmost-first.
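A minimal sketch of the contraction pre-pass check, under assumed names (`contraction_prefix_len` and its signature are illustrative, not the crate's code):

```rust
/// For the text of a Letters span that may begin with a contraction,
/// return the byte length of the `'<suffix>` prefix to split off, if
/// any. Two-letter suffixes are checked before single-letter ones so
/// `'re` in `'rest` wins, and the comparison is ASCII
/// case-insensitive, mirroring the `(?i:'s|'t|'re|'ve|'m|'ll|'d)`
/// alternation.
fn contraction_prefix_len(span_text: &str) -> Option<usize> {
    const SUFFIXES: [&str; 7] = ["re", "ve", "ll", "s", "t", "m", "d"];
    let rest = span_text.strip_prefix('\'')?;
    SUFFIXES
        .iter()
        .find(|suf| {
            rest.get(..suf.len())
                .map_or(false, |p| p.eq_ignore_ascii_case(suf))
        })
        .map(|suf| 1 + suf.len())
}
```

`contraction_prefix_len("'store")` returns `Some(2)` (the `'s`), matching the legacy leftmost-first behavior described above; the caller would split the span there and re-tag the remainder as Letters.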

Equivalence

Integration test tests/logos_cl100k_equivalence.rs drives both Pattern impls on the same input: adversarial table (contractions, newlines, prefix chars, the exact 's / 'tis / 'store edge cases) + full data/big.txt sweep (~130k lines). Byte-for-byte identical output on both.

Benchmarks: cargo bench --bench llama3_benchmark (main bcdd25b9 vs ce093ab7)

| Benchmark | main | logos | Δ |
| --- | --- | --- | --- |
| llama3-offsets | 232.9 ms | 216.8 ms | −6.9% |
| llama3-encode | 1.664 s | 1.263 s | −24.1% |
| llama3-batch | 271.5 ms | 243.9 ms | −10.2% |
| concurrent-long-1t | 21.9 ms | 15.9 ms | −27.4% |
| concurrent-long-2t | 28.2 ms | 22.1 ms | −21.6% |
| concurrent-long-4t | 30.7 ms | 24.7 ms | −19.6% |
| concurrent-long-8t | 34.4 ms | 25.6 ms | −25.6% |
| BPE Train (big) | 1.124 s | 1.082 s | −3.7% |

On top of the −22% GPT-2 numbers from the ByteLevel commit, this gives comparable wins for the whole tiktoken/cl100k family: GPT-3.5, GPT-4, Llama-3, and any downstream tokenizer that uses the same pattern string.

What it doesn't cover

  • Other tiktoken patterns (o200k_base for GPT-4o, r50k_base for older GPT-3). Only the exact cl100k_base string matches. The same extension pattern would apply (a separate logos enum plus a pattern-string check), but I'm not adding them here without a clear user.
  • User-modified regex strings (even a single whitespace difference from CL100K_PATTERN falls back to SysRegex). By design: the fast path is byte-equivalence-verified against the exact string only.

🤖 Generated with Claude Code
