
perf(byte_level): port GPT-2 split regex to logos FSM (−22% on GPT-2 encode) #2031

Open
ArthurZucker wants to merge 5 commits into main from logos-bytelevel

Conversation

@ArthurZucker
Collaborator

Summary

Replaces the SysRegex (onig / fancy-regex) backing the GPT-2 split pattern in ByteLevel::pre_tokenize with a compile-time DFA generated by logos. Gated behind a new logos-pretok Cargo feature (default on); the legacy SysRegex path stays compiled under cfg(not(feature = "logos-pretok")), so downstream users can flip back with a single feature change during the bake-in period.

The regex being replaced (src/pre_tokenizers/byte_level.rs:43):

's|'t|'re|'ve|'m|'ll|'d| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+

This pattern runs on every encode for GPT-2 and every other byte-level tokenizer, and it dispatches via FFI to onig (a C library) because of the (?!\S) negative lookahead and the \p{L}/\p{N} classes. Replacing it with a static DFA removes both the per-call regex-VM overhead and the FFI boundary.
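For intuition, here is a hand-rolled, pure-Rust sketch of the non-lookahead, non-contraction alternatives of this pattern (` ?\p{L}+`, ` ?\p{N}+`, ` ?[^\s\p{L}\p{N}]+`, `\s+`). It is illustrative only: the PR compiles the real pattern into a logos DFA, and the `\s+(?!\S)` branch is replayed in post-processing as described under Equivalence.

```rust
/// Illustrative only: split `text` the way the non-contraction,
/// non-lookahead alternatives of the GPT-2 pattern would, using
/// `char` class methods as an approximation of `\p{L}` / `\p{N}`.
fn gpt2_split(text: &str) -> Vec<(usize, usize)> {
    #[derive(PartialEq, Clone, Copy)]
    enum Class { Letter, Number, Space, Other }
    fn class(c: char) -> Class {
        if c.is_whitespace() { Class::Space }
        else if c.is_alphabetic() { Class::Letter }
        else if c.is_numeric() { Class::Number }
        else { Class::Other }
    }
    let mut spans = Vec::new();
    let mut iter = text.char_indices().peekable();
    while let Some((start, c)) = iter.next() {
        let mut cls = class(c);
        let mut end = start + c.len_utf8();
        // A single ASCII space may be glued onto the following content
        // span (the ` ?` prefix in the pattern).
        if c == ' ' {
            if let Some(&(_, next)) = iter.peek() {
                if class(next) != Class::Space {
                    cls = class(next);
                    let (i, n) = iter.next().unwrap();
                    end = i + n.len_utf8();
                }
            }
        }
        // Extend the run while the class stays the same.
        while let Some(&(i, n)) = iter.peek() {
            if class(n) != cls { break; }
            iter.next();
            end = i + n.len_utf8();
        }
        spans.push((start, end));
    }
    spans
}
```

On `"Hello world"` this yields `[(0, 5), (5, 11)]`, i.e. `Hello` and ` world`, the leading-space grouping GPT-2 relies on.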

Equivalence

Logos can't express lookahead, so the \s+(?!\S) semantics are replayed in a post-processing pass inside LogosByteLevel::find_matches:

  • A whitespace run of char-length ≥ 2 followed by a content token gets shrunk by one char, and that char is handed to the next span as its optional leading space (the ` ?` prefix).
  • Contractions are the subtle case: logos uses longest-match, so 't (2 chars) beats [^\s\p{L}\p{N}]+ (1 char) at the same position, whereas the legacy leftmost-first regex, after \s+(?!\S) backtracks, prefers the earlier [^...]+ alternative and consumes the space plus ' as a 2-char Other span. The post-processing pass detects Whitespace(≥2) → Contraction, splits the contraction into Other(') + Letters(rest), extends Other to claim the freed whitespace char, and merges Letters(rest) into the following contiguous Letters span when appropriate. See the inline comment in byte_level.rs and the big_txt_corpus equivalence test for how this was found.
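A minimal sketch of the whitespace half of that replay, under assumed names (`Kind` and `replay_lookahead` here are illustrative, not the actual byte_level.rs internals; spans are assumed to come straight from the lexer, so no two whitespace spans are adjacent):

```rust
use std::ops::Range;

/// Illustrative span classes; names do not match the crate's internals.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Kind { Whitespace, Letters, Numbers, Other, Contraction }

/// Replay `\s+(?!\S)`: a whitespace span of char-length >= 2 that is
/// immediately followed by a content span gives up its last char, which
/// becomes the optional leading space (` ?`) of that next span. A
/// trailing whitespace run (no follower) is left whole, like `\s+`.
fn replay_lookahead(text: &str, spans: &mut [(Kind, Range<usize>)]) {
    for i in 0..spans.len().saturating_sub(1) {
        let (kind, range) = spans[i].clone();
        if kind != Kind::Whitespace || spans[i + 1].0 == Kind::Whitespace {
            continue;
        }
        let ws = &text[range.clone()];
        if ws.chars().count() < 2 {
            continue;
        }
        // Hand the last whitespace char to the following span.
        let last = ws.chars().last().unwrap();
        let new_end = range.end - last.len_utf8();
        spans[i].1.end = new_end;
        spans[i + 1].1.start = new_end;
    }
}
```

The contraction repair described in the second bullet would run as a separate pass over the same span list.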

New integration test (tokenizers/tests/logos_bytelevel_equivalence.rs) drives both Pattern impls side-by-side:

  • Adversarial table: every contraction literal, trailing ws at EOF, multi-char ws runs that trigger the backtrack, Unicode letters/numbers/combining marks.
  • Corpus sweep: every line of data/big.txt (≈130k lines). Both produce byte-for-byte identical Vec<((usize, usize), bool)>.
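Schematically, the driver reduces to an assertion like the one below (with a local stand-in trait so the snippet is self-contained; the real test goes through the crate's `Pattern` trait):

```rust
/// Local stand-in for the crate's `Pattern` trait, which yields
/// `(byte_span, is_match)` pairs covering the whole input.
trait SplitPattern {
    fn find_matches(&self, inside: &str) -> Vec<((usize, usize), bool)>;
}

/// Assert the legacy and logos-backed implementations produce
/// byte-for-byte identical output on one input line.
fn assert_equivalent(legacy: &dyn SplitPattern, fast: &dyn SplitPattern, line: &str) {
    assert_eq!(
        legacy.find_matches(line),
        fast.find_matches(line),
        "split mismatch on {line:?}"
    );
}
```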

Benchmarks

cargo bench --bench bpe_benchmark -- 'bpe-encode' (GPT-2, the primary ByteLevel signal):

| Benchmark | main (bcdd25b9) | logos | Δ |
| --- | --- | --- | --- |
| BPE GPT2 encode | 1.608 s | 1.244 s | −22.6% |
| BPE GPT2 encode batch | 262.2 ms | 244.7 ms | −6.7% |
| BPE GPT2 encode, no cache | 1.958 s | 1.520 s | −22.4% |
| BPE GPT2 encode batch, no cache | 338.6 ms | 266.9 ms | −21.2% |

cargo bench --bench llama3_benchmark is flat, as expected: llama-3's serialized tokenizer uses a custom Split pre-tokenizer with a user-configured regex (dynamic, routed through SysRegex), not the hard-coded ByteLevel pattern. This PR does not touch that path.

Verification

cargo test --lib                                     # 200 passed (logos path, default)
cargo test --lib --no-default-features --features onig  # 200 passed (legacy path)
cargo test --test logos_bytelevel_equivalence        # 2 passed (big.txt sweep clean)
cargo bench --bench bpe_benchmark -- bpe-encode      # numbers above

Scope / caveats

  • ByteLevel only. The Whitespace pre-tokenizer's \w+|[^\w\s]+ was also considered, but it's already on the fast pure-Rust regex crate; no meaningful win to chase.
  • The logos-pretok feature is temporary. The intent is to remove it and the legacy path a release cycle after this lands, once no regressions surface. LogosByteLevel is exposed as #[doc(hidden)] pub purely so the equivalence integration test can drive it directly; it is not intended as a stable API.
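The rollback gate is plain cfg dispatch; a minimal sketch with made-up names (`split_backend` is not the crate's actual item):

```rust
// Illustrative cfg gate mirroring the rollback mechanism described
// above; the function name and bodies are invented for the sketch.
#[cfg(feature = "logos-pretok")]
fn split_backend() -> &'static str {
    "logos" // compile-time DFA path
}

#[cfg(not(feature = "logos-pretok"))]
fn split_backend() -> &'static str {
    "sys-regex" // legacy onig / fancy-regex path
}
```

Downstream crates flip paths with a feature change only, e.g. `cargo test --lib --no-default-features --features onig` as in the Verification section.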

🤖 Generated with Claude Code

ArthurZucker and others added 4 commits April 23, 2026 16:29
Adds logos 0.14 as a dependency and introduces a logos-pretok Cargo
feature (default on). Temporary flag so the next commit's ByteLevel
port can be toggled off for rollback. No behavior change yet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the SysRegex (onig/fancy-regex) impl of the ByteLevel pre-
tokenizer split pattern with a compile-time DFA generated by `logos`.
Gated behind the `logos-pretok` feature (default on); the SysRegex
path stays compiled under `cfg(not(feature = "logos-pretok"))` for
one-release-cycle rollback.

Logos cannot express the `\s+(?!\S)` lookahead branch directly, so
the regex is expressed as plain `\s+` and the lookahead semantics
(backtrack one char when a multi-char whitespace run is followed by
a non-whitespace content token) are replayed as a post-processing
pass inside `LogosByteLevel::find_matches`. Existing byte_level unit
tests, including `handling_of_multiple_whitespaces` which exercises
this exact backtrack, pass on the new path unchanged.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Logos uses longest-match so a contraction literal like `'t` (2 chars)
beats the single-char `[^\s\p{L}\p{N}]+` alternative at the same
position. The legacy leftmost-first regex, after `\s+(?!\S)` backtracks
by one char, matches the freed space + quote as a 2-char `Other` span
via its earlier-in-alternation `[^...]+` branch, leaving the
remaining letters for `\p{L}+`.

When the post-processing pass sees `Whitespace(≥2)` immediately
followed by `Contraction`, split the contraction into `Other(')` +
`Letters(rest)`, extend `Other` backward to absorb the freed ws char,
and merge `Letters(rest)` with a following contiguous `Letters` span
(skipping the merge when the next Letters span starts with a space,
which indicates a fresh ` ?\p{L}+` match rather than a continuation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drives both the legacy `SysRegex` (GPT-2 pattern verbatim) and the
logos-backed `LogosByteLevel` through the `Pattern` trait on identical
input and asserts the output is identical.

Two tests:
  - `adversarial_table`: hand-picked edge cases: every contraction,
    trailing whitespace at EOF, multi-char ws runs that trigger the
    `\s+(?!\S)` backtrack, Unicode letters/numbers, combining marks.
  - `big_txt_corpus`: sweeps `data/big.txt` (≈130k lines) line by line
    (skipped gracefully if the file isn't present). Catches everything
    the adversarial table misses, in particular the `  'tis` case
    that exposed the contraction-after-ws bug.

Gated behind `cfg(feature = "logos-pretok")` so it runs under the
default feature set. To exercise the SysRegex path alone:
  cargo test --no-default-features --features onig

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Extends the logos dispatch into the `Split` pre-tokenizer. When
`Split::new` sees the exact tiktoken cl100k_base regex string (used by
Llama-3, GPT-3.5/4-class tokenizers), it flags the pre-tokenizer to
dispatch through a compile-time DFA (`LogosCl100k`) instead of the
`SysRegex` path at encode time. Any other pattern string keeps the
existing `SysRegex` behavior unchanged.

The port is trickier than ByteLevel because:
  - Logos uses longest-match, legacy uses leftmost-first. At `'store`
    the legacy contraction literal matches `'s` (2 chars) but logos
    matches `'store` (6 chars via the `[^\r\n\p{L}\p{N}]?\p{L}+`
    Letters alternative).
    Fixed with a `split_letters_contractions` pre-pass that detects
    `Letters` spans starting with `'<contraction suffix>` (case-
    insensitive: s, t, m, d, re, ve, ll) and splits them into
    `Contraction` + `Letters(rest)`.
  - `\s+(?!\S)` lookahead: the same as ByteLevel's, replayed in
    `replay_lookahead`. But now the followers are broader: `Letters`
    (any non-newline/letter/number prefix), `Numbers` (no prefix;
    legacy splits ws into `ws(N-1)+ws(1)`), `Other`, `Contraction`.
    Each case is handled separately in the post-processing.
  - `\s*[\r\n]+` NewlineRun ties with `\s+` when the ws run ends in a
    newline. Gave NewlineRun `priority = 3` to match legacy's leftmost-
    first preference.

Integration test drives both `Pattern` impls side-by-side on the
adversarial table + full `data/big.txt` sweep (~130k lines). Byte-for-
byte identical output on both.

Benchmarks (llama3_benchmark, main bcdd25b vs this commit):
  llama3-encode        1.664 s  → 1.263 s  (-24.1%)
  llama3-batch         271.5 ms → 243.9 ms (-10.2%)
  concurrent-long-1t   21.9 ms  → 15.9 ms  (-27.4%)
  concurrent-long-2t   28.2 ms  → 22.1 ms  (-21.6%)
  concurrent-long-4t   30.7 ms  → 24.7 ms  (-19.6%)
  concurrent-long-8t   34.4 ms  → 25.6 ms  (-25.6%)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@ArthurZucker
Collaborator Author

Extension: logos fast-path for tiktoken cl100k_base (Llama-3, GPT-3.5/4-class)

Pushed ce093ab7, which extends the same logos dispatch idea into Split::pre_tokenize for the llama-3 / tiktoken cl100k_base pattern. When Split::new sees that exact pattern string, it flags the pre-tokenizer to go through a second compile-time DFA (LogosCl100k). Everything else keeps the existing SysRegex behavior.

Why it wasn't enough to just do ByteLevel

llama-3's serialized tokenizer is:

"pre_tokenizer": { "type": "Sequence", "pretokenizers": [
    { "type": "Split",     "pattern": { "Regex": "(?i:'s|...)..." }, ... },
    { "type": "ByteLevel", "use_regex": false, ... }
]}

So it bypasses the ByteLevel regex path entirely. All the real work happens in Split + SysRegex (onig/fancy-regex via FFI).

Trickier than ByteLevel

Two new things the post-processing has to handle:

  • Contraction priority across the whole input. Legacy's leftmost-first puts (?i:'s|...) before [^\r\n\p{L}\p{N}]?\p{L}+, so at 'store the contraction 's wins. Logos's longest-match prefers 'store (6 chars via Letters with ' prefix). A split_letters_contractions pre-pass detects Letters spans starting with '<contraction suffix> (case-insensitive s/t/m/d/re/ve/ll) and splits them.
  • Broader follower types for the \s+(?!\S) backtrack. Unlike ByteLevel's ' ?\p{L}+' (space-only prefix), tiktoken's Letters accepts any non-newline/non-letter/non-number prefix. And Numbers (\p{N}{1,3}) has no prefix at all: when ws(N≥2) is followed by Numbers, legacy splits the ws into ws(N-1) + ws(1). Each follower type gets its own branch in replay_lookahead.

Also: \s*[\r\n]+ NewlineRun ties in length with \s+ when the ws run ends in a newline; explicit priority = 3 puts NewlineRun ahead, matching legacy's leftmost-first.
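A minimal sketch of the contraction pre-pass check, under assumed names (`contraction_prefix_len` and its signature are illustrative, not the crate's code):

```rust
/// For the text of a Letters span that may begin with a contraction,
/// return the byte length of the `'<suffix>` prefix to split off, if
/// any. Two-letter suffixes are checked before single-letter ones so
/// `'re` in `'rest` wins, and the comparison is ASCII
/// case-insensitive, mirroring the `(?i:'s|'t|'re|'ve|'m|'ll|'d)`
/// alternation.
fn contraction_prefix_len(span_text: &str) -> Option<usize> {
    const SUFFIXES: [&str; 7] = ["re", "ve", "ll", "s", "t", "m", "d"];
    let rest = span_text.strip_prefix('\'')?;
    SUFFIXES
        .iter()
        .find(|suf| {
            rest.get(..suf.len())
                .map_or(false, |p| p.eq_ignore_ascii_case(suf))
        })
        .map(|suf| 1 + suf.len())
}
```

`contraction_prefix_len("'store")` returns `Some(2)` (the `'s`), matching the legacy leftmost-first behavior described above; the caller would split the span there and re-tag the remainder as Letters.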

Equivalence

Integration test tests/logos_cl100k_equivalence.rs drives both Pattern impls on the same input: adversarial table (contractions, newlines, prefix chars, the exact 's / 'tis / 'store edge cases) + full data/big.txt sweep (~130k lines). Byte-for-byte identical output on both.

Benchmarks: cargo bench --bench llama3_benchmark (main bcdd25b9 vs ce093ab7)

| Benchmark | main | logos | Δ |
| --- | --- | --- | --- |
| llama3-offsets | 232.9 ms | 216.8 ms | −6.9% |
| llama3-encode | 1.664 s | 1.263 s | −24.1% |
| llama3-batch | 271.5 ms | 243.9 ms | −10.2% |
| concurrent-long-1t | 21.9 ms | 15.9 ms | −27.4% |
| concurrent-long-2t | 28.2 ms | 22.1 ms | −21.6% |
| concurrent-long-4t | 30.7 ms | 24.7 ms | −19.6% |
| concurrent-long-8t | 34.4 ms | 25.6 ms | −25.6% |
| BPE Train (big) | 1.124 s | 1.082 s | −3.7% |

On top of the −22% GPT-2 numbers from the ByteLevel commit, this gives comparable wins for the whole tiktoken/cl100k family: GPT-3.5, GPT-4, Llama-3, and any downstream tokenizer that uses the same pattern string.

What it doesn't cover

  • Other tiktoken patterns (o200k_base for GPT-4o, r50k_base for older GPT-3). Only the exact cl100k_base string matches. The same extension pattern would apply (a separate logos enum plus a pattern-string check), but I'm not adding them here without a clear user.
  • User-modified regex strings (even a single whitespace difference from CL100K_PATTERN falls back to SysRegex). By design: the fast path is byte-equivalence-verified against the exact string only.

🤖 Generated with Claude Code
