perf(byte_level): port GPT-2 split regex to logos FSM (−22% on GPT-2 encode) #2031
ArthurZucker wants to merge 5 commits into main
Conversation
Adds logos 0.14 as a dependency and introduces a logos-pretok Cargo feature (default on). Temporary flag so the next commit's ByteLevel port can be toggled off for rollback. No behavior change yet. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the SysRegex (onig/fancy-regex) impl of the ByteLevel pre-tokenizer split pattern with a compile-time DFA generated by `logos`. Gated behind the `logos-pretok` feature (default on); the SysRegex path stays compiled under `cfg(not(feature = "logos-pretok"))` for one-release-cycle rollback. Logos cannot express the `\s+(?!\S)` lookahead branch directly, so the regex is expressed as plain `\s+` and the lookahead semantics (backtrack one char when a multi-char whitespace run is followed by a non-whitespace content token) are replayed as a post-processing pass inside `LogosByteLevel::find_matches`. Existing byte_level unit tests, including `handling_of_multiple_whitespaces` which exercises this exact backtrack, pass on the new path unchanged. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
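The lookahead replay described in this commit can be sketched in pure std Rust. `Tok` and this `replay_lookahead` are illustrative stand-ins, not the actual types in the PR, and the real pass works over byte spans rather than owned strings:

```rust
// Sketch of the `\s+(?!\S)` replay: the DFA matches plain `\s+`, so a
// multi-char whitespace run that is immediately followed by more content
// must give back its last whitespace char, which the legacy lookahead
// left for the following ` ?`-prefixed token.
#[derive(Debug, Clone, PartialEq)]
enum Tok {
    Whitespace(String),
    Content(String), // stands in for Letters/Numbers/Other/Contraction
}

fn replay_lookahead(toks: Vec<Tok>) -> Vec<Tok> {
    let mut out = Vec::new();
    let mut iter = toks.into_iter().peekable();
    while let Some(tok) = iter.next() {
        match tok {
            Tok::Whitespace(ws) if ws.chars().count() > 1 => {
                // Only backtrack when a content token follows the run.
                if let Some(Tok::Content(next)) = iter.peek().cloned() {
                    let mut chars: Vec<char> = ws.chars().collect();
                    let freed = chars.pop().unwrap();
                    out.push(Tok::Whitespace(chars.into_iter().collect()));
                    iter.next(); // consume the peeked content token
                    // The freed char becomes the next token's leading space.
                    out.push(Tok::Content(format!("{freed}{next}")));
                } else {
                    // A run at EOF stays whole, as with the legacy regex.
                    out.push(Tok::Whitespace(ws));
                }
            }
            other => out.push(other),
        }
    }
    out
}
```

This is only the ByteLevel shape of the replay; the cl100k port later in the PR broadens the set of followers.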
Logos uses longest-match so a contraction literal like `'t` (2 chars)
beats the single-char `[^\s\p{L}\p{N}]+` alternative at the same
position. Legacy's leftmost-first regex, after `\s+(?!\S)` backtracks
by one char, matches the freed space + quote as a 2-char `Other` span
via its earlier-in-alternation `[^...]+` branch, leaving the
remaining letters for `\p{L}+`.
When the post-processing pass sees `Whitespace(≥2)` immediately
followed by `Contraction`, split the contraction into `Other(')` +
`Letters(rest)`, extend `Other` backward to absorb the freed ws char,
and merge `Letters(rest)` with a following contiguous `Letters` span
(skipping the merge when the next Letters span starts with a space,
which indicates a fresh ` ?\p{L}+` match rather than a continuation).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
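A minimal sketch of the split/extend/merge repair described above, with hypothetical `Span` and `fix_contraction_after_ws` names and owned strings standing in for the real span offsets:

```rust
// Sketch of the contraction-after-whitespace repair: Whitespace(>=2)
// followed by Contraction becomes Whitespace(n-1) + Other(freed ws + ')
// + Letters(rest), merging a directly following Letters span unless it
// starts with a space (a fresh ` ?\p{L}+` match).
#[derive(Debug, Clone, PartialEq)]
enum Span {
    Whitespace(String),
    Contraction(String), // e.g. "'t"
    Letters(String),
    Other(String),
}

fn fix_contraction_after_ws(spans: Vec<Span>) -> Vec<Span> {
    let mut out: Vec<Span> = Vec::new();
    let mut i = 0;
    while i < spans.len() {
        match (&spans[i], spans.get(i + 1)) {
            (Span::Whitespace(ws), Some(Span::Contraction(c)))
                if ws.chars().count() >= 2 =>
            {
                // Give back the run's last whitespace char ...
                let mut chars: Vec<char> = ws.chars().collect();
                let freed = chars.pop().unwrap();
                out.push(Span::Whitespace(chars.into_iter().collect()));
                // ... absorb it into an Other span with the quote ...
                let quote = c.chars().next().unwrap();
                out.push(Span::Other(format!("{freed}{quote}")));
                // ... and re-emit the contraction letters, merged with
                // the next Letters span when it is a continuation.
                let mut rest: String = c.chars().skip(1).collect();
                i += 2;
                if let Some(Span::Letters(next)) = spans.get(i) {
                    if !next.starts_with(' ') {
                        rest.push_str(next);
                        i += 1;
                    }
                }
                out.push(Span::Letters(rest));
            }
            _ => {
                out.push(spans[i].clone());
                i += 1;
            }
        }
    }
    out
}
```

On the ` 'tis` case from the tests, this turns `[Whitespace("  "), Contraction("'t"), Letters("is")]` into `[Whitespace(" "), Other(" '"), Letters("tis")]`, matching the legacy output described above.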
Drives both the legacy `SysRegex` (GPT-2 pattern verbatim) and the
logos-backed `LogosByteLevel` through the `Pattern` trait on identical
input and asserts the output is identical.
Two tests:
- `adversarial_table`: hand-picked edge cases: every contraction,
  trailing whitespace at EOF, multi-char ws runs that trigger the
  `\s+(?!\S)` backtrack, Unicode letters/numbers, combining marks.
- `big_txt_corpus`: sweeps `data/big.txt` (≈130k lines) line by line
  (skipped gracefully if the file isn't present). Catches everything
  the adversarial table misses, in particular the ` 'tis` case
  that exposed the contraction-after-ws bug.
Gated behind `cfg(feature = "logos-pretok")` so it runs under the
default feature set. To exercise the SysRegex path alone:
cargo test --no-default-features --features onig
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
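The side-by-side shape of these tests can be sketched with two trivial stand-in implementations. The real tests drive `SysRegex` and `LogosByteLevel` through the `Pattern` trait, but the differential pattern is the same: run both on identical input, assert byte-identical spans.

```rust
// Two independent ways to compute word spans (byte offsets); any
// divergence is a bug in one of them. Stand-ins for the real
// SysRegex vs logos comparison.
fn spans_loop(s: &str) -> Vec<(usize, usize)> {
    let mut out = Vec::new();
    let mut start = None;
    for (i, c) in s.char_indices() {
        match (c.is_whitespace(), start) {
            (false, None) => start = Some(i),
            (true, Some(st)) => {
                out.push((st, i));
                start = None;
            }
            _ => {}
        }
    }
    if let Some(st) = start {
        out.push((st, s.len()));
    }
    out
}

fn spans_iter(s: &str) -> Vec<(usize, usize)> {
    // Recover byte offsets of the subslices split_whitespace returns.
    s.split_whitespace()
        .map(|w| {
            let st = w.as_ptr() as usize - s.as_ptr() as usize;
            (st, st + w.len())
        })
        .collect()
}

fn assert_equivalent(table: &[&str]) {
    for case in table {
        assert_eq!(spans_loop(case), spans_iter(case), "mismatch on {case:?}");
    }
}
```

The corpus sweep in the PR is the same loop over `data/big.txt` lines instead of a hand-picked table.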
|
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update. |
Extends the logos dispatch into the `Split` pre-tokenizer. When
`Split::new` sees the exact tiktoken cl100k_base regex string (used by
Llama-3, GPT-3.5/4-class tokenizers), it flags the pre-tokenizer to
dispatch through a compile-time DFA (`LogosCl100k`) instead of the
`SysRegex` path at encode time. Any other pattern string keeps the
existing `SysRegex` behavior unchanged.
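The dispatch amounts to a plain string comparison; `SplitBackend`, `choose_backend`, and the abbreviated `CL100K_PATTERN` literal below are illustrative names, not the actual ones in the PR.

```rust
// CL100K_PATTERN is abbreviated here; the real check compares against
// the full tiktoken cl100k_base pattern string, byte for byte.
const CL100K_PATTERN: &str = "(?i:'s|...)...";

#[derive(Debug, PartialEq)]
enum SplitBackend {
    Logos,            // compile-time DFA fast path (LogosCl100k)
    SysRegex(String), // general engine for every other pattern
}

fn choose_backend(pattern: &str) -> SplitBackend {
    if pattern == CL100K_PATTERN {
        SplitBackend::Logos
    } else {
        SplitBackend::SysRegex(pattern.to_string())
    }
}
```

Exact equality is the point: any deviation from the verified string, even whitespace, keeps the serialized behavior on the engine it was verified against.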
The port is trickier than ByteLevel because:
- Logos uses longest-match, legacy uses leftmost-first. At `'store`
the literal matches `'s` (2 chars) but logos matches `'store` (6
chars via the `[^\r\n\p{L}\p{N}]?\p{L}+` Letters alternative).
Fixed with a `split_letters_contractions` pre-pass that detects
  `Letters` spans starting with `'<contraction suffix>`
  (case-insensitive: s, t, m, d, re, ve, ll) and splits them into
`Contraction` + `Letters(rest)`.
- `\s+(?!\S)` lookahead same as ByteLevel, replayed in
`replay_lookahead`. But now the followers are broader: `Letters`
  (any non-newline/letter/number prefix), `Numbers` (no prefix;
legacy splits ws into `ws(N-1)+ws(1)`), `Other`, `Contraction`.
Each case handled separately in the post-processing.
- `\s*[\r\n]+` NewlineRun ties with `\s+` when the ws run ends in a
  newline. Gave NewlineRun `priority = 3` to match legacy's
  leftmost-first preference.
Integration test drives both `Pattern` impls side-by-side on the
adversarial table + full `data/big.txt` sweep (~130k lines).
Byte-for-byte identical output on both.
Benchmarks (llama3_benchmark, main bcdd25b vs this commit):
  llama3-encode       1.664 s  → 1.263 s  (-24.1%)
  llama3-batch        271.5 ms → 243.9 ms (-10.2%)
  concurrent-long-1t  21.9 ms  → 15.9 ms  (-27.4%)
  concurrent-long-2t  28.2 ms  → 22.1 ms  (-21.6%)
  concurrent-long-4t  30.7 ms  → 24.7 ms  (-19.6%)
  concurrent-long-8t  34.4 ms  → 25.6 ms  (-25.6%)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
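The `split_letters_contractions` pre-pass can be sketched like this (hypothetical signature working on one span's text; only the suffix list and the case-insensitive match come from the commit text, the real pass operates on spans):

```rust
// Cut a longest-match Letters span like "'store" back into the
// contraction literal plus the remaining letters, mirroring the
// legacy leftmost-first alternation that matches "'s" first.
fn split_letters_contraction(span: &str) -> Option<(String, String)> {
    // Suffix list from the commit message; matched like (?i:...).
    const SUFFIXES: [&str; 7] = ["s", "t", "m", "d", "re", "ve", "ll"];
    let rest = span.strip_prefix('\'')?;
    for suf in SUFFIXES {
        if let Some(head) = rest.get(..suf.len()) {
            // Only split when letters remain after the suffix; an
            // equal-length span is just the contraction itself.
            if head.eq_ignore_ascii_case(suf) && rest.len() > suf.len() {
                return Some((format!("'{head}"), rest[suf.len()..].to_string()));
            }
        }
    }
    None
}
```

So `'store` becomes `('s, tore)` while spans with no contraction prefix pass through untouched.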
Extension: logos fast-path for tiktoken cl100k_base (Llama-3, GPT-3.5/4-class)

Why it wasn't enough to just do ByteLevel

llama-3's serialized tokenizer is:

    "pre_tokenizer": {
      "type": "Sequence",
      "pretokenizers": [
        { "type": "Split", "pattern": { "Regex": "(?i:'s|...)..." }, ... },
        { "type": "ByteLevel", "use_regex": false, ... }
      ]
    }

So it bypasses the hard-coded ByteLevel pattern.

Trickier than ByteLevel

Two new things the post-processing has to handle: the longest-match contraction split (`split_letters_contractions`) and the broader `\s+(?!\S)` follower set (`replay_lookahead`). Also: the NewlineRun priority tie-break.

Equivalence

Integration test drives both `Pattern` impls side-by-side on the adversarial table plus the full `data/big.txt` sweep; byte-for-byte identical output.

Benchmarks:
| Benchmark | main | logos | Δ |
|---|---|---|---|
| llama3-offsets | 232.9 ms | 216.8 ms | −6.9% |
| llama3-encode | 1.664 s | 1.263 s | −24.1% |
| llama3-batch | 271.5 ms | 243.9 ms | −10.2% |
| concurrent-long-1t | 21.9 ms | 15.9 ms | −27.4% |
| concurrent-long-2t | 28.2 ms | 22.1 ms | −21.6% |
| concurrent-long-4t | 30.7 ms | 24.7 ms | −19.6% |
| concurrent-long-8t | 34.4 ms | 25.6 ms | −25.6% |
| BPE Train (big) | 1.124 s | 1.082 s | −3.7% |
On top of the −22% GPT-2 numbers from the ByteLevel commit, this gives comparable wins for the whole tiktoken/cl100k family: GPT-3.5, GPT-4, Llama-3, and any downstream tokenizer that uses the same pattern string.
What it doesn't cover
- Other tiktoken patterns (o200k_base for GPT-4o, r50k_base for older GPT-3). Only the exact cl100k_base string matches. The same extension pattern would apply (a separate logos enum + pattern-string check), but I'm not adding them here without a clear user.
- User-modified regex strings (even a single whitespace difference from CL100K_PATTERN falls back to SysRegex). By design: the fast-path is byte-equivalence-verified against the exact string only.
🤖 Generated with Claude Code
Summary

Replaces the `SysRegex` (onig / fancy-regex) backing the GPT-2 split pattern in `ByteLevel::pre_tokenize` with a compile-time DFA generated by `logos`. Gated behind a new `logos-pretok` Cargo feature, default on; the legacy `SysRegex` path stays compiled under `cfg(not(feature = "logos-pretok"))` so downstream users can flip back with a single feature change during the bake-in.

The regex being replaced (`src/pre_tokenizers/byte_level.rs:43`) ran in every encode for GPT-2 / every byte-level tokenizer. It dispatches via FFI to `onig` (a C lib) because of the `(?!\S)` negative lookahead and `\p{L}`/`\p{N}` classes. Replacing it with a static DFA removes the per-call VM overhead and the FFI boundary.

Equivalence

Logos can't express lookahead, so the `\s+(?!\S)` semantics are replayed in a post-processing pass inside `LogosByteLevel::find_matches`, together with the ` ?` leading-space prefix. Under longest-match, `'t` (2 chars) beats `[^\s\p{L}\p{N}]+` (1 char) at the same position, whereas the legacy leftmost-first regex, after `\s+(?!\S)` backtracks, prefers the earlier `[^...]+` alternative and consumes the freed space + quote as a 2-char `Other` span. The post-processing pass detects `Whitespace(≥2) → Contraction` and splits the contraction into `Other(')` + `Letters(rest)`, extends `Other` to claim the freed ws char, and merges `Letters(rest)` into the following contiguous Letters span when appropriate. See the inline comment in `byte_level.rs` and the `big_txt_corpus` equivalence test for how this was found.

A new integration test (`tokenizers/tests/logos_bytelevel_equivalence.rs`) drives both `Pattern` impls side-by-side on the adversarial table and `data/big.txt` (≈130k lines). Both produce byte-for-byte identical `Vec<((usize, usize), bool)>`.

Benchmarks

`cargo bench --bench bpe_benchmark -- 'bpe-encode'` (GPT-2, the primary ByteLevel signal) shows the −22% encode win against baseline `bcdd25b9`. `cargo bench --bench llama3_benchmark` is flat, as expected: llama-3's serialized tokenizer uses a custom `Split` pre-tokenizer with a user-configured regex (dynamic, routed through `SysRegex`), not the hard-coded `ByteLevel` pattern. This PR does not touch that path.

Verification

Scope / caveats

- `\w+|[^\w\s]+` was also considered, but it's already on the fast pure-Rust `regex` crate; no meaningful win to chase.
- The `logos-pretok` feature is temporary. Intent is to remove it and the legacy path a release cycle after this lands, once no regressions surface.
- `LogosByteLevel` is exposed as `#[doc(hidden)] pub` purely so the equivalence integration test can drive it directly, not as a stable API.

🤖 Generated with Claude Code