…e_builder and itertools

- Remove `derive_builder` dep (5.6 MB rlib) - replace with manual builders for UnigramTrainer, WordLevelTrainer, and TemplateProcessing
- Remove `itertools` dep (2.6 MB rlib) - replace with manual dedup in the CTC decoder, Box<dyn Iterator> in train_from_files, and Vec::join in template validation
- Add `training` feature flag (default on) - gates all trainer code, rand, esaxx-rs
- Add `spm` feature flag (default on) - gates spm_precompiled/nom/base64

Results with default features (full backward compat):
- Direct deps: 27 -> 24 (-3)
- Transitive deps: 119 -> 100 (-19 crates)
- Zero new warnings

Results with --no-default-features --features "onig" (inference-only):
- Direct deps: 20
- Transitive deps: 81 (-38 crates vs original)
- rlib: 11.6 MB (down from 13.6 MB)
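Replacing `derive_builder` with a hand-written builder is mostly mechanical. A minimal, self-contained sketch of the pattern (struct fields and defaults here are illustrative, not the real trainer's):

```rust
// Illustrative sketch of the manual-builder pattern that replaces
// derive_builder. Field names and defaults are hypothetical.
#[derive(Debug, Clone, PartialEq)]
pub struct WordLevelTrainer {
    min_frequency: u64,
    vocab_size: usize,
}

#[derive(Default)]
pub struct WordLevelTrainerBuilder {
    min_frequency: Option<u64>,
    vocab_size: Option<usize>,
}

impl WordLevelTrainerBuilder {
    pub fn min_frequency(mut self, v: u64) -> Self {
        self.min_frequency = Some(v);
        self
    }
    pub fn vocab_size(mut self, v: usize) -> Self {
        self.vocab_size = Some(v);
        self
    }
    // Unset fields fall back to defaults, mirroring derive_builder's
    // `default` attribute behavior.
    pub fn build(self) -> WordLevelTrainer {
        WordLevelTrainer {
            min_frequency: self.min_frequency.unwrap_or(0),
            vocab_size: self.vocab_size.unwrap_or(30_000),
        }
    }
}

fn main() {
    let t = WordLevelTrainerBuilder::default()
        .min_frequency(2)
        .vocab_size(8_000)
        .build();
    assert_eq!(t, WordLevelTrainer { min_frequency: 2, vocab_size: 8_000 });
}
```

The trade is a few dozen lines of boilerplate per builder in exchange for dropping a proc-macro dependency from every downstream build.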
…ash/monostate
Change 1: Slim regex Unicode features (-6.2 MB rlib)
- Use only unicode-perl instead of full unicode support
- Only \p{L}, \p{N}, \w, \s are used in the codebase
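Concretely, the slimming amounts to turning off regex's default Unicode tables and re-enabling only the Perl character classes those patterns need. A sketch of the likely Cargo.toml line (the exact version requirement is illustrative):

```toml
# Only unicode-perl is needed for \p{L}, \p{N}, \w and \s;
# the default "unicode" feature pulls in every Unicode table.
regex = { version = "1", default-features = false, features = [
    "std",
    "perf",
    "unicode-perl",
] }
```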
Change 2: Feature-gate rayon ("parallel" feature, default on) (-9.4 MB rlib)
- rayon + rayon-cond are now optional behind "parallel" feature
- Serial-only fallback for on-device (single-core) deployments
- Eliminates transitive itertools dependency from rayon-cond
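The gating is plain optional dependencies behind a cargo feature; a sketch of the wiring (version requirements assumed, not copied from the PR):

```toml
[dependencies]
rayon = { version = "1", optional = true }
rayon-cond = { version = "0.3", optional = true }

[features]
default = ["parallel"]
# Disable for serial-only, on-device builds:
#   cargo build --no-default-features
parallel = ["dep:rayon", "dep:rayon-cond"]
```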
Change 3: Replace monostate with impl_serde_type! macro (-0.4 MB rlib)
- ByteFallback and Fuse now use existing impl_serde_type! macro
- Removes monostate + monostate-impl dependencies
Change 4: Replace ahash with foldhash (-16 MB rlib)
- foldhash is 116 KB with zero deps vs ahash 291 KB + zerocopy 15.7 MB
- AHashMap/AHashSet type aliases now use foldhash::fast::FixedState
- Eliminates zerocopy + zerocopy-derive (proc-macro bloat)
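Because callers go through the `AHashMap`/`AHashSet` aliases, the swap is confined to the hasher type parameter. A stdlib-only sketch of the alias pattern, with `RandomState` standing in for `foldhash::fast::FixedState` (which is what the real aliases point at):

```rust
use std::collections::hash_map::RandomState;
use std::collections::{HashMap, HashSet};

// Sketch of the alias pattern: callers keep using AHashMap/AHashSet while
// the hasher behind them changes in one place. RandomState is a stdlib
// stand-in here; the crate uses foldhash::fast::FixedState instead.
pub type AHashMap<K, V, S = RandomState> = HashMap<K, V, S>;
pub type AHashSet<T, S = RandomState> = HashSet<T, S>;

fn main() {
    let mut vocab: AHashMap<String, u32> = AHashMap::default();
    vocab.insert("hello".to_string(), 0);
    assert_eq!(vocab.get("hello"), Some(&0));
    let _ids: AHashSet<u32> = AHashSet::default();
}
```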
Combined results for inference-only (--no-default-features --features "onig"):
- Runtime dep rlib total: 84.6 MB -> 55.2 MB (-29.4 MB, -35%)
- Excluding compile-time-only proc-macros: 44.8 MB (below 50 MB target)
- Transitive deps: 81 -> 57 (-24 crates)
…tation

- Add `unicode-normalization` feature (default on) gating NFC/NFD/NFKC/NFKD normalizers and the unicode-normalization-alignments dep (2.5 MB rlib)
- Make `compact_str` optional, only pulled in by the `training` feature (0.9 MB)
- Make `unicode-segmentation` optional, only pulled in by the `spm` feature (1.2 MB)

Inference-only rlib total: 55.2 MB -> 47.1 MB (-8.1 MB)
Excluding proc-macros: 36.7 MB
Transitive deps: 57 -> 29 (inference-only)
…ment guide

- Document all feature flags with the deps they save
- Add on-device/embedded configuration examples
- Add measured bundle sizes (.dylib vs .a vs final link)
- Add comparison with Meta pytorch/tokenizers (C++)
- Add step-by-step instructions to measure bundle size
- Add CI regression test script
…singBuilder API

- Replace `use ahash::AHashMap` with `use tokenizers::utils::AHashMap` in the Python binding, Node binding, and integration tests
- Remove the `ahash` direct dependency from the Python and Node binding Cargo.toml
- Add `.single()` and `.pair()` methods to TemplateProcessingBuilder (non-try versions needed by the Python binding)
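A minimal self-contained sketch of the non-try wrapper idea. The real builder's template type, parsing, and error handling are richer; `Template` and the parsing logic here are stand-ins:

```rust
// Stand-in for the real template representation.
#[derive(Debug, Clone, Default, PartialEq)]
struct Template(Vec<String>);

#[derive(Default)]
struct TemplateProcessingBuilder {
    single: Option<Template>,
    pair: Option<Template>, // `.pair()`/`.try_pair()` follow the same shape
}

impl TemplateProcessingBuilder {
    // Fallible setter: parses the template string, returns Err on failure.
    fn try_single(&mut self, s: &str) -> Result<&mut Self, String> {
        if s.trim().is_empty() {
            return Err("empty template".into());
        }
        self.single = Some(Template(
            s.split_whitespace().map(String::from).collect(),
        ));
        Ok(self)
    }

    // Non-try convenience wrapper of the kind a binding layer wants:
    // panics on an invalid template instead of surfacing a Result.
    fn single(&mut self, s: &str) -> &mut Self {
        self.try_single(s).expect("invalid single template");
        self
    }
}

fn main() {
    let mut b = TemplateProcessingBuilder::default();
    b.single("[CLS] $A [SEP]");
    assert!(b.single.is_some());
}
```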
Add crate/wheel size reporting via $GITHUB_STEP_SUMMARY to both Rust and Python release workflows. Run cargo fmt across the codebase. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
/benchmark
…ll profile

- Make the `regex` crate an optional dependency (feature-gated, on by default). When disabled, all regex usage is replaced with char ops + SysRegex (onig/fancy-regex). Saves ~650 KB in the linked binary.
- Replace regex::Regex in added_vocabulary with char operations (is_word_char)
- Replace regex::Regex in the Whitespace pre-tokenizer with SysRegex
- Replace the Pattern for &str impl: regex → str::match_indices
- Add regex_escape() and is_word_char() utilities
- Gate Pattern for &regex::Regex behind #[cfg(feature = "regex")]
- Python bindings: add strip = true and lto = "fat" to the release profile (7.66 MB → 5.55 MB, -27.5%)
- Add release-small profile (opt-level=s, strip, panic=abort, codegen-units=1)
- Drop the unused serde feature from dary_heap

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
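The two utilities named above could be sketched roughly as follows; the exact character class and escape set are assumptions, not the PR's code:

```rust
// Assumed behavior of is_word_char: a char-op stand-in for the removed
// \w-style regex checks (Unicode alphanumeric or underscore).
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

// Assumed behavior of regex_escape: backslash-escape metacharacters so a
// literal string can be embedded in a pattern handed to the system regex
// engine (onig or fancy-regex).
fn regex_escape(s: &str) -> String {
    let mut out = String::with_capacity(s.len());
    for c in s.chars() {
        if r"\.+*?()|[]{}^$".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

fn main() {
    assert!(is_word_char('é'));
    assert!(!is_word_char('-'));
    assert_eq!(regex_escape("a.b*"), r"a\.b\*");
}
```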
Bundles onig + unicode-normalization + spm — all inference capabilities without training, parallel, regex, or progressbar.

Build with: cargo build --profile release-small --no-default-features --features inference

Measured: 1.45 MB (stable, panic=abort, codegen-units=1, LTO fat, opt-s, strip)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Route wrapper enum deserialization through `from_str` instead of `serde_json::from_value`. This eliminates 21 monomorphized copies of the Value deserializer infrastructure (~66 KB savings). The tradeoff is one extra Value→String→T roundtrip at tokenizer load time (negligible for a one-time operation). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…rphization" This reverts commit 8976c12.
The .rlib includes unused code that the final linker strips — it's not representative of on-device size. Build a minimal cdylib that links tokenizers with each feature set and measure the stripped output. For Python wheels, also extract the wheel and report the installed .so/.pyd size (what actually loads at runtime) in addition to the compressed wheel size. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
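The measurement flow described above could look roughly like this; the probe crate name, target paths, and feature list are illustrative, not taken from the repo:

```shell
# Build a minimal cdylib for one feature set and report the size the
# linker actually produces, rather than the rlib. "size-probe" is a
# hypothetical tiny crate with crate-type = ["cdylib"] that depends on
# tokenizers with the feature set under test.
cargo build --profile release-small \
  --no-default-features --features onig \
  -p size-probe

LIB=target/release-small/libsize_probe.dylib
strip -x "$LIB"
du -k "$LIB"

# For a Python wheel: report installed size, not just compressed size.
unzip -o dist/tokenizers-*.whl -d /tmp/wheel
du -k /tmp/wheel/tokenizers/*.so
```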


Reduce tokenizers crate size
Reduce the on-device library size of the `tokenizers` crate from 2.65 MB → 0.96 MB (64% reduction) for inference-only deployments.

Size reduction breakdown
Measured on macOS arm64, stripped cdylib, LTO fat, opt-level=s:
Comparison with Meta pytorch/tokenizers (C++)
- `AddedToken` with single-word, lstrip/rstrip, normalized, per-token config
- `tokenizer.json` loading

We're ~20% larger than Meta's C++ implementation while supporting significantly more features. The gap is primarily serde's JSON deserialization infrastructure (~225 KB), which Meta avoids by using hardcoded loaders.
What changed
Dependency cleanup:
- Replace `derive_builder` with manual builders
- Replace `itertools` with std equivalents
- Replace `ahash` with `foldhash` (smaller, faster on benchmarks)
- Remove `monostate`
- Drop the `serde` feature from `dary_heap` (never serialized)

Feature gates (all backward-compatible, enabled by default):
- `training` — gates rand, esaxx-rs, compact_str, all trainer impls
- `spm` — gates spm_precompiled, unicode-segmentation
- `parallel` — gates rayon, rayon-cond
- `unicode-normalization` — gates NFC/NFD/NFKC/NFKD normalizers
- `regex` — new: gates the Rust `regex` crate. When disabled, all regex operations use the system regex engine (onig or fancy-regex). Replaced 4 regex statics in `added_vocabulary.rs` with char ops, the `Whitespace` pre-tokenizer with `SysRegex`, and `Pattern for &str` with `str::match_indices`.

Build profiles:
- `release-small` profile with `opt-level = "s"`, `strip = true`, `panic = "abort"`, `codegen-units = 1`
- `build-std` command for sub-1MB builds

CI:
- Crate/wheel size reporting via `$GITHUB_STEP_SUMMARY` in the Rust and Python release workflows
- `RUSTFLAGS` in CI

Minimal inference-only configuration
For the absolute smallest binary (nightly):
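The command itself didn't survive extraction here. Based on the profile settings and the `build-std` mention above, it presumably combines the `release-small` profile with a nightly `-Z build-std` build; a sketch under those assumptions (target triple illustrative):

```shell
# Sketch, not the PR's exact command: rebuild std with panic_abort so the
# panic machinery is stripped along with the rest (requires nightly).
cargo +nightly build --profile release-small \
  --no-default-features --features inference \
  -Z build-std=std,panic_abort \
  -Z build-std-features=panic_immediate_abort \
  --target aarch64-apple-darwin
```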
Benchmark results (ahash → foldhash)
Foldhash is equal-or-faster on all benchmarks. Training is 7-10% faster, encoding is 2-3% faster.