
Reduce crate size #2015

Open: ArthurZucker wants to merge 16 commits into main from reduce-crate-size

Conversation

ArthurZucker (Collaborator) commented Apr 9, 2026

Reduce tokenizers crate size

Reduce the on-device library size of the tokenizers crate from 2.65 MB → 0.96 MB (64% reduction) for inference-only deployments.

Size reduction breakdown

Measured on macOS arm64, stripped cdylib, LTO fat, opt-level=s:

main (today, all features)                                     2.65 MB
├── drop deps (derive_builder, itertools, monostate, ahash)   -0.32 MB  (done)
├── feature-gate training/spm/parallel/unicode-norm           -0.34 MB  (done)
├── make regex optional                                       -0.65 MB  (done)
├── drop dary_heap serde feature                              -tiny     (done)
├── codegen-units = 1                                         -0.03 MB  (done, in profile)
├── panic = "abort"                                        \
├── -Zlocation-detail=none -Zfmt-debug=none                 } -0.35 MB combined
├── build-std=std,panic_abort optimize_for_size            /
└───────────────────────────────────────────────
    TOTAL with nightly build-std                               0.96 MB ✓
    TOTAL with stable (codegen-units=1, panic=abort)          ~1.18 MB

Comparison with Meta pytorch/tokenizers (C++)

| | Meta (C++) | HuggingFace (Rust, this PR) |
|---|---|---|
| Stripped binary | ~0.8 MB | 0.96 MB (nightly) / 1.18 MB (stable) |
| Models | BPE, SentencePiece, Tiktoken | BPE, WordPiece, Unigram, WordLevel |
| Normalizers | Replace, Prepend, NFC, Sequence | All of the above + NFKC/NFD/NFKD, Bert, Lowercase, Strip, StripAccents, ByteLevel, NMT |
| Pre-tokenizers | Regex, Digits, ByteLevel, Sequence | All of the above + Whitespace, Metaspace, Punctuation, Split, UnicodeScripts, BertPreTokenizer, CharDelimiter, FixedLength |
| Post-processors | TemplateProcessing, Sequence | All of the above + BertProcessing, RobertaProcessing, ByteLevel |
| Decoders | Basic token decoder | BPE, ByteLevel, WordPiece, Metaspace, CTC, ByteFallback, Fuse, Strip, Replace, Sequence |
| Added tokens | Basic special tokens only | Full AddedToken with single-word, lstrip/rstrip, normalized, per-token config |
| Training | ❌ | ✅ (feature-gated) |
| Padding / Truncation | ❌ | ✅ |
| Batch encoding | ❌ | ✅ (with optional rayon parallelism) |
| tokenizer.json loading | ✅ (partial — many post-processors are TODO) | ✅ (full) |

We're ~20% larger than Meta's C++ implementation while supporting significantly more features. The gap is primarily serde's JSON deserialization infrastructure (~225 KB) which Meta avoids by using hardcoded loaders.

What changed

Dependency cleanup:

  • Replaced derive_builder with manual builders
  • Replaced itertools with std equivalents
  • Replaced ahash with foldhash (smaller, faster on benchmarks)
  • Dropped monostate
  • Removed serde feature from dary_heap (never serialized)

Feature gates (all backward-compatible, enabled by default):

  • training — gates rand, esaxx-rs, compact_str, all trainer impls
  • spm — gates spm_precompiled, unicode-segmentation
  • parallel — gates rayon, rayon-cond
  • unicode-normalization — gates NFC/NFD/NFKC/NFKD normalizers
  • regex (new) — gates the Rust regex crate. When disabled, all regex operations use the system regex engine (onig or fancy-regex). Replaced 4 regex statics in added_vocabulary.rs with char ops, the Whitespace pre-tokenizer regex with SysRegex, and the Pattern impl for &str with str::match_indices.
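A minimal sketch of the kind of regex-free replacements described above (function names here are illustrative, not the crate's actual code): a char-class check standing in for a `\w`-style regex, and `str::match_indices` standing in for a compiled literal pattern.

```rust
// Illustrative stand-ins for the regex-free code paths described above.

/// Char-op replacement for a `\w`-style regex check.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Find all (start, end) byte ranges of a literal pattern via
/// `str::match_indices`, instead of compiling a regex.
fn find_matches(haystack: &str, needle: &str) -> Vec<(usize, usize)> {
    haystack
        .match_indices(needle)
        .map(|(start, m)| (start, start + m.len()))
        .collect()
}

fn main() {
    assert!(is_word_char('a') && is_word_char('_') && !is_word_char(' '));
    assert_eq!(find_matches("a,b,c", ","), vec![(1, 2), (3, 4)]);
}
```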

Build profiles:

  • Added release-small profile with opt-level = "s", strip = true, panic = "abort", codegen-units = 1
  • Documented nightly build-std command for sub-1MB builds
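For reference, the profile described above would look roughly like this in Cargo.toml (a sketch matching the listed settings; `lto = "fat"` follows the measurement setup quoted at the top of this PR):

```toml
# Sketch of the release-small profile described above.
[profile.release-small]
inherits = "release"
opt-level = "s"       # optimize for size
strip = true          # strip symbols from the output
panic = "abort"       # drop unwinding machinery
codegen-units = 1     # allow more cross-unit optimization
lto = "fat"           # whole-program LTO, as used in the size measurements
```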

CI:

  • Added bundle size reporting via $GITHUB_STEP_SUMMARY to Rust and Python release workflows
  • Fixed macOS abi3 cross-compilation RUSTFLAGS in CI

Minimal inference-only configuration

```toml
# Cargo.toml — smallest possible, uses Oniguruma regex (C dep)
tokenizers = { version = "0.22", default-features = false, features = ["onig"] }

# Or pure Rust for WASM (no C dependencies)
tokenizers = { version = "0.22", default-features = false, features = ["unstable_wasm"] }
```

For the absolute smallest binary (nightly):

```bash
RUSTFLAGS="-Zlocation-detail=none -Zfmt-debug=none" cargo +nightly build \
  -Z build-std=std,panic_abort -Z build-std-features="optimize_for_size" \
  --target aarch64-apple-darwin --profile release-small
```

Benchmark results (ahash → foldhash)

Foldhash is equal-or-faster on all benchmarks. Training is 7-10% faster, encoding is 2-3% faster.

| Benchmark | ahash (main) | foldhash (this PR) | Delta |
|---|---|---|---|
| BPE GPT2 encode | 1.718 s | 1.668 s | -3.0% |
| BPE train large | 958 ms | 883 ms | -7.8% |
| llama3 encode | 1.687 s | 1.644 s | -2.6% |
| llama3 train big | 1.124 s | 1.016 s | -9.6% |
| unigram train big | 675 ms | 614 ms | -9.1% |
| BERT encode | 1.560 s | 1.558 s | -0.1% |
| BERT train big | 901 ms | 871 ms | -3.3% |

ArthurZucker and others added 6 commits April 9, 2026 17:27
…e_builder and itertools

- Remove `derive_builder` dep (5.6 MB rlib) - replace with manual builders for
  UnigramTrainer, WordLevelTrainer, and TemplateProcessing
- Remove `itertools` dep (2.6 MB rlib) - replace with manual dedup in CTC decoder,
  Box<dyn Iterator> in train_from_files, and Vec::join in template validation
- Add `training` feature flag (default on) - gates all trainer code, rand, esaxx-rs
- Add `spm` feature flag (default on) - gates spm_precompiled/nom/base64

Results with default features (full backward compat):
  - Direct deps: 27 -> 24 (-3)
  - Transitive deps: 119 -> 100 (-19 crates)
  - Zero new warnings

Results with --no-default-features --features "onig" (inference-only):
  - Direct deps: 20
  - Transitive deps: 81 (-38 crates vs original)
  - rlib: 11.6 MB (down from 13.6 MB)
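The "manual dedup in CTC decoder" replacement mentioned above can be sketched as follows (a hypothetical stand-alone version, not the crate's actual code): collapse consecutive duplicate tokens with a plain loop instead of `itertools::dedup`.

```rust
// Hypothetical sketch of replacing `itertools::dedup` with a manual loop,
// as in the CTC decoder change: collapse consecutive duplicate tokens
// while keeping non-adjacent repeats.
fn dedup_consecutive(tokens: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(tokens.len());
    for &tok in tokens {
        // Only push when the token differs from the last one emitted.
        if out.last().map(String::as_str) != Some(tok) {
            out.push(tok.to_string());
        }
    }
    out
}

fn main() {
    let collapsed = dedup_consecutive(&["h", "h", "e", "l", "l", "l", "o", "o"]);
    assert_eq!(collapsed, ["h", "e", "l", "o"]);
}
```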
…ash/monostate

Change 1: Slim regex Unicode features (-6.2 MB rlib)
  - Use only unicode-perl instead of full unicode support
  - Only \p{L}, \p{N}, \w, \s are used in the codebase

Change 2: Feature-gate rayon ("parallel" feature, default on) (-9.4 MB rlib)
  - rayon + rayon-cond are now optional behind "parallel" feature
  - Serial-only fallback for on-device (single-core) deployments
  - Eliminates transitive itertools dependency from rayon-cond

Change 3: Replace monostate with impl_serde_type! macro (-0.4 MB rlib)
  - ByteFallback and Fuse now use existing impl_serde_type! macro
  - Removes monostate + monostate-impl dependencies

Change 4: Replace ahash with foldhash (-16 MB rlib)
  - foldhash is 116 KB with zero deps vs ahash 291 KB + zerocopy 15.7 MB
  - AHashMap/AHashSet type aliases now use foldhash::fast::FixedState
  - Eliminates zerocopy + zerocopy-derive (proc-macro bloat)

Combined results for inference-only (--no-default-features --features "onig"):
  - Runtime dep rlib total: 84.6 MB -> 55.2 MB (-29.4 MB, -35%)
  - Excluding compile-time-only proc-macros: 44.8 MB (below 50 MB target)
  - Transitive deps: 81 -> 57 (-24 crates)
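The `AHashMap`/`AHashSet` alias swap described in Change 4 boils down to pointing the type aliases at a different `BuildHasher`. A dependency-free sketch of the pattern, with a stdlib hasher standing in where `foldhash::fast::FixedState` would plug in the same way:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::BuildHasherDefault;

// Sketch: keep the `AHashMap`/`AHashSet` alias names, change only the
// BuildHasher behind them. In the PR this slot holds
// `foldhash::fast::FixedState`; the stdlib hasher is a stand-in so this
// example has no external dependencies.
type AHashMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;
type AHashSet<T> = HashSet<T, BuildHasherDefault<DefaultHasher>>;

fn main() {
    let mut vocab: AHashMap<String, u32> = AHashMap::default();
    vocab.insert("hello".to_string(), 0);

    let mut special: AHashSet<&str> = AHashSet::default();
    special.insert("<s>");

    assert_eq!(vocab["hello"], 0);
    assert!(special.contains("<s>"));
}
```

Call sites are unaffected because only the alias definition changes.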
…tation

- Add `unicode-normalization` feature (default on) gating NFC/NFD/NFKC/NFKD
  normalizers and the unicode-normalization-alignments dep (2.5 MB rlib)
- Make `compact_str` optional, only pulled in by `training` feature (0.9 MB)
- Make `unicode-segmentation` optional, only pulled in by `spm` feature (1.2 MB)

Inference-only rlib total: 55.2 MB -> 47.1 MB (-8.1 MB)
Excluding proc-macros: 36.7 MB
Transitive deps: 57 -> 29 (inference-only)
…ment guide

- Document all feature flags with what deps they save
- Add on-device/embedded configuration examples
- Add measured bundle sizes (.dylib vs .a vs final link)
- Add comparison with Meta pytorch/tokenizers (C++)
- Add step-by-step instructions to measure bundle size
- Add CI regression test script
…singBuilder API

- Replace `use ahash::AHashMap` with `use tokenizers::utils::AHashMap` in
  Python binding, Node binding, and integration tests
- Remove `ahash` direct dependency from Python and Node binding Cargo.toml
- Add `.single()` and `.pair()` methods to TemplateProcessingBuilder
  (non-try versions needed by Python binding)
Add crate/wheel size reporting via $GITHUB_STEP_SUMMARY to both Rust and
Python release workflows. Run cargo fmt across the codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker and others added 3 commits April 10, 2026 12:12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ArthurZucker (Collaborator, Author) commented:

/benchmark

ArthurZucker (Collaborator, Author) commented:

/benchmark

ArthurZucker (Collaborator, Author) commented:

/benchmark

github-actions commented:

Python Benchmark Results

Commit: 994022f1f942484156d7b339c27e6bfebbdacb18

Python Benchmarks

github-actions commented:

Rust Benchmark Results

Commit: 994022f1f942484156d7b339c27e6bfebbdacb18

Rust Benchmarks

ArthurZucker and others added 5 commits April 11, 2026 08:31
…ll profile

- Make `regex` crate an optional dependency (feature-gated, on by default).
  When disabled, all regex usage replaced with char ops + SysRegex (onig/fancy-regex).
  Saves ~650 KB in the linked binary.

- Replace regex::Regex in added_vocabulary with char operations (is_word_char)
- Replace regex::Regex in Whitespace pre-tokenizer with SysRegex
- Replace Pattern for &str impl: regex → str::match_indices
- Add regex_escape() and is_word_char() utilities
- Gate Pattern for &regex::Regex behind #[cfg(feature = "regex")]

- Python bindings: add strip = true and lto = "fat" to release profile
  (7.66 MB → 5.55 MB, -27.5%)

- Add release-small profile (opt-level=s, strip, panic=abort, codegen-units=1)
- Drop unused serde feature from dary_heap
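The `regex_escape()` utility mentioned above can be sketched like this (a hypothetical version; the exact metacharacter set in the crate may differ): backslash-escape regex metacharacters so a literal string can be safely embedded in a pattern handed to the system regex engine (onig / fancy-regex).

```rust
// Hypothetical sketch of a `regex_escape`-style helper: escape regex
// metacharacters so a literal string can be embedded in a pattern for
// the system regex engine. The metacharacter set below is illustrative.
fn regex_escape(literal: &str) -> String {
    let mut out = String::with_capacity(literal.len());
    for c in literal.chars() {
        if "\\.+*?()|[]{}^$".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

fn main() {
    assert_eq!(regex_escape("a.b"), "a\\.b");
    assert_eq!(regex_escape("plain"), "plain");
}
```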

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles onig + unicode-normalization + spm — all inference capabilities
without training, parallel, regex, or progressbar.

Build with: cargo build --profile release-small --no-default-features --features inference

Measured: 1.45 MB (stable, panic=abort, codegen-units=1, LTO fat, opt-s, strip)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Route wrapper enum deserialization through `from_str` instead of
`serde_json::from_value`. This eliminates 21 monomorphized copies of the
Value deserializer infrastructure (~66 KB savings).

The tradeoff is one extra Value→String→T roundtrip at tokenizer load time
(negligible for a one-time operation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The .rlib includes unused code that the final linker strips — it's not
representative of on-device size. Build a minimal cdylib that links
tokenizers with each feature set and measure the stripped output.

For Python wheels, also extract the wheel and report the installed
.so/.pyd size (what actually loads at runtime) in addition to the
compressed wheel size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>