
Reduce crate size #2015

Open: ArthurZucker wants to merge 16 commits into main from reduce-crate-size

Conversation

ArthurZucker (Collaborator) commented Apr 9, 2026

Reduce tokenizers crate size

Reduce the on-device library size of the tokenizers crate from 2.65 MB → 0.96 MB (64% reduction) for inference-only deployments.

Size reduction breakdown

Measured on macOS arm64, stripped cdylib, LTO fat, opt-level=s:

main (today, all features)                                     2.65 MB
├── drop deps (derive_builder, itertools, monostate, ahash)   -0.32 MB  (done)
├── feature-gate training/spm/parallel/unicode-norm           -0.34 MB  (done)
├── make regex optional                                       -0.65 MB  (done)
├── drop dary_heap serde feature                              -tiny     (done)
├── codegen-units = 1                                         -0.03 MB  (done, in profile)
├── panic = "abort"                                        \
├── -Zlocation-detail=none -Zfmt-debug=none                 } -0.35 MB combined
├── build-std=std,panic_abort optimize_for_size            /
└───────────────────────────────────────────────
    TOTAL with nightly build-std                               0.96 MB ✓
    TOTAL with stable (codegen-units=1, panic=abort)          ~1.18 MB

Comparison with Meta pytorch/tokenizers (C++)

| | Meta (C++) | HuggingFace (Rust, this PR) |
|---|---|---|
| Stripped binary | ~0.8 MB | 0.96 MB (nightly) / 1.18 MB (stable) |
| Models | BPE, SentencePiece, Tiktoken | BPE, WordPiece, Unigram, WordLevel |
| Normalizers | Replace, Prepend, NFC, Sequence | All of the above + NFKC/NFD/NFKD, Bert, Lowercase, Strip, StripAccents, ByteLevel, NMT |
| Pre-tokenizers | Regex, Digits, ByteLevel, Sequence | All of the above + Whitespace, Metaspace, Punctuation, Split, UnicodeScripts, BertPreTokenizer, CharDelimiter, FixedLength |
| Post-processors | TemplateProcessing, Sequence | All of the above + BertProcessing, RobertaProcessing, ByteLevel |
| Decoders | Basic token decoder | BPE, ByteLevel, WordPiece, Metaspace, CTC, ByteFallback, Fuse, Strip, Replace, Sequence |
| Added tokens | Basic special tokens only | Full AddedToken with single-word, lstrip/rstrip, normalized, per-token config |
| Training | ❌ | ✅ (feature-gated) |
| Padding / Truncation | ❌ | ✅ |
| Batch encoding | ❌ | ✅ (with optional rayon parallelism) |
| tokenizer.json loading | ✅ (partial — many post-processors are TODO) | ✅ (full) |

We're ~20% larger than Meta's C++ implementation while supporting significantly more features. The gap is primarily serde's JSON deserialization infrastructure (~225 KB) which Meta avoids by using hardcoded loaders.

What changed

Dependency cleanup:

  • Replaced derive_builder with manual builders
  • Replaced itertools with std equivalents
  • Replaced ahash with foldhash (smaller, faster on benchmarks)
  • Dropped monostate
  • Removed serde feature from dary_heap (never serialized)

Feature gates (all backward-compatible, enabled by default):

  • training — gates rand, esaxx-rs, compact_str, all trainer impls
  • spm — gates spm_precompiled, unicode-segmentation
  • parallel — gates rayon, rayon-cond
  • unicode-normalization — gates NFC/NFD/NFKC/NFKD normalizers
  • regex (new) — gates the Rust regex crate. When disabled, all regex operations use the system regex engine (onig or fancy-regex). Replaced 4 regex statics in added_vocabulary.rs with char ops, the Whitespace pre-tokenizer regex with SysRegex, and the Pattern impl for &str with str::match_indices.
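A minimal sketch of the kind of regex-free replacements described above (function names here are illustrative, not the crate's actual code): a char-class check standing in for a `\w`-style regex, and `str::match_indices` standing in for a compiled literal pattern.

```rust
// Illustrative stand-ins for the regex-free code paths described above.

/// Char-op replacement for a `\w`-style regex check.
fn is_word_char(c: char) -> bool {
    c.is_alphanumeric() || c == '_'
}

/// Find all (start, end) byte ranges of a literal pattern via
/// `str::match_indices`, instead of compiling a regex.
fn find_matches(haystack: &str, needle: &str) -> Vec<(usize, usize)> {
    haystack
        .match_indices(needle)
        .map(|(start, m)| (start, start + m.len()))
        .collect()
}

fn main() {
    assert!(is_word_char('a') && is_word_char('_') && !is_word_char(' '));
    assert_eq!(find_matches("a,b,c", ","), vec![(1, 2), (3, 4)]);
}
```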

Build profiles:

  • Added release-small profile with opt-level = "s", strip = true, panic = "abort", codegen-units = 1
  • Documented nightly build-std command for sub-1MB builds
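For reference, the profile described above would look roughly like this in Cargo.toml (a sketch matching the listed settings; `lto = "fat"` follows the measurement setup quoted at the top of this PR):

```toml
# Sketch of the release-small profile described above.
[profile.release-small]
inherits = "release"
opt-level = "s"       # optimize for size
strip = true          # strip symbols from the output
panic = "abort"       # drop unwinding machinery
codegen-units = 1     # allow more cross-unit optimization
lto = "fat"           # whole-program LTO, as used in the size measurements
```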

CI:

  • Added bundle size reporting via $GITHUB_STEP_SUMMARY to Rust and Python release workflows
  • Fixed macOS abi3 cross-compilation RUSTFLAGS in CI

Minimal inference-only configuration

```toml
# Cargo.toml — smallest possible, uses Oniguruma regex (C dep)
tokenizers = { version = "0.22", default-features = false, features = ["onig"] }

# Or pure Rust for WASM (no C dependencies)
tokenizers = { version = "0.22", default-features = false, features = ["unstable_wasm"] }
```

For the absolute smallest binary (nightly):

```bash
RUSTFLAGS="-Zlocation-detail=none -Zfmt-debug=none" cargo +nightly build \
  -Z build-std=std,panic_abort -Z build-std-features="optimize_for_size" \
  --target aarch64-apple-darwin --profile release-small
```

Benchmark results (ahash → foldhash)

Foldhash is equal-or-faster on all benchmarks. Training is 7-10% faster, encoding is 2-3% faster.

| Benchmark | ahash (main) | foldhash (this PR) | Delta |
|---|---|---|---|
| BPE GPT2 encode | 1.718 s | 1.668 s | -3.0% |
| BPE train large | 958 ms | 883 ms | -7.8% |
| llama3 encode | 1.687 s | 1.644 s | -2.6% |
| llama3 train big | 1.124 s | 1.016 s | -9.6% |
| unigram train big | 675 ms | 614 ms | -9.1% |
| BERT encode | 1.560 s | 1.558 s | -0.1% |
| BERT train big | 901 ms | 871 ms | -3.3% |

ArthurZucker and others added 6 commits April 9, 2026 17:27
…e_builder and itertools

- Remove `derive_builder` dep (5.6 MB rlib) - replace with manual builders for
  UnigramTrainer, WordLevelTrainer, and TemplateProcessing
- Remove `itertools` dep (2.6 MB rlib) - replace with manual dedup in CTC decoder,
  Box<dyn Iterator> in train_from_files, and Vec::join in template validation
- Add `training` feature flag (default on) - gates all trainer code, rand, esaxx-rs
- Add `spm` feature flag (default on) - gates spm_precompiled/nom/base64

Results with default features (full backward compat):
  - Direct deps: 27 -> 24 (-3)
  - Transitive deps: 119 -> 100 (-19 crates)
  - Zero new warnings

Results with --no-default-features --features "onig" (inference-only):
  - Direct deps: 20
  - Transitive deps: 81 (-38 crates vs original)
  - rlib: 11.6 MB (down from 13.6 MB)
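The "manual dedup in CTC decoder" replacement mentioned above can be sketched as follows (a hypothetical stand-alone version, not the crate's actual code): collapse consecutive duplicate tokens with a plain loop instead of `itertools::dedup`.

```rust
// Hypothetical sketch of replacing `itertools::dedup` with a manual loop,
// as in the CTC decoder change: collapse consecutive duplicate tokens
// while keeping non-adjacent repeats.
fn dedup_consecutive(tokens: &[&str]) -> Vec<String> {
    let mut out: Vec<String> = Vec::with_capacity(tokens.len());
    for &tok in tokens {
        // Only push when the token differs from the last one emitted.
        if out.last().map(String::as_str) != Some(tok) {
            out.push(tok.to_string());
        }
    }
    out
}

fn main() {
    let collapsed = dedup_consecutive(&["h", "h", "e", "l", "l", "l", "o", "o"]);
    assert_eq!(collapsed, ["h", "e", "l", "o"]);
}
```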
…ash/monostate

Change 1: Slim regex Unicode features (-6.2 MB rlib)
  - Use only unicode-perl instead of full unicode support
  - Only \p{L}, \p{N}, \w, \s are used in the codebase

Change 2: Feature-gate rayon ("parallel" feature, default on) (-9.4 MB rlib)
  - rayon + rayon-cond are now optional behind "parallel" feature
  - Serial-only fallback for on-device (single-core) deployments
  - Eliminates transitive itertools dependency from rayon-cond

Change 3: Replace monostate with impl_serde_type! macro (-0.4 MB rlib)
  - ByteFallback and Fuse now use existing impl_serde_type! macro
  - Removes monostate + monostate-impl dependencies

Change 4: Replace ahash with foldhash (-16 MB rlib)
  - foldhash is 116 KB with zero deps vs ahash 291 KB + zerocopy 15.7 MB
  - AHashMap/AHashSet type aliases now use foldhash::fast::FixedState
  - Eliminates zerocopy + zerocopy-derive (proc-macro bloat)

Combined results for inference-only (--no-default-features --features "onig"):
  - Runtime dep rlib total: 84.6 MB -> 55.2 MB (-29.4 MB, -35%)
  - Excluding compile-time-only proc-macros: 44.8 MB (below 50 MB target)
  - Transitive deps: 81 -> 57 (-24 crates)
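The `AHashMap`/`AHashSet` alias swap described in Change 4 boils down to pointing the type aliases at a different `BuildHasher`. A dependency-free sketch of the pattern, with a stdlib hasher standing in where `foldhash::fast::FixedState` would plug in the same way:

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::{HashMap, HashSet};
use std::hash::BuildHasherDefault;

// Sketch: keep the `AHashMap`/`AHashSet` alias names, change only the
// BuildHasher behind them. In the PR this slot holds
// `foldhash::fast::FixedState`; the stdlib hasher is a stand-in so this
// example has no external dependencies.
type AHashMap<K, V> = HashMap<K, V, BuildHasherDefault<DefaultHasher>>;
type AHashSet<T> = HashSet<T, BuildHasherDefault<DefaultHasher>>;

fn main() {
    let mut vocab: AHashMap<String, u32> = AHashMap::default();
    vocab.insert("hello".to_string(), 0);

    let mut special: AHashSet<&str> = AHashSet::default();
    special.insert("<s>");

    assert_eq!(vocab["hello"], 0);
    assert!(special.contains("<s>"));
}
```

Call sites are unaffected because only the alias definition changes.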
…tation

- Add `unicode-normalization` feature (default on) gating NFC/NFD/NFKC/NFKD
  normalizers and the unicode-normalization-alignments dep (2.5 MB rlib)
- Make `compact_str` optional, only pulled in by `training` feature (0.9 MB)
- Make `unicode-segmentation` optional, only pulled in by `spm` feature (1.2 MB)

Inference-only rlib total: 55.2 MB -> 47.1 MB (-8.1 MB)
Excluding proc-macros: 36.7 MB
Transitive deps: 57 -> 29 (inference-only)
…ment guide

- Document all feature flags with what deps they save
- Add on-device/embedded configuration examples
- Add measured bundle sizes (.dylib vs .a vs final link)
- Add comparison with Meta pytorch/tokenizers (C++)
- Add step-by-step instructions to measure bundle size
- Add CI regression test script
…singBuilder API

- Replace `use ahash::AHashMap` with `use tokenizers::utils::AHashMap` in
  Python binding, Node binding, and integration tests
- Remove `ahash` direct dependency from Python and Node binding Cargo.toml
- Add `.single()` and `.pair()` methods to TemplateProcessingBuilder
  (non-try versions needed by Python binding)
Add crate/wheel size reporting via $GITHUB_STEP_SUMMARY to both Rust and
Python release workflows. Run cargo fmt across the codebase.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@HuggingFaceDocBuilderDev commented:

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

ArthurZucker and others added 3 commits April 10, 2026 12:12
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
ArthurZucker (Collaborator, Author) commented:

/benchmark

ArthurZucker (Collaborator, Author) commented:

/benchmark

ArthurZucker (Collaborator, Author) commented:

/benchmark

github-actions commented:

Python Benchmark Results

Commit: 994022f1f942484156d7b339c27e6bfebbdacb18

Python Benchmarks

github-actions commented:

Rust Benchmark Results

Commit: 994022f1f942484156d7b339c27e6bfebbdacb18

Rust Benchmarks

ArthurZucker and others added 5 commits April 11, 2026 08:31
…ll profile

- Make `regex` crate an optional dependency (feature-gated, on by default).
  When disabled, all regex usage replaced with char ops + SysRegex (onig/fancy-regex).
  Saves ~650 KB in the linked binary.

- Replace regex::Regex in added_vocabulary with char operations (is_word_char)
- Replace regex::Regex in Whitespace pre-tokenizer with SysRegex
- Replace Pattern for &str impl: regex → str::match_indices
- Add regex_escape() and is_word_char() utilities
- Gate Pattern for &regex::Regex behind #[cfg(feature = "regex")]

- Python bindings: add strip = true and lto = "fat" to release profile
  (7.66 MB → 5.55 MB, -27.5%)

- Add release-small profile (opt-level=s, strip, panic=abort, codegen-units=1)
- Drop unused serde feature from dary_heap
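The `regex_escape()` utility mentioned above can be sketched like this (a hypothetical version; the exact metacharacter set in the crate may differ): backslash-escape regex metacharacters so a literal string can be safely embedded in a pattern handed to the system regex engine (onig / fancy-regex).

```rust
// Hypothetical sketch of a `regex_escape`-style helper: escape regex
// metacharacters so a literal string can be embedded in a pattern for
// the system regex engine. The metacharacter set below is illustrative.
fn regex_escape(literal: &str) -> String {
    let mut out = String::with_capacity(literal.len());
    for c in literal.chars() {
        if "\\.+*?()|[]{}^$".contains(c) {
            out.push('\\');
        }
        out.push(c);
    }
    out
}

fn main() {
    assert_eq!(regex_escape("a.b"), "a\\.b");
    assert_eq!(regex_escape("plain"), "plain");
}
```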

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Bundles onig + unicode-normalization + spm — all inference capabilities
without training, parallel, regex, or progressbar.

Build with: cargo build --profile release-small --no-default-features --features inference

Measured: 1.45 MB (stable, panic=abort, codegen-units=1, LTO fat, opt-s, strip)

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Route wrapper enum deserialization through `from_str` instead of
`serde_json::from_value`. This eliminates 21 monomorphized copies of the
Value deserializer infrastructure (~66 KB savings).

The tradeoff is one extra Value→String→T roundtrip at tokenizer load time
(negligible for a one-time operation).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The .rlib includes unused code that the final linker strips — it's not
representative of on-device size. Build a minimal cdylib that links
tokenizers with each feature set and measure the stripped output.

For Python wheels, also extract the wheel and report the installed
.so/.pyd size (what actually loads at runtime) in addition to the
compressed wheel size.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>