
Real-world (batch × input_length) tokenizer benchmark + cross-library leaderboard #2030

Open
ArthurZucker wants to merge 1 commit into main from full-bench

Conversation

@ArthurZucker
Collaborator

Summary

  • Rewrites bindings/python/benches/test_tiktoken.py into a standardized (batch_size × input_length) matrix bench mirroring fastokens and wordchipper ablation knobs. Samples are sourced from zai-org/LongBench-v2 and truncated/repeated to exact token lengths (same helper as fastokens' _adjust_tokens).
  • Five backends on a uniform encode/decode API, gracefully skipped when unavailable: tokenizers, tiktoken, wordchipper, iree.tokenizer, and bpe (via bpe-openai).
  • Both encode and decode are timed (best of warmup + iters). rich renders live colored tables with per-row winner and geo-mean summary panels.
  • Adds --hf-models … to sweep arbitrary HF repo ids (Qwen / DeepSeek / GLM-4.5 / Mistral-Nemo / Yi / starcoder2 / gpt-neox / falcon / Llama-3, etc.) and print a cross-model leaderboard.
  • Cross-backend correctness probe runs before timing. All five backends agree on the canonical probe for cl100k_base / o200k_base; tokenizers + tiktoken + iree agree on llama-3.
  • Fairness preflight prints CPU model, load avg, pinned CPUs, governor; --strict-fairness aborts above 50% nproc. Results can be saved via --save-json / --save-md.
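The truncate/repeat length adjustment mentioned above can be sketched as follows. This is a hypothetical re-implementation of the idea, not fastokens' actual `_adjust_tokens` helper, whose details may differ:

```python
def adjust_to_length(ids, target_len):
    """Truncate or repeat a list of token ids to an exact target length."""
    if not ids:
        raise ValueError("need at least one token id")
    if len(ids) >= target_len:
        return ids[:target_len]
    # Repeat the sample until it covers the target, then cut to size.
    reps = -(-target_len // len(ids))  # ceiling division
    return (ids * reps)[:target_len]
```

Repeating rather than padding keeps the token distribution of the original sample, so throughput numbers at long lengths are not dominated by filler tokens.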

Rust side

Adds a new Criterion bench tokenizers/benches/matrix_benchmark.rs that sweeps the same (batch, input_length) matrix and measures:

  • encode_batch (with offsets)
  • encode_batch_fast (no offsets)
  • decode_batch — the prior suite (ci_benchmark, llama3_benchmark) had no decode bench and no parametric matrix.

Env-configurable via MATRIX_BATCH_SIZES / MATRIX_INPUT_LENGTHS. Registered in tokenizers/Cargo.toml.
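The Rust bench reads these env vars via `std::env`; the configuration scheme itself is simple. A Python equivalent (illustrative only, the exact parsing in the bench may differ):

```python
import os

def matrix_from_env(name, default):
    """Parse a comma-separated env var such as MATRIX_BATCH_SIZES=1,32,128
    into a list of ints, falling back to a default sweep when unset."""
    raw = os.environ.get(name)
    if not raw:
        return list(default)
    return [int(tok) for tok in raw.split(",") if tok.strip()]
```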

Headline findings (pinned 8 cores on AMD EPYC 7R13, llama-3 for apples-to-apples)

Python vs Rust overhead (bs=128, len=8192):

| phase | Python | Rust | python/rust |
| --- | --- | --- | --- |
| encode (fast) | 27.1 MB/s | 36.0 MB/s | 75% |
| decode | 9.9 Mtok/s | 58.5 Mtok/s | 17% |
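The python/rust column is simply Python's throughput as a percentage of Rust's. A sketch of the accounting, using the same units as the table (the exact bookkeeping in the bench may differ):

```python
def throughput_mb_s(num_bytes, seconds):
    # Bytes of input text processed per second, in decimal MB/s.
    return num_bytes / seconds / 1e6

def throughput_mtok_s(num_tokens, seconds):
    # Tokens processed per second, in millions.
    return num_tokens / seconds / 1e6

def python_over_rust(py, rust):
    # The python/rust column: Python throughput relative to Rust, in %.
    return round(100 * py / rust)
```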

10-model Python leaderboard — decode is the biggest optimization opportunity: iree beats tokenizers on decode by a consistent 5.5–7.5× across every non-OpenAI model (Qwen2.5/3, DeepSeek-V3, GLM-4.5, Mistral-Nemo, Yi-1.5, starcoder2, gpt-neox, falcon, Llama-3). On encode we are competitive or ahead on 6/10 models.
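The cross-model summary uses a geometric mean, which keeps one model with a very high absolute throughput from dominating the ranking. A minimal sketch of the aggregation (assumed, not the script's exact code):

```python
import math

def geo_mean(values):
    """Geometric mean of positive per-model throughput (or speed-ratio) values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```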

Full tables + raw JSON/logs: https://gist.github.com/ArthurZucker/b5f60b51af22ecd62b16939db25efc5f

Test plan

  • `pip install rich tiktoken iree-tokenizer wordchipper bpe-openai`, then `python bindings/python/benches/test_tiktoken.py -e cl100k_base -b 1 32 -l 128 1024 -p 6 --iters 2 --warmup 1`
  • `python bindings/python/benches/test_tiktoken.py --hf-models -b 1 32 -l 128 2048 --backends tokenizers iree tiktoken -p 8 --iters 2 --warmup 1 --save-md /tmp/out.md`
  • `cd tokenizers && make data/llama-3-tokenizer.json data/big.txt && cargo bench --bench matrix_benchmark -- --warm-up-time 1 --measurement-time 3`
  • Verify the new Criterion groups appear: `matrix/encode-batch`, `matrix/encode-batch-fast`, `matrix/decode-batch`.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Real-world (batch × input_length) tokenizer benchmark + cross-library leaderboard

Rewrites the tiktoken comparison bench into a standardized
(batch_size × input_length) sweep mirroring the knobs used by fastokens'
`examples/ablation.sh` and wordchipper's fineweb batch bench. Samples are
pulled from `zai-org/LongBench-v2` and truncated/repeated per prompt to hit
exact token lengths (same helper as fastokens' `_adjust_tokens`).

**Python side — `bindings/python/benches/test_tiktoken.py`**
- Five backends on a uniform encode/decode API, skipped gracefully if
  unavailable: `tokenizers`, `tiktoken`, `wordchipper`
  (https://github.com/zspacelabs/wordchipper), `iree.tokenizer`
  (https://github.com/iree-org/iree-tokenizer-py), `bpe` via `bpe-openai`
  (https://github.com/github/rust-gems).
- Accepts OpenAI encoding names (`cl100k_base`, `o200k_base`, `gpt2`, `llama3`)
  or any HF repo id. `--hf-models` iterates a list and prints a cross-model
  leaderboard.
- Both encode and decode are timed (best of warmup+iters); `rich` renders
  live colored tables with per-row winner and a geo-mean summary.
- Cross-backend correctness probe before timing.
- Fairness preflight (CPU model, load avg, pinned CPUs, governor) with an
  optional `--strict-fairness` abort above 50% nproc.
- `--save-json` / `--save-md` serialize full results + a markdown leaderboard.
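The correctness probe above amounts to encoding the same probe string with every available backend and comparing the resulting ids. A hypothetical sketch against the uniform API described (backend objects and method names are assumptions):

```python
def probe_agreement(backends, text):
    """Encode `text` with each backend and flag which agree with the first.

    `backends` maps a name to an object exposing an `encode(text)` method
    that returns a list of token ids (the uniform API described above).
    """
    results = {name: list(b.encode(text)) for name, b in backends.items()}
    reference = next(iter(results.values()))
    return {name: ids == reference for name, ids in results.items()}
```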

**Rust side — `tokenizers/benches/matrix_benchmark.rs`**
- New Criterion bench that sweeps the same (batch, input_length) matrix and
  measures `encode_batch`, `encode_batch_fast`, and **`decode_batch`** — the
  prior suite (`ci_benchmark`, `llama3_benchmark`) had no decode coverage
  and no parametric matrix.
- Matrix is env-configurable via `MATRIX_BATCH_SIZES`, `MATRIX_INPUT_LENGTHS`.
- Registered in `tokenizers/Cargo.toml`.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
