
Real-world (batch × input_length) tokenizer benchmark + cross-library leaderboard #2030

Open
ArthurZucker wants to merge 1 commit into main from full-bench

Conversation

@ArthurZucker
Collaborator

Summary

  • Rewrites bindings/python/benches/test_tiktoken.py into a standardized (batch_size × input_length) matrix bench mirroring fastokens and wordchipper ablation knobs. Samples are sourced from zai-org/LongBench-v2 and truncated/repeated to exact token lengths (same helper as fastokens' _adjust_tokens).
  • Five backends on a uniform encode/decode API, gracefully skipped when unavailable: tokenizers, tiktoken, wordchipper, iree.tokenizer, and bpe (via bpe-openai).
  • Both encode and decode are timed (best of warmup + iters). rich renders live colored tables with per-row winner and geo-mean summary panels.
  • Adds --hf-models … to sweep arbitrary HF repo ids (Qwen / DeepSeek / GLM-4.5 / Mistral-Nemo / Yi / starcoder2 / gpt-neox / falcon / Llama-3, etc.) and print a cross-model leaderboard.
  • Cross-backend correctness probe runs before timing. All five backends agree on the canonical probe for cl100k_base / o200k_base; tokenizers + tiktoken + iree agree on llama-3.
  • Fairness preflight prints CPU model, load avg, pinned CPUs, governor; --strict-fairness aborts above 50% nproc. Results can be saved via --save-json / --save-md.
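The truncate/repeat length adjustment mentioned above can be sketched as follows. This is a hypothetical re-implementation of the idea, not fastokens' actual `_adjust_tokens` helper, whose details may differ:

```python
def adjust_to_length(ids, target_len):
    """Truncate or repeat a list of token ids to an exact target length."""
    if not ids:
        raise ValueError("need at least one token id")
    if len(ids) >= target_len:
        return ids[:target_len]
    # Repeat the sample until it covers the target, then cut to size.
    reps = -(-target_len // len(ids))  # ceiling division
    return (ids * reps)[:target_len]
```

Repeating rather than padding keeps the token distribution of the original sample, so throughput numbers at long lengths are not dominated by filler tokens.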

Rust side

Adds a new Criterion bench tokenizers/benches/matrix_benchmark.rs that sweeps the same (batch, input_length) matrix and measures:

  • encode_batch (with offsets)
  • encode_batch_fast (no offsets)
  • decode_batch — the prior suite (ci_benchmark, llama3_benchmark) had no decode bench and no parametric matrix.

Env-configurable via MATRIX_BATCH_SIZES / MATRIX_INPUT_LENGTHS. Registered in tokenizers/Cargo.toml.
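The Rust bench reads these env vars via `std::env`; the configuration scheme itself is simple. A Python equivalent (illustrative only, the exact parsing in the bench may differ):

```python
import os

def matrix_from_env(name, default):
    """Parse a comma-separated env var such as MATRIX_BATCH_SIZES=1,32,128
    into a list of ints, falling back to a default sweep when unset."""
    raw = os.environ.get(name)
    if not raw:
        return list(default)
    return [int(tok) for tok in raw.split(",") if tok.strip()]
```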

Headline findings (pinned 8 cores on AMD EPYC 7R13, llama-3 for apples-to-apples)

Python vs Rust overhead (bs=128, len=8192):

| phase | Python | Rust | python/rust |
| --- | --- | --- | --- |
| encode (fast) | 27.1 MB/s | 36.0 MB/s | 75% |
| decode | 9.9 Mtok/s | 58.5 Mtok/s | 17% |
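The python/rust column is simply Python's throughput as a percentage of Rust's. A sketch of the accounting, using the same units as the table (the exact bookkeeping in the bench may differ):

```python
def throughput_mb_s(num_bytes, seconds):
    # Bytes of input text processed per second, in decimal MB/s.
    return num_bytes / seconds / 1e6

def throughput_mtok_s(num_tokens, seconds):
    # Tokens processed per second, in millions.
    return num_tokens / seconds / 1e6

def python_over_rust(py, rust):
    # The python/rust column: Python throughput relative to Rust, in %.
    return round(100 * py / rust)
```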

10-model Python leaderboard — decode is the biggest optimization opportunity: iree beats tokenizers on decode by a consistent 5.5–7.5× across every non-OpenAI model (Qwen2.5/3, DeepSeek-V3, GLM-4.5, Mistral-Nemo, Yi-1.5, starcoder2, gpt-neox, falcon, Llama-3). On encode we are competitive or ahead on 6/10 models.
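The cross-model summary uses a geometric mean, which keeps one model with a very high absolute throughput from dominating the ranking. A minimal sketch of the aggregation (assumed, not the script's exact code):

```python
import math

def geo_mean(values):
    """Geometric mean of positive per-model throughput (or speed-ratio) values."""
    return math.exp(sum(math.log(v) for v in values) / len(values))
```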

Full tables + raw JSON/logs: https://gist.github.com/ArthurZucker/b5f60b51af22ecd62b16939db25efc5f

Test plan

  • `pip install rich tiktoken iree-tokenizer wordchipper bpe-openai`, then `python bindings/python/benches/test_tiktoken.py -e cl100k_base -b 1 32 -l 128 1024 -p 6 --iters 2 --warmup 1`
  • `python bindings/python/benches/test_tiktoken.py --hf-models -b 1 32 -l 128 2048 --backends tokenizers iree tiktoken -p 8 --iters 2 --warmup 1 --save-md /tmp/out.md`
  • `cd tokenizers && make data/llama-3-tokenizer.json data/big.txt && cargo bench --bench matrix_benchmark -- --warm-up-time 1 --measurement-time 3`
  • Verify the new Criterion groups appear: `matrix/encode-batch`, `matrix/encode-batch-fast`, `matrix/decode-batch`.

🤖 Generated with Claude Code

Co-Authored-By: Claude Opus 4.7 (1M context) noreply@anthropic.com

Real-world (batch × input_length) tokenizer benchmark + cross-library leaderboard

Rewrites the tiktoken comparison bench into a standardized
(batch_size × input_length) sweep mirroring the knobs used by fastokens'
`examples/ablation.sh` and wordchipper's fineweb batch bench. Samples are
pulled from `zai-org/LongBench-v2` and truncated/repeated per prompt to hit
exact token lengths (same helper as fastokens' `_adjust_tokens`).

**Python side — `bindings/python/benches/test_tiktoken.py`**
- Five backends on a uniform encode/decode API, skipped gracefully if
  unavailable: `tokenizers`, `tiktoken`, `wordchipper`
  (https://github.com/zspacelabs/wordchipper), `iree.tokenizer`
  (https://github.com/iree-org/iree-tokenizer-py), `bpe` via `bpe-openai`
  (https://github.com/github/rust-gems).
- Accepts OpenAI encoding names (`cl100k_base`, `o200k_base`, `gpt2`, `llama3`)
  or any HF repo id. `--hf-models` iterates a list and prints a cross-model
  leaderboard.
- Both encode and decode are timed (best of warmup+iters); `rich` renders
  live colored tables with per-row winner and a geo-mean summary.
- Cross-backend correctness probe before timing.
- Fairness preflight (CPU model, load avg, pinned CPUs, governor) with an
  optional `--strict-fairness` abort above 50% nproc.
- `--save-json` / `--save-md` serialize full results + a markdown leaderboard.
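The correctness probe above amounts to encoding the same probe string with every available backend and comparing the resulting ids. A hypothetical sketch against the uniform API described (backend objects and method names are assumptions):

```python
def probe_agreement(backends, text):
    """Encode `text` with each backend and flag which agree with the first.

    `backends` maps a name to an object exposing an `encode(text)` method
    that returns a list of token ids (the uniform API described above).
    """
    results = {name: list(b.encode(text)) for name, b in backends.items()}
    reference = next(iter(results.values()))
    return {name: ids == reference for name, ids in results.items()}
```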

**Rust side — `tokenizers/benches/matrix_benchmark.rs`**
- New Criterion bench that sweeps the same (batch, input_length) matrix and
  measures `encode_batch`, `encode_batch_fast`, and **`decode_batch`** — the
  prior suite (`ci_benchmark`, `llama3_benchmark`) had no decode coverage
  and no parametric matrix.
- Matrix is env-configurable via `MATRIX_BATCH_SIZES`, `MATRIX_INPUT_LENGTHS`.
- Registered in `tokenizers/Cargo.toml`.
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
