feat: remove string fmt allocation in hot loop in BpeBuilder::build #2010

Merged
ArthurZucker merged 2 commits into main from feat/remove_allocation_bpebuilder_hot_loop
Apr 10, 2026

Conversation

@McPatate
Member

@McPatate McPatate commented Apr 7, 2026

I was running samply on encode_batch_fast and noticed that initialization took a long time. Digging deeper, I found that BpeBuilder::build was doing `let new_token = format!("{}{}", a, &b[prefix_len..]);` in a hot loop. Pre-allocating a buffer and writing into it directly resulted in a ~48% perf boost!
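A minimal sketch of the kind of change described (illustrative only, not the actual tokenizers code; `merge_tokens`, the sample pairs, and the fixed `prefix_len` are made up for this example):

```rust
// Instead of
//     let new_token = format!("{}{}", a, &b[prefix_len..]);
// which goes through the runtime formatting machinery on every iteration,
// build the token into an exactly-sized String with plain push_str calls.
// `merge_tokens` and its inputs are hypothetical, not the tokenizers API.
fn merge_tokens(pairs: &[(&str, &str)], prefix_len: usize) -> Vec<String> {
    pairs
        .iter()
        .map(|(a, b)| {
            let suffix = &b[prefix_len..];
            // Exact-size allocation: no reallocation, no formatter overhead.
            let mut new_token = String::with_capacity(a.len() + suffix.len());
            new_token.push_str(a);
            new_token.push_str(suffix);
            new_token
        })
        .collect()
}

fn main() {
    let pairs = [("he", "hello"), ("wo", "world")];
    println!("{:?}", merge_tokens(&pairs, 2)); // ["hello", "world"]
}
```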

Test script:

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error + Send + Sync>> {
    let tokenizer = Tokenizer::from_file("data/llama-3-tokenizer.json")?;
    let data = std::fs::read_to_string("data/big.txt")?;
    let lines: Vec<&str> = data.lines().collect();

    eprintln!("=== Batch encode_fast (lines) ===");
    let batch: Vec<_> = lines.clone();
    let _ = tokenizer.encode_batch_fast(batch, false)?;

    eprintln!("Done.");
    Ok(())
}

used:

$ cargo build --release --example profile_encode
$ samply record ./target/release/examples/profile_encode

to find & measure the change

Before:
(screenshot: samply profile, 2026-04-07 18:29)

After:
(screenshot: samply profile, 2026-04-07 18:27)

36ms -> 25ms, so ~44% faster (a ~31% reduction in wall time)

@McPatate McPatate requested a review from ArthurZucker April 7, 2026 16:29
@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs remain available for 30 days after the last update.

Collaborator

@ArthurZucker ArthurZucker left a comment


LGTM, down to adding a cargo bench for the full loading of a tokenizer from pretrained

@ArthurZucker
Collaborator

/benchmark

@github-actions

Python Benchmark results

Commit: b671e0d104c55542c8c791d3e94059e86ffd26ea

| Benchmark | Baseline (ms) | This run (ms) | Δ |
| --- | --- | --- | --- |
| test_async_encode_batch | 1305.2 | 1403.5 | +7.5% |
| test_async_encode_batch_fast | 1054.6 | 1129.2 | +7.1% |
| test_decode_batch | 2.4 | 2.8 | +20.6% |
| test_encode | 2545.9 | 2583.8 | +1.5% |
| test_encode_batch | 1301.0 | 1409.1 | +8.3% |
| test_encode_batch_multithreaded | 1289.6 | 1377.1 | +6.8% |
| test_encode_fast | 1043.3 | 1136.1 | +8.9% |
| test_from_file_albert | 45.4 | 49.6 | +9.3% |
| test_from_file_llama3 | 408.7 | 430.1 | +5.3% |
| test_from_file_roberta | 76.1 | 74.6 | -1.9% |
| test_from_str_llama3 | 389.0 | 412.2 | +6.0% |
| test_to_str_llama3 | 107.2 | 100.3 | -6.4% |
| test_train_bpe_small | 16.2 | 15.4 | -5.1% |

@github-actions
Copy link
Copy Markdown

Rust Benchmark results

Commit: b671e0d104c55542c8c791d3e94059e86ffd26ea

| Benchmark | Baseline (ns/iter) | This run (ns/iter) | Δ |
| --- | --- | --- | --- |
| bpe-gpt2/encode | 1815016018 | 1877243893 | +3% |
| bpe-gpt2/encode-batch | 883721924 | 859022111 | -2% |
| bpe-gpt2/encode-batch-no-cache | 1024733230 | 1005161750 | -1% |
| bpe-gpt2/encode-no-cache | 2345818394 | 2423803297 | +3% |
| llama3/concurrent-4t | 76814529 | 49998387 | -34% |
| llama3/encode | 1754898015 | 1757227046 | 0% |
| llama3/encode-batch | 867783684 | 849115206 | -2% |
| llama3/encode-char-offsets | 1067309310 | 1059360128 | 0% |
| llama3/encode-fast | 1672139715 | 1671739252 | 0% |
| serialization/bpe-from-file-gpt2 | 47651117 | 43944420 | -7% |
| serialization/deserialize-llama3 | 405279321 | 407202897 | 0% |
| serialization/deserialize-roberta | 74238789 | 75087068 | +1% |
| serialization/from-file-albert | 36663177 | 36297092 | 0% |
| serialization/from-file-llama3 | 371594895 | 369749169 | 0% |
| serialization/from-file-roberta | 62753817 | 63848716 | +1% |
| serialization/save-llama3 | 109097437 | 98212058 | -9% |
| train/bpe-small | 17622182 | 17126709 | -2% |

@ArthurZucker ArthurZucker merged commit 51a6e82 into main Apr 10, 2026
38 checks passed
@ArthurZucker ArthurZucker deleted the feat/remove_allocation_bpebuilder_hot_loop branch April 10, 2026 13:47
3 participants