Summary
On systems with 64+ CPU cores, encode_batch achieves significantly lower throughput than using direct par_iter() on the same tokenizer. The difference is 6x on a 128-core system.
Environment
- tokenizers: 0.22.2 (Rust crate)
- System: 128-core Intel Xeon (dual 8462Y+)
- OS: Linux 5.15
Reproduction
Update: the script below turned out to be a false positive and could not be reproduced. Please see the script in the original report #1900 (manual sharding of tokenizer instances / worker-group approach), which reproduces the issue consistently.
```rust
use tokenizers::Tokenizer;
use rayon::prelude::*;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_pretrained("Qwen/Qwen2.5-7B-Instruct", None)?;

    // Generate test data: 200 documents of ~100KB each
    let data: Vec<String> = (0..200)
        .map(|_| "hello world ".repeat(8333)) // ~100KB per doc
        .collect();
    let texts: Vec<&str> = data.iter().map(|s| s.as_str()).collect();

    // Method 1: encode_batch
    let start = Instant::now();
    let _ = tokenizer.encode_batch(texts.clone(), false)?;
    println!("encode_batch: {:?}", start.elapsed());

    // Method 2: direct par_iter on the same tokenizer instance
    let start = Instant::now();
    let _: Result<Vec<_>, _> = texts
        .par_iter()
        .map(|text| tokenizer.encode(*text, false))
        .collect();
    println!("par_iter: {:?}", start.elapsed());

    Ok(())
}
```

Add to Cargo.toml:
```toml
[dependencies]
tokenizers = { version = "0.22", features = ["http"] }
rayon = "1.10"
```

Results (64 threads, 200 docs × 100KB)
| Method | Time | Throughput |
|---|---|---|
| encode_batch | 5.04s | 2.6M tok/s |
| Direct par_iter | 0.83s | 15.8M tok/s |
| Speedup | 6.1x | 6.1x |
`perf stat` Analysis
| Metric | encode_batch | par_iter | Ratio |
|---|---|---|---|
| Elapsed time | 5.58s | 2.46s | 2.3x slower |
| Context switches | 107,100 | 21,404 | 5x more |
| CPU migrations | 11,710 | 683 | 17x more |
| Sys time | 17.7s | 3.6s | 4.9x more |
| Effective GHz | 1.09 | 2.03 | 1.9x lower |
| IPC | 1.07 | 1.42 | lower |
The high context switch and CPU migration counts suggest scheduling overhead in the parallelization wrapper.
Root Cause
encode_batch uses into_maybe_par_iter() (via the maybe_parallel feature and the rayon_cond crate):

```rust
// tokenizer/mod.rs, line ~1286
let mut encodings = inputs
    .into_maybe_par_iter()
    .map(|input| self.encode(input, add_special_tokens))
    .collect::<Result<Vec<Encoding>>>()?;
```

This wraps Rayon's parallel iterator in a CondIterator to support the TOKENIZERS_PARALLELISM environment variable. The wrapper appears to introduce significant overhead on high-core systems.
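For reference, the conditional wrapping amounts to roughly the following sketch (simplified, not the crate's exact internals; the env-var handling lives elsewhere in tokenizers and is reduced here to a plain bool):

```rust
use rayon_cond::CondIterator;

// Roughly what the maybe_parallel path does: a single boolean decides whether
// the underlying iterator is rayon's parallel one or a plain serial one, and
// every adapter call is dispatched through the CondIterator wrapper.
fn encode_lengths(texts: Vec<String>, parallel: bool) -> Vec<usize> {
    CondIterator::new(texts, parallel)
        .map(|t| t.len()) // stand-in for the real per-item encode call
        .collect()
}
```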
Note on TOKENIZERS_PARALLELISM
The TOKENIZERS_PARALLELISM feature is valuable for avoiding thread explosion in nested parallel contexts. The issue is not with the feature itself, but with the performance overhead of the current implementation at high core counts.
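As a usage note, both code paths of encode_batch can be compared by flipping the variable per run (for example TOKENIZERS_PARALLELISM=false in the shell), or programmatically before any encoding work. The snippet below is illustrative and assumes Rust edition 2021, where std::env::set_var is a safe call:

```rust
fn main() {
    // Set before any encoding work; assumption: the library reads this
    // variable when deciding whether to parallelize encode_batch.
    // (In edition 2024, std::env::set_var is an unsafe fn.)
    std::env::set_var("TOKENIZERS_PARALLELISM", "false");

    // ... load the tokenizer and run encode_batch as in the reproduction above ...
}
```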
Possible Non-Breaking Improvements
- Internal optimization: Investigate and optimize the rayon_cond / CondIterator implementation for high-core systems. This would be completely transparent to users.

- Check once and branch: Instead of wrapping each iterator operation, check TOKENIZERS_PARALLELISM once at the start of encode_batch and call either the parallel or the sequential code path (see the fuller sketch after this list):

  ```rust
  if current_parallelism() {
      inputs.into_par_iter().map(...).collect()
  } else {
      inputs.into_iter().map(...).collect()
  }
  ```

  This preserves the feature while avoiding per-operation wrapper overhead.

- Add an optional method (additive, non-breaking): Add encode_batch_direct() or similar that uses direct Rayon without the wrapper, for users who know they don't need nested parallelism protection.
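A slightly fuller sketch of the check-once-and-branch idea, written against plain rayon. Here parallelism_enabled() is a hypothetical stand-in for however the library currently interprets TOKENIZERS_PARALLELISM (the accepted "disable" values are an assumption), not an existing tokenizers API:

```rust
use rayon::prelude::*;

// Hypothetical helper: read TOKENIZERS_PARALLELISM once. The accepted
// "disable" spellings below are an assumption, not the library's exact rules.
fn parallelism_enabled() -> bool {
    std::env::var("TOKENIZERS_PARALLELISM")
        .map(|v| !matches!(v.to_lowercase().as_str(), "false" | "0" | "off"))
        .unwrap_or(true)
}

// Decide once, then run an unwrapped rayon (or plain serial) pipeline, so no
// per-adapter CondIterator dispatch is involved.
fn encode_lengths(texts: Vec<String>) -> Vec<usize> {
    if parallelism_enabled() {
        texts.into_par_iter().map(|t| t.len()).collect()
    } else {
        texts.into_iter().map(|t| t.len()).collect()
    }
}
```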
Workaround
Users can bypass encode_batch and use direct Rayon:
```rust
use rayon::prelude::*;

// texts: Vec<&str>, as in the reproduction above
let encodings: Result<Vec<_>, _> = texts
    .par_iter()
    .map(|text| tokenizer.encode(*text, false))
    .collect();
```
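On very large machines, the size of the global Rayon pool can also be capped explicitly when using this workaround; this is standard rayon API, shown as an optional tuning knob rather than something measured in this report:

```rust
use rayon::ThreadPoolBuilder;

// Optional: cap the global rayon pool. Must run before the pool is first used,
// otherwise build_global() returns an error.
ThreadPoolBuilder::new()
    .num_threads(64)
    .build_global()
    .expect("global rayon thread pool was already initialized");
```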