encode_batch has suboptimal parallelization on high-core systems #1929

@stargazerZJ

Description

Summary

On systems with 64+ CPU cores, encode_batch achieves significantly lower throughput than mapping encode over a direct par_iter() on the same tokenizer instance. On a 128-core system the difference is about 6x.

Environment

  • tokenizers: 0.22.2 (Rust crate)
  • System: 128-core Intel Xeon (dual 8462Y+)
  • OS: Linux 5.15

Reproduction

Update: the result from the script below turned out to be a false positive that I cannot reproduce. Please use the script in the original report #1900 (manual sharding of tokenizer instances / worker-group approach) instead; that one reproduces the issue consistently.

use tokenizers::Tokenizer;
use rayon::prelude::*;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_pretrained("Qwen/Qwen2.5-7B-Instruct", None)?;

    // Generate test data: 200 documents of 100KB each
    let data: Vec<String> = (0..200)
        .map(|_| "hello world ".repeat(8333))  // ~100KB per doc
        .collect();
    let texts: Vec<&str> = data.iter().map(|s| s.as_str()).collect();

    // Method 1: encode_batch
    let start = Instant::now();
    let _ = tokenizer.encode_batch(texts.clone(), false)?;
    println!("encode_batch: {:?}", start.elapsed());

    // Method 2: direct par_iter on the same tokenizer instance
    // (propagate errors with ? so a failure can't skew the timing, matching Method 1)
    let start = Instant::now();
    let _ = texts
        .par_iter()
        .map(|text| tokenizer.encode(*text, false))
        .collect::<Result<Vec<_>, _>>()?;
    println!("par_iter:     {:?}", start.elapsed());

    Ok(())
}

Add to Cargo.toml:

[dependencies]
tokenizers = { version = "0.22", features = ["http"] }
rayon = "1.10"
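
Build and run with cargo run --release; an unoptimized debug build dominates the timings and hides the gap between the two methods.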

Results (64 threads, 200 docs × 100KB)

| Method          | Time  | Throughput   |
| --------------- | ----- | ------------ |
| encode_batch    | 5.04s | 2.6M tok/s   |
| Direct par_iter | 0.83s | 15.8M tok/s  |
| Speedup         | 6.1x  |              |
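
Both runs tokenize the same input (≈13.1M tokens, consistent with both throughput figures: 2.6M tok/s × 5.04s ≈ 15.8M tok/s × 0.83s), so the 6.1x figure is simply the wall-clock ratio 5.04s / 0.83s.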

perf stat Analysis

| Metric           | encode_batch | par_iter | Ratio       |
| ---------------- | ------------ | -------- | ----------- |
| Elapsed time     | 5.58s        | 2.46s    | 2.3x slower |
| Context switches | 107,100      | 21,404   | 5x more     |
| CPU migrations   | 11,710       | 683      | 17x more    |
| Sys time         | 17.7s        | 3.6s     | 4.9x more   |
| Effective GHz    | 1.09         | 2.03     | 1.9x lower  |
| IPC              | 1.07         | 1.42     | lower       |

The high context switch and CPU migration counts suggest scheduling overhead in the parallelization wrapper.

Root Cause

encode_batch uses into_maybe_par_iter() (via the maybe_parallel feature and the rayon_cond crate):

// tokenizer/mod.rs line ~1286
let mut encodings = inputs
    .into_maybe_par_iter()
    .map(|input| self.encode(input, add_special_tokens))
    .collect::<Result<Vec<Encoding>>>()?;

This wraps Rayon's parallel iterator in CondIterator to support the TOKENIZERS_PARALLELISM environment variable. The wrapper appears to introduce significant overhead on high-core systems.

Note on TOKENIZERS_PARALLELISM

The TOKENIZERS_PARALLELISM feature is valuable for avoiding thread explosion in nested parallel contexts. The issue is not with the feature itself, but with the performance overhead of the current implementation at high core counts.
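
A minimal sketch of driving the switch from the repro above; TOKENIZERS_PARALLELISM is the documented environment variable, while toggling it through std::env here is just an illustration (launching the process with TOKENIZERS_PARALLELISM=false is equivalent):

use tokenizers::Tokenizer;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Same effect as launching with TOKENIZERS_PARALLELISM=false:
    // encode_batch takes the sequential path, leaving parallelism to the caller.
    std::env::set_var("TOKENIZERS_PARALLELISM", "false");

    let tokenizer = Tokenizer::from_pretrained("Qwen/Qwen2.5-7B-Instruct", None)?;
    let _ = tokenizer.encode_batch(vec!["hello world"], false)?;
    Ok(())
}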

Possible Non-Breaking Improvements

  1. Internal optimization: Investigate and optimize the rayon_cond / CondIterator implementation for high-core systems. This would be completely transparent to users.

  2. Check once and branch: Instead of wrapping each iterator operation, check TOKENIZERS_PARALLELISM once at the start of encode_batch and call either parallel or sequential code path:

    if current_parallelism() {
        inputs.into_par_iter().map(...).collect()
    } else {
        inputs.into_iter().map(...).collect()
    }

    This preserves the feature while avoiding per-operation wrapper overhead. A sketch of this approach follows the list.

  3. Add optional method (additive, non-breaking): Add encode_batch_direct() or similar that uses direct Rayon without the wrapper, for users who know they don't need nested parallelism protection.
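
For illustration, a hedged sketch of option 2 as it could look; the inline environment check is a stand-in for however the library exposes the TOKENIZERS_PARALLELISM test, and encode_batch_branching is a hypothetical name, not an existing API:

use rayon::prelude::*;
use tokenizers::{Encoding, Result, Tokenizer};

// Hypothetical helper: decide once whether to parallelize, then run a plain
// Rayon or plain sequential loop with no per-item CondIterator wrapping.
fn encode_batch_branching(
    tokenizer: &Tokenizer,
    texts: &[&str],
    add_special_tokens: bool,
) -> Result<Vec<Encoding>> {
    // Stand-in for the library's TOKENIZERS_PARALLELISM check.
    let parallel = std::env::var("TOKENIZERS_PARALLELISM")
        .map(|v| v != "false" && v != "0")
        .unwrap_or(true);

    if parallel {
        texts
            .par_iter()
            .map(|t| tokenizer.encode(*t, add_special_tokens))
            .collect()
    } else {
        texts
            .iter()
            .map(|t| tokenizer.encode(*t, add_special_tokens))
            .collect()
    }
}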

Workaround

Users can bypass encode_batch and use direct Rayon:

use rayon::prelude::*;

// texts: Vec<&str>, as in the repro above (note the deref: par_iter yields &&str)
let encodings: Result<Vec<_>, _> = texts
    .par_iter()
    .map(|text| tokenizer.encode(*text, false))
    .collect();
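
If the concern is thread explosion in nested parallel contexts (the scenario TOKENIZERS_PARALLELISM guards against), Rayon's standard global-pool configuration can bound the worker count explicitly; the thread count below is only an example value:

use rayon::ThreadPoolBuilder;

// Bound the global Rayon pool once at startup, before any parallel work runs.
ThreadPoolBuilder::new()
    .num_threads(16) // example value, tune for the workload
    .build_global()
    .expect("global Rayon pool already initialized");

The same bound can also be set without code via Rayon's RAYON_NUM_THREADS environment variable.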
