Summary
On systems with 64+ CPU cores, encode_batch achieves significantly lower throughput than using direct par_iter() on the same tokenizer. The difference is 6x on a 128-core system.
Environment
- tokenizers: 0.22.2 (Rust crate)
- System: 128-core Intel Xeon (dual 8462Y+)
- OS: Linux 5.15
Reproduction
Update: the script below turned out to be a false positive and could not be reproduced. Please see the script in the original report #1900 (manual sharding of tokenizer instances / worker-group approach), which reproduces the issue consistently.
```rust
use tokenizers::Tokenizer;
use rayon::prelude::*;
use std::time::Instant;

fn main() -> Result<(), Box<dyn std::error::Error>> {
    let tokenizer = Tokenizer::from_pretrained("Qwen/Qwen2.5-7B-Instruct", None)?;

    // Generate test data: 200 documents of ~100KB each
    let data: Vec<String> = (0..200)
        .map(|_| "hello world ".repeat(8333)) // ~100KB per doc
        .collect();
    let texts: Vec<&str> = data.iter().map(|s| s.as_str()).collect();

    // Method 1: encode_batch
    let start = Instant::now();
    let _ = tokenizer.encode_batch(texts.clone(), false)?;
    println!("encode_batch: {:?}", start.elapsed());

    // Method 2: direct par_iter on the same tokenizer instance
    let start = Instant::now();
    let _: Result<Vec<_>, _> = texts
        .par_iter()
        .map(|text| tokenizer.encode(*text, false))
        .collect();
    println!("par_iter: {:?}", start.elapsed());

    Ok(())
}
```

Add to Cargo.toml:
```toml
[dependencies]
tokenizers = { version = "0.22", features = ["http"] }
rayon = "1.10"
```

Results (64 threads, 200 docs × 100KB)
| Method | Time | Throughput |
|---|---|---|
| encode_batch | 5.04s | 2.6M tok/s |
| Direct par_iter | 0.83s | 15.8M tok/s |
| Speedup | 6.1x | 6.1x |
`perf stat` Analysis
| Metric | encode_batch | par_iter | Ratio |
|---|---|---|---|
| Elapsed time | 5.58s | 2.46s | 2.3x slower |
| Context switches | 107,100 | 21,404 | 5x more |
| CPU migrations | 11,710 | 683 | 17x more |
| Sys time | 17.7s | 3.6s | 4.9x more |
| Effective GHz | 1.09 | 2.03 | 1.9x lower |
| IPC | 1.07 | 1.42 | lower |
The high context switch and CPU migration counts suggest scheduling overhead in the parallelization wrapper.
Root Cause
encode_batch uses into_maybe_par_iter() (via the maybe_parallel feature and the rayon_cond crate):

```rust
// tokenizer/mod.rs, line ~1286
let mut encodings = inputs
    .into_maybe_par_iter()
    .map(|input| self.encode(input, add_special_tokens))
    .collect::<Result<Vec<Encoding>>>()?;
```

This wraps Rayon's parallel iterator in a CondIterator to support the TOKENIZERS_PARALLELISM environment variable. The wrapper appears to introduce significant overhead on high-core systems.
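For reference, the conditional wrapping amounts to roughly the following sketch (simplified, not the crate's exact internals; the env-var handling lives elsewhere in tokenizers and is reduced here to a plain bool):

```rust
use rayon_cond::CondIterator;

// Roughly what the maybe_parallel path does: a single boolean decides whether
// the underlying iterator is rayon's parallel one or a plain serial one, and
// every adapter call is dispatched through the CondIterator wrapper.
fn encode_lengths(texts: Vec<String>, parallel: bool) -> Vec<usize> {
    CondIterator::new(texts, parallel)
        .map(|t| t.len()) // stand-in for the real per-item encode call
        .collect()
}
```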
Note on TOKENIZERS_PARALLELISM
The TOKENIZERS_PARALLELISM feature is valuable for avoiding thread explosion in nested parallel contexts. The issue is not with the feature itself, but with the performance overhead of the current implementation at high core counts.
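As a usage note, both code paths of encode_batch can be compared by flipping the variable per run (for example TOKENIZERS_PARALLELISM=false in the shell), or programmatically before any encoding work. The snippet below is illustrative and assumes Rust edition 2021, where std::env::set_var is a safe call:

```rust
fn main() {
    // Set before any encoding work; assumption: the library reads this
    // variable when deciding whether to parallelize encode_batch.
    // (In edition 2024, std::env::set_var is an unsafe fn.)
    std::env::set_var("TOKENIZERS_PARALLELISM", "false");

    // ... load the tokenizer and run encode_batch as in the reproduction above ...
}
```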
Possible Non-Breaking Improvements
- Internal optimization: Investigate and optimize the rayon_cond / CondIterator implementation for high-core systems. This would be completely transparent to users.

- Check once and branch: Instead of wrapping each iterator operation, check TOKENIZERS_PARALLELISM once at the start of encode_batch and call either the parallel or the sequential code path (see the fuller sketch after this list):

  ```rust
  if current_parallelism() {
      inputs.into_par_iter().map(...).collect()
  } else {
      inputs.into_iter().map(...).collect()
  }
  ```

  This preserves the feature while avoiding per-operation wrapper overhead.

- Add an optional method (additive, non-breaking): Add encode_batch_direct() or similar that uses direct Rayon without the wrapper, for users who know they don't need nested parallelism protection.
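A slightly fuller sketch of the check-once-and-branch idea, written against plain rayon. Here parallelism_enabled() is a hypothetical stand-in for however the library currently interprets TOKENIZERS_PARALLELISM (the accepted "disable" values are an assumption), not an existing tokenizers API:

```rust
use rayon::prelude::*;

// Hypothetical helper: read TOKENIZERS_PARALLELISM once. The accepted
// "disable" spellings below are an assumption, not the library's exact rules.
fn parallelism_enabled() -> bool {
    std::env::var("TOKENIZERS_PARALLELISM")
        .map(|v| !matches!(v.to_lowercase().as_str(), "false" | "0" | "off"))
        .unwrap_or(true)
}

// Decide once, then run an unwrapped rayon (or plain serial) pipeline, so no
// per-adapter CondIterator dispatch is involved.
fn encode_lengths(texts: Vec<String>) -> Vec<usize> {
    if parallelism_enabled() {
        texts.into_par_iter().map(|t| t.len()).collect()
    } else {
        texts.into_iter().map(|t| t.len()).collect()
    }
}
```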
Workaround
Users can bypass encode_batch and use direct Rayon:
```rust
use rayon::prelude::*;

// texts: Vec<&str>, as in the reproduction above
let encodings: Result<Vec<_>, _> = texts
    .par_iter()
    .map(|text| tokenizer.encode(*text, false))
    .collect();
```
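On very large machines, the size of the global Rayon pool can also be capped explicitly when using this workaround; this is standard rayon API, shown as an optional tuning knob rather than something measured in this report:

```rust
use rayon::ThreadPoolBuilder;

// Optional: cap the global rayon pool. Must run before the pool is first used,
// otherwise build_global() returns an error.
ThreadPoolBuilder::new()
    .num_threads(64)
    .build_global()
    .expect("global rayon thread pool was already initialized");
```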