
Implementing Parity-aware BPE#1974

Open
cimeister wants to merge 1 commit into huggingface:main from Ahmetcanyvz:parity-aware-bpe

Conversation

@cimeister

@cimeister cimeister commented Mar 21, 2026

Add parity-aware BPE trainer

This PR adds a parity-aware BPE trainer, the algorithm proposed in Foroutan et al. (2025). Standard BPE tends to over-segment low-resource languages; parity-aware BPE addresses this by selecting merges that compress the least compressed language at each step, balancing compression across languages. This should improve cross-lingual fairness in tokenization (in terms of token counts per language).

The trainer produces a standard BPE model. The parity-aware logic only affects training, not inference. A trained tokenizer is fully compatible with the existing Tokenizer.from_file() / tokenizer.save() workflow.
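As a toy illustration of the selection rule described above (this is not the PR's Rust code; the data shapes are deliberately simplified):

```python
from collections import Counter

def parity_merge_step(train_words, dev_tokens):
    """One merge step of a toy parity-aware BPE.

    train_words: per-language dicts mapping a word (tuple of symbols)
                 to its count in that language's training corpus.
    dev_tokens:  per-language lists of dev-set token sequences.
    """
    # 'base' selection: the least-compressed language is the one whose
    # dev set currently needs the most tokens in total.
    lang = max(range(len(dev_tokens)),
               key=lambda i: sum(len(seq) for seq in dev_tokens[i]))
    # Pick the most frequent adjacent symbol pair in that language's
    # training counts, breaking ties lexicographically.
    pairs = Counter()
    for word, count in train_words[lang].items():
        for pair in zip(word, word[1:]):
            pairs[pair] += count
    best = min(pairs, key=lambda p: (-pairs[p], p))
    return lang, best
```

Repeating this step, recording each chosen pair as a merge, and re-tokenizing the dev sets after each merge yields an ordinary merge list, which is why the resulting model is plain BPE at inference time.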

What's included

Core Rust implementation (tokenizers/src/models/bpe/parity_trainer.rs, gated behind the optional parity-aware-bpe cargo feature, off by default in the tokenizers crate):

  • Multi-language BPE trainer with two selection variants:
    • base: selects the language with the longest total dev-set token length (or furthest from target compression ratio)
    • window: moving-window mechanism to prevent any language from dominating merge selections
  • feed_language_from_iter / feed_dev_language_from_iter — per-language analogues of BpeTrainer::feed with the same <I, S, F> generics including Send / Sync bounds for parallel iteration via maybe_par_bridge
  • No new dependencies in the core crate beyond the parity-aware-bpe = [] feature flag
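One possible shape for the window variant's selection step, sketched in Python (hypothetical: the PR only states that a moving window prevents any language from dominating, so the exact rule below is an assumption):

```python
from collections import deque

def pick_language_window(scores, recent, window_size):
    """Window-variant language selection (hypothetical sketch).

    scores: per-language need scores, e.g. dev-set token totals.
    recent: deque holding the languages picked in the last few steps.
    """
    # Exclude languages already chosen within the window, unless that
    # would exclude every language.
    candidates = [i for i in range(len(scores)) if i not in recent]
    if not candidates:
        candidates = list(range(len(scores)))
    lang = max(candidates, key=lambda i: scores[i])
    recent.append(lang)
    if len(recent) > window_size:
        recent.popleft()
    return lang
```

With `window_size=2` and scores `[10, 8, 6]`, repeated calls cycle through all three languages rather than always picking language 0.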

Python bindings (bindings/python/src/trainers.rs, on by default in the tokenizers-python crate):

  • ParityBpeTrainer class with a single training entry point:
    trainer.train_from_iterator(tokenizer, train_iterators, dev_iterators=None, ratio=None)
    the multi-corpus analogue of Tokenizer.train_from_iterator. File I/O, parquet decoding, and config parsing all happen in user-side Python code.
  • GIL released via py.detach around the actual training work, mirroring PyTokenizer::train_from_iterator
  • Pickle / repr / __getstate__ / __setstate__ support
  • No new dependencies in the binding crate

Design decisions

Why ParityBpeTrainer does not implement the Trainer trait. The Trainer::feed<I, S, F> method assumes a single-corpus workflow: it takes one iterator of sequences and accumulates them into a single internal word count map. Parity-aware BPE fundamentally requires separate, labeled per-language corpora — the language-selection heuristic operates on independent Vec<AHashMap<…>> statistics, not a merged map. Rather than providing a silently-broken feed() that would collapse all languages into one, ParityBpeTrainer exposes a parallel iterator API (feed_language_from_iter(lang_idx, iterator, process)) that mirrors Trainer::feed exactly but operates per language, and the Python binding exposes a dedicated train_from_iterator(tokenizer, train_iterators, …) method that handles the per-language orchestration.

For the same reason, ParityBpeTrainer is not added to TrainerWrapper (which is keyed by the Trainer trait).

No public "I already have word counts" API. BpeTrainer's words: AHashMap<CompactString, u64> field is private; the only way to populate it is via Trainer::feed(iter, process). ParityBpeTrainer follows the same pattern: the per-language Vec<AHashMap<CompactString, u64>> is internal state, populated only by feed_language_from_iter / feed_dev_language_from_iter. Map-taking helpers exist as #[cfg(test)]-only fixtures so unit tests can construct populated trainers from literals; they are not part of the public API.
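A toy Python stand-in for the per-language feed pattern (illustrative only; the real `feed_language_from_iter` is Rust and is reached through the binding's `train_from_iterator`):

```python
from collections import Counter

class ToyParityTrainer:
    """Illustrative stand-in: word counts live in one Counter per
    language and are only ever populated through per-language feeds,
    never merged into a single map."""

    def __init__(self, num_languages):
        self.words = [Counter() for _ in range(num_languages)]

    def feed_language_from_iter(self, lang_idx, iterator, process):
        # `process` mirrors the pre-tokenization callback: it turns a
        # raw sequence into a list of word strings.
        for sequence in iterator:
            for word in process(sequence):
                self.words[lang_idx][word] += 1

trainer = ToyParityTrainer(2)
trainer.feed_language_from_iter(0, ["the cat", "the dog"], str.split)
trainer.feed_language_from_iter(1, ["der Hund"], str.split)
```

A single merged map would lose exactly the per-language boundaries that the selection heuristic needs, which is the motivation for not implementing `Trainer::feed`.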

Known follow-ups

  • Feature parity with BpeTrainer.progress_format. The same line of work that extends BpeTrainer with Indicatif/JsonLines/Silent progress modes has not been mirrored on ParityBpeTrainer.

  • Performance of find_best_pair_linear and replace_pair_dev. These are currently O(unique-pairs) and O(unique-dev-words) per merge step respectively. They trade speed for exact lexicographic tie-breaking parity with the Python reference implementation. Building Pair → Vec<word_id> reverse indices would make this faster.

  • Rust test assertions. The 13 tests in tokenizers/src/models/bpe/parity_trainer.rs::tests use bare .unwrap(). Switching to .expect("descriptive message") would give self-describing CI failures.

  • CI matrix entry for the parity feature. Because parity_trainer is gated behind #[cfg(feature = "parity-aware-bpe")], cargo test -p tokenizers under default features does not run the 13 parity-specific tests. CI must explicitly run cargo test --features parity-aware-bpe to cover them.
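The reverse-index idea from the performance follow-up could look like the sketch below (names and data shapes are assumptions; a full version would also update the index incrementally as merges create and destroy pairs):

```python
from collections import defaultdict

def build_pair_index(words):
    # Map each adjacent symbol pair to the ids of words containing it,
    # so applying a merge only touches affected words instead of
    # scanning the whole corpus.
    index = defaultdict(set)
    for wid, word in enumerate(words):
        for pair in zip(word, word[1:]):
            index[pair].add(wid)
    return index

def apply_merge(words, index, pair):
    a, b = pair
    merged = a + b
    for wid in index.pop(pair, set()):
        word, out, i = words[wid], [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        words[wid] = tuple(out)
```

The trade-off named in the follow-up applies: the index costs extra memory proportional to the number of (pair, word) occurrences.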

Usage

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import ParityBpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# Plain text files: one iterator per language
def lines(path):
    with open(path) as f:
        yield from f

trainer = ParityBpeTrainer(num_merges=32000, variant="base")
trainer.train_from_iterator(
    tokenizer,
    train_iterators=[lines("en.txt"), lines("de.txt"), lines("zh.txt")],
    dev_iterators=[lines("en_dev.txt"), lines("de_dev.txt"), lines("zh_dev.txt")],
)
tokenizer.save("parity_bpe_tokenizer.json")

Parquet input via pyarrow (file I/O happens in Python, not in Rust):

import pyarrow.parquet as pq

def parquet_text_column(path, column="text"):
    for batch in pq.ParquetFile(path).iter_batches(columns=[column]):
        yield from batch.column(column).to_pylist()

trainer.train_from_iterator(
    tokenizer,
    train_iterators=[parquet_text_column(p) for p in lang_parquets],
    ratio=[1.0, 1.2, 0.9],  # alternative to dev_iterators
)

Verification

Unit tests (21 parity-specific tests, all passing):

  • Symmetric parity merging, dev-driven selection with inverted priorities
  • Window variant enforcing fairness, exhausted language handling
  • Global merge warmup, min frequency filtering
  • Ratio-based selection (base and window variants)
  • Partial dev files across languages (some languages with dev, some without)
  • Serialization roundtrip, ratio length mismatch error
  • Config parsing: optional ratio field, custom text columns, string-or-vec deserialization
  • Pickle/repr roundtrip for PyParityBpeTrainer

End-to-end training on a 10-language setup:

  • Training data: FineWeb2 (one parquet shard per language)
  • Languages: Arabic, Hindi, Chinese, Thai, Russian, Tamil, Korean, Swahili, Finnish, Khmer
  • Dev data: FLORES-200 parallel sentences
  • Pre-tokenization: GPT-4o regex (o200k_base pattern) + ByteLevel encoding
  • 64k merges, base variant

Two training runs were compared:

  1. Dev-set mode: language selection driven by FLORES dev-set token lengths
  2. Ratio mode: language selection driven by target compression ratios (derived from the same FLORES data)
    [figure: compression_rate_faceted — per-language compression rates]

We see that parity-aware BPE leads to much more even compression across languages than a standard BPE tokenizer (trained on the same data with the same pre-tokenization and vocab count). Per-language compression rates are nearly identical between the ratio and dev-set modes, confirming that ratio-based selection correctly approximates dev-set-driven selection. One language (Khmer) shows slightly lower compression than expected. This is attributable to pre-tokenization: the GPT-4o regex splits on whitespace boundaries that don't align with Khmer's writing system (which has no spaces between words), so the pre-tokenizer produces very long unsplit character sequences that compress less efficiently than languages with whitespace word boundaries.
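The pre-tokenization effect is easy to reproduce with a plain whitespace split (illustrative only; the actual o200k_base regex is far more elaborate):

```python
import re

# A whitespace-based pre-tokenizer yields several short chunks for
# English but a single long chunk for a script without spaces.
english = "no spaces between words"
khmer = "ភាសាខ្មែរ"  # written without inter-word spaces

assert len(re.findall(r"\S+", english)) == 4
assert len(re.findall(r"\S+", khmer)) == 1
```

The long unsplit chunk is what BPE then has to compress symbol by symbol, which explains the lower compression rate for Khmer.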

Citation

@article{foroutan-meister-et-al-2025-parity-aware-bpe,
  title={Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization},
  author={Foroutan, Negar and Meister, Clara and Paul, Debjit and Niklaus, Joel and Ahmadi, Sina and Bosselut, Antoine and Sennrich, Rico},
  url={https://arxiv.org/abs/2508.04796},
  booktitle={arXiv},
  year={2025}
}

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

Will check it out, looks promising!

Comment thread: bindings/python/Cargo.toml (Outdated)
Comment on lines +26 to +28
compact_str = "0.9"
parquet = { version = "55", default-features = false, features = ["arrow", "snap"] }
arrow = { version = "55", default-features = false }
Collaborator

thats a lot of deps!

Collaborator

@ArthurZucker ArthurZucker left a comment

in general we want to avoid growing deps and growing code in the core.

It would be better if we are able to integrate parity isolated as an optional feature!

Comment thread: tokenizers/Cargo.toml (Outdated)
Comment on lines +94 to +95
parquet = { version = "55", optional = true, default-features = false, features = ["arrow", "snap"] }
arrow = { version = "55", optional = true, default-features = false }
Collaborator

I don't think we want to add this many deps

Comment thread: train_tokenizer.py (Outdated)
Collaborator

this is in the wrong folder

Collaborator

I am not really sure this makes sense, word count is an abstract concept and this does not follow the standard API we have. We need to find another solution to this.

Parquet are natively supported as well afaik when you do tokenizer.encode/encode_batch.

Splits are determined with offset and encodings do return them. This is a big no

… I/O)

Addresses HF maintainer feedback on the original parity-aware-bpe submission:

1. Avoid growing core dependencies — gate the entire feature behind a new
   `parity-aware-bpe` cargo feature (off by default in the `tokenizers` crate,
   on by default in `tokenizers-python`). Drop the `parquet` and `arrow`
   dependencies and the standalone `parity_bpe_train` CLI binary entirely.
   The binding adds no new dependencies.

2. Conform to the standard trainer API — delete `parity_utils.rs` (the ad-hoc
   word-count helper that doubled as a parquet reader) and `parity_config.rs`
   (the JSON config schema). The Python binding's `PyParityBpeTrainer` exposes
   only `train_from_iterator(tokenizer, train_iterators, dev_iterators=, ratio=)`,
   the multi-corpus analogue of `Tokenizer.train_from_iterator`. File I/O,
   parquet decoding, and config parsing all move to user-side Python code.

3. The Rust trainer adds `feed_language_from_iter` /
   `feed_dev_language_from_iter` methods that mirror `BpeTrainer::feed`'s
   `<I, S, F>` generics including the `Send`/`Sync` bounds for parallel
   iteration. The pre-existing map-taking `feed_language` / `feed_dev_language`
   helpers are now `#[cfg(test)]`-only, matching the way `BpeTrainer`'s
   `words` field is private and only populated via `Trainer::feed`. There is
   no public path for "I already have aggregated word counts" — that
   abstraction is internal to the trainer, exactly as in `BpeTrainer`.

The binding's `train_from_iterator` releases the GIL via `py.detach` around
the actual training work, mirroring `PyTokenizer::train_from_iterator`.

Additional cleanup based on the same review pass:

- Validate `window_size > 0`, `alpha > 0`, and per-language ratio values in
  `do_train()` so misconfiguration fails loudly instead of producing garbage.
- Tighten `ParityBpeTrainer` field visibility from `pub` to `pub(crate)` to
  match `#[non_exhaustive]`.
- Promote "all languages exhausted" / "global-merge mode exhausted" log lines
  from `info!` to `warn!`, and add a post-loop warning when fewer merges were
  produced than the (`total_symbols`-adjusted) target.
- Add 7 Python tests covering the parity API (instantiation defaults and
  variants, train_from_iterator with and without dev / with ratio, pickle
  round-trip, length-mismatch error).
@cimeister
Author

Ok, I tried to address both of your comments and took care of some other stuff while at it. Current status:

On the deps / "isolate as an optional feature" point: there's now a parity-aware-bpe cargo feature, off by default in the core tokenizers crate so a basic cargo build should be identical to before. The Python crate keeps it on by default (via a default = ["ext-module", "parity-aware-bpe"] line), i.e., pip install tokenizers still ships the trainer. Users who want just the core binding can opt out though. The parquet and arrow deps are gone entirely, the standalone parity_bpe_train CLI binary is gone, and neither tokenizers/Cargo.toml nor bindings/python/Cargo.toml add any new dependencies, only the feature flag. The parity modules in bpe/mod.rs are gated. Just FYI, I didn't see any parquet support in tokenizer.encode/encode_batch. tokenizer.encode(input, …) and tokenizer.encode_batch(inputs, …) don't natively read parquet. They take strings (or pre-tokenized lists), not file paths. But I'm fine leaving that to the user if that's the preference/convention

On the "word count is an abstract concept and this doesn't follow the standard API" point: parity_utils.rs is gone now. I looked at BpeTrainer and saw its words: AHashMap<…> field is private and the only way to populate it is via Trainer::feed(iter, process). So I tried to match that pattern: feed_language and feed_dev_language (the map-taking methods that were public before) are now #[cfg(test)]-only test fixtures. The only public way to feed ParityBpeTrainer is feed_language_from_iter / feed_dev_language_from_iter, which mirror BpeTrainer::feed's <I: Iterator<Item = S> + Send, S: AsRef<str> + Send, F: Fn(&str) -> Result<Vec<String>> + Sync> signature exactly, just with an extra lang_idx: usize parameter. I hope I understood this the way you meant it...

On the parquet side: I ended up removing PyParityBpeTrainer::train() entirely, both the train_files=[...] and the config=... paths. The single Python entry point is now trainer.train_from_iterator(tokenizer, train_iterators, dev_iterators=None, ratio=None). Basically a multi-corpus analogue of Tokenizer.train_from_iterator. File I/O, parquet decoding, JSON config parsing, all of that lives in user-side Python now.

Stuff I changed proactively:

  • Added input validation in do_train: fails with a clear error if window_size == 0, if alpha is non-finite or non-positive, or if any ratio element is non-finite/non-positive or the ratio length doesn't match the number of languages.
  • Bumped two info! log lines about "languages exhausted" up to warn!, since "your training is going to produce a smaller vocab than you asked for" is something the user probably wants to see.
  • Added a post-loop warning when the total number of merges produced is less than what was targeted (with the total_symbols=true adjustment correctly applied; there was a false-positive bug here from the first pass).
  • Tightened all the pub fields on ParityBpeTrainer to pub(crate). They were pub + #[non_exhaustive] which is a contradictory pair anyway.
  • The Python train_from_iterator releases the GIL via py.detach(...) around the actual training work, mirroring how PyTokenizer::train_from_iterator does it. Iterator buffering still happens with the GIL held (PyBufferedIterator needs it), but feed + do_train + with_model + add_special_tokens all run GIL-released so other Python threads aren't blocked for the duration of training. I'm by no means an expert here (mostly stuff that Claude pointed out) so please lmk if this is the wrong protocol
  • Rebased the whole thing onto current HF main rather than onto the old fork tip, and squashed everything into one logical commit so you don't have to wade through 75 commits of intermediate edits (sorry about before with the ugly merge).
  • Rewrote the Python tests to match the new API: 7 tests now (test_instantiate_defaults, test_instantiate_variants, test_train_from_iterator, test_train_from_iterator_with_dev, test_train_from_iterator_with_ratio, test_can_pickle, test_train_iterators_dev_iterators_length_mismatch). If any of these are overkill, lmk and I can remove them

A couple of design choices that I'd like to flag explicitly so you can push back if you disagree:

  1. Why this is a separate trainer instead of an extension to BpeTrainer (the merge selection is the only thing that's actually different; the output is a vanilla BPE model and should be fully compatible with Tokenizer.from_file()). I considered adding a languages field to BpeTrainer and conditionally branching, but it would have meant changing BpeTrainer's public API and Trainer::feed's contract for what looks like a niche use case. Keeping it as a separate trainer feels lower-risk. Happy to revisit if you'd rather have it merged in.
  2. Why ParityBpeTrainer doesn't implement the Trainer trait. Trainer::feed<I, S, F> takes a single iterator. Parity-aware BPE needs Vec<AHashMap<…>> indexed by language. There's currently no way to express "this sequence belongs to language 2 of 30" through a single-iterator interface, which the algorithm needs. My solution was to instead have the trainer sit outside TrainerWrapper and expose feed_language_from_iter directly, with train_from_iterator in the binding handling the per-language orchestration. The shape of feed_language_from_iter is identical to Trainer::feed modulo the lang_idx parameter, so it's not a different convention
  3. Why the binding takes Vec rather than Iterator. The latter would let you stream languages in dynamically but it'd be a pain for the common case (you have N files / N pyarrow streams) and it'd require carrying language identity inside the items somehow. Vec matches how users actually structure multilingual data and matches how Tokenizer.train_from_iterator accepts a list of sources.

Stuff I deliberately left for follow-up PRs, in roughly descending order of how much I think it's worth doing:

  • progress_format parity with BpeTrainer. The Indicatif / JsonLines / Silent modes recently added to BpeTrainer aren't mirrored on ParityBpeTrainer.
  • find_best_pair_linear and replace_pair_dev are O(unique pairs) and O(unique dev words) per merge step respectively. This was the only way to ensure exact lexicographic tie-break parity with the Python reference implementation, as far as I could tell. On realistic dev sets this is tractable, but it would dominate training time on multi-million-word dev sets. We could build Pair → Vec<word_id> reverse indices (faster but more memory) though.
  • The 15 Rust unit tests use bare .unwrap(). Should perhaps switch to .expect("descriptive message") to give descriptive CI failures.
  • CI matrix entry: cargo test -p tokenizers under default features doesn't run the parity tests now (because they're behind the feature gate), so CI needs an explicit cargo test --features parity-aware-bpe run.

@cimeister
Author

Hey @ArthurZucker :) Just wanted to check in about whether you had time to look at the new changes. Sorry for the bother, we were just hoping to have this out before ACL releases conference papers
