
Implementing Parity-aware BPE#1974

Open
cimeister wants to merge 1 commit into huggingface:main from Ahmetcanyvz:parity-aware-bpe

Conversation

@cimeister

@cimeister cimeister commented Mar 21, 2026

Add parity-aware BPE trainer

This PR adds a parity-aware BPE trainer, the algorithm proposed in Foroutan et al. (2025). Standard BPE tends to over-segment low-resource languages; parity-aware BPE addresses this by selecting merges that compress the least compressed language at each step, balancing compression across languages. This should improve cross-lingual fairness in tokenization (in terms of token counts per language).

The trainer produces a standard BPE model. The parity-aware logic only affects training, not inference. A trained tokenizer is fully compatible with the existing Tokenizer.from_file() / tokenizer.save() workflow.
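As a toy illustration of the selection rule described above (this is not the PR's Rust code; the data shapes are deliberately simplified):

```python
from collections import Counter

def parity_merge_step(train_words, dev_tokens):
    """One merge step of a toy parity-aware BPE.

    train_words: per-language dicts mapping a word (tuple of symbols)
                 to its count in that language's training corpus.
    dev_tokens:  per-language lists of dev-set token sequences.
    """
    # 'base' selection: the least-compressed language is the one whose
    # dev set currently needs the most tokens in total.
    lang = max(range(len(dev_tokens)),
               key=lambda i: sum(len(seq) for seq in dev_tokens[i]))
    # Pick the most frequent adjacent symbol pair in that language's
    # training counts, breaking ties lexicographically.
    pairs = Counter()
    for word, count in train_words[lang].items():
        for pair in zip(word, word[1:]):
            pairs[pair] += count
    best = min(pairs, key=lambda p: (-pairs[p], p))
    return lang, best
```

Repeating this step, recording each chosen pair as a merge, and re-tokenizing the dev sets after each merge yields an ordinary merge list, which is why the resulting model is plain BPE at inference time.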

What's included

Core Rust implementation (tokenizers/src/models/bpe/parity_trainer.rs, gated behind the optional parity-aware-bpe cargo feature, off by default in the tokenizers crate):

  • Multi-language BPE trainer with two selection variants:
    • base: selects the language with the longest total dev-set token length (or furthest from target compression ratio)
    • window: moving-window mechanism to prevent any language from dominating merge selections
  • feed_language_from_iter / feed_dev_language_from_iter — per-language analogues of BpeTrainer::feed with the same <I, S, F> generics including Send / Sync bounds for parallel iteration via maybe_par_bridge
  • No new dependencies in the core crate beyond the parity-aware-bpe = [] feature flag
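One possible shape for the window variant's selection step, sketched in Python (hypothetical: the PR only states that a moving window prevents any language from dominating, so the exact rule below is an assumption):

```python
from collections import deque

def pick_language_window(scores, recent, window_size):
    """Window-variant language selection (hypothetical sketch).

    scores: per-language need scores, e.g. dev-set token totals.
    recent: deque holding the languages picked in the last few steps.
    """
    # Exclude languages already chosen within the window, unless that
    # would exclude every language.
    candidates = [i for i in range(len(scores)) if i not in recent]
    if not candidates:
        candidates = list(range(len(scores)))
    lang = max(candidates, key=lambda i: scores[i])
    recent.append(lang)
    if len(recent) > window_size:
        recent.popleft()
    return lang
```

With `window_size=2` and scores `[10, 8, 6]`, repeated calls cycle through all three languages rather than always picking language 0.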

Python bindings (bindings/python/src/trainers.rs, on by default in the tokenizers-python crate):

  • ParityBpeTrainer class with a single training entry point:
    trainer.train_from_iterator(tokenizer, train_iterators, dev_iterators=None, ratio=None)
    the multi-corpus analogue of Tokenizer.train_from_iterator. File I/O, parquet decoding, and config parsing all happen in user-side Python code.
  • GIL released via py.detach around the actual training work, mirroring PyTokenizer::train_from_iterator
  • Pickle / repr / __getstate__ / __setstate__ support
  • No new dependencies in the binding crate

Design decisions

Why ParityBpeTrainer does not implement the Trainer trait. The Trainer::feed<I, S, F> method assumes a single-corpus workflow: it takes one iterator of sequences and accumulates them into a single internal word count map. Parity-aware BPE fundamentally requires separate, labeled per-language corpora — the language-selection heuristic operates on independent Vec<AHashMap<…>> statistics, not a merged map. Rather than providing a silently-broken feed() that would collapse all languages into one, ParityBpeTrainer exposes a parallel iterator API (feed_language_from_iter(lang_idx, iterator, process)) that mirrors Trainer::feed exactly but operates per language, and the Python binding exposes a dedicated train_from_iterator(tokenizer, train_iterators, …) method that handles the per-language orchestration.

For the same reason, ParityBpeTrainer is not added to TrainerWrapper (which is keyed by the Trainer trait).

No public "I already have word counts" API. BpeTrainer's words: AHashMap<CompactString, u64> field is private; the only way to populate it is via Trainer::feed(iter, process). ParityBpeTrainer follows the same pattern: the per-language Vec<AHashMap<CompactString, u64>> is internal state, populated only by feed_language_from_iter / feed_dev_language_from_iter. Map-taking helpers exist as #[cfg(test)]-only fixtures so unit tests can construct populated trainers from literals; they are not part of the public API.
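A toy Python stand-in for the per-language feed pattern (illustrative only; the real `feed_language_from_iter` is Rust and is reached through the binding's `train_from_iterator`):

```python
from collections import Counter

class ToyParityTrainer:
    """Illustrative stand-in: word counts live in one Counter per
    language and are only ever populated through per-language feeds,
    never merged into a single map."""

    def __init__(self, num_languages):
        self.words = [Counter() for _ in range(num_languages)]

    def feed_language_from_iter(self, lang_idx, iterator, process):
        # `process` mirrors the pre-tokenization callback: it turns a
        # raw sequence into a list of word strings.
        for sequence in iterator:
            for word in process(sequence):
                self.words[lang_idx][word] += 1

trainer = ToyParityTrainer(2)
trainer.feed_language_from_iter(0, ["the cat", "the dog"], str.split)
trainer.feed_language_from_iter(1, ["der Hund"], str.split)
```

A single merged map would lose exactly the per-language boundaries that the selection heuristic needs, which is the motivation for not implementing `Trainer::feed`.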

Known follow-ups

  • Feature parity with BpeTrainer.progress_format. The same line of work that extends BpeTrainer with Indicatif/JsonLines/Silent progress modes has not been mirrored on ParityBpeTrainer.

  • Performance of find_best_pair_linear and replace_pair_dev. These are currently O(unique-pairs) and O(unique-dev-words) per merge step respectively. They trade speed for exact lexicographic tie-breaking parity with the Python reference implementation. Building Pair → Vec<word_id> reverse indices would make this faster.

  • Rust test assertions. The 13 tests in tokenizers/src/models/bpe/parity_trainer.rs::tests use bare .unwrap(). Switching to .expect("descriptive message") would give self-describing CI failures.

  • CI matrix entry for the parity feature. Because parity_trainer is gated behind #[cfg(feature = "parity-aware-bpe")], cargo test -p tokenizers under default features does not run the 13 parity-specific tests. CI must explicitly run cargo test --features parity-aware-bpe to cover them.
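The reverse-index idea from the performance follow-up could look like the sketch below (names and data shapes are assumptions; a full version would also update the index incrementally as merges create and destroy pairs):

```python
from collections import defaultdict

def build_pair_index(words):
    # Map each adjacent symbol pair to the ids of words containing it,
    # so applying a merge only touches affected words instead of
    # scanning the whole corpus.
    index = defaultdict(set)
    for wid, word in enumerate(words):
        for pair in zip(word, word[1:]):
            index[pair].add(wid)
    return index

def apply_merge(words, index, pair):
    a, b = pair
    merged = a + b
    for wid in index.pop(pair, set()):
        word, out, i = words[wid], [], 0
        while i < len(word):
            if i + 1 < len(word) and (word[i], word[i + 1]) == (a, b):
                out.append(merged)
                i += 2
            else:
                out.append(word[i])
                i += 1
        words[wid] = tuple(out)
```

The trade-off named in the follow-up applies: the index costs extra memory proportional to the number of (pair, word) occurrences.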

Usage

from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import ParityBpeTrainer

tokenizer = Tokenizer(BPE())
tokenizer.pre_tokenizer = ByteLevel()

# Plain text files: one iterator per language
def lines(path):
    with open(path) as f:
        yield from f

trainer = ParityBpeTrainer(num_merges=32000, variant="base")
trainer.train_from_iterator(
    tokenizer,
    train_iterators=[lines("en.txt"), lines("de.txt"), lines("zh.txt")],
    dev_iterators=[lines("en_dev.txt"), lines("de_dev.txt"), lines("zh_dev.txt")],
)
tokenizer.save("parity_bpe_tokenizer.json")

Parquet input via pyarrow (file I/O happens in Python, not in Rust):

import pyarrow.parquet as pq

def parquet_text_column(path, column="text"):
    for batch in pq.ParquetFile(path).iter_batches(columns=[column]):
        yield from batch.column(column).to_pylist()

trainer.train_from_iterator(
    tokenizer,
    train_iterators=[parquet_text_column(p) for p in lang_parquets],
    ratio=[1.0, 1.2, 0.9],  # alternative to dev_iterators
)

Verification

Unit tests (21 parity-specific tests, all passing):

  • Symmetric parity merging, dev-driven selection with inverted priorities
  • Window variant enforcing fairness, exhausted language handling
  • Global merge warmup, min frequency filtering
  • Ratio-based selection (base and window variants)
  • Partial dev files across languages (some languages with dev, some without)
  • Serialization roundtrip, ratio length mismatch error
  • Config parsing: optional ratio field, custom text columns, string-or-vec deserialization
  • Pickle/repr roundtrip for PyParityBpeTrainer

End-to-end training on a 10-language setup:

  • Training data: FineWeb2 (one parquet shard per language)
  • Languages: Arabic, Hindi, Chinese, Thai, Russian, Tamil, Korean, Swahili, Finnish, Khmer
  • Dev data: FLORES-200 parallel sentences
  • Pre-tokenization: GPT-4o regex (o200k_base pattern) + ByteLevel encoding
  • 64k merges, base variant

Two training runs were compared:

  1. Dev-set mode: language selection driven by FLORES dev-set token lengths
  2. Ratio mode: language selection driven by target compression ratios (derived from the same FLORES data)
    [figure: compression_rate_faceted — per-language compression rates]

We see that parity-aware BPE leads to much more even compression across languages than a standard BPE tokenizer (trained on the same data with the same pre-tokenization and vocab count). Per-language compression rates are nearly identical between the ratio and dev-set modes, confirming that ratio-based selection correctly approximates dev-set-driven selection. One language (Khmer) shows slightly lower compression than expected. This is attributable to pre-tokenization: the GPT-4o regex splits on whitespace boundaries that don't align with Khmer's writing system (which has no spaces between words), so the pre-tokenizer produces very long unsplit character sequences that compress less efficiently than languages with whitespace word boundaries.
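The pre-tokenization effect is easy to reproduce with a plain whitespace split (illustrative only; the actual o200k_base regex is far more elaborate):

```python
import re

# A whitespace-based pre-tokenizer yields several short chunks for
# English but a single long chunk for a script without spaces.
english = "no spaces between words"
khmer = "ភាសាខ្មែរ"  # written without inter-word spaces

assert len(re.findall(r"\S+", english)) == 4
assert len(re.findall(r"\S+", khmer)) == 1
```

The long unsplit chunk is what BPE then has to compress symbol by symbol, which explains the lower compression rate for Khmer.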

Citation

@article{foroutan-meister-et-al-2025-parity-aware-bpe,
  title={Parity-Aware Byte-Pair Encoding: Improving Cross-lingual Fairness in Tokenization},
  author={Foroutan, Negar and Meister, Clara and Paul, Debjit and Niklaus, Joel and Ahmadi, Sina and Bosselut, Antoine and Sennrich, Rico},
  url={https://arxiv.org/abs/2508.04796},
  booktitle={arXiv},
  year={2025}
}

@HuggingFaceDocBuilderDev

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

@ArthurZucker
Collaborator

Will check it out, looks promising!

Comment thread: bindings/python/Cargo.toml (Outdated)
Comment on lines +26 to +28
compact_str = "0.9"
parquet = { version = "55", default-features = false, features = ["arrow", "snap"] }
arrow = { version = "55", default-features = false }
Collaborator

thats a lot of deps!

Collaborator

@ArthurZucker ArthurZucker left a comment

in general we want to avoid growing deps and growing code in the core.

It would be better if we are able to integrate parity isolated as an optional feature!

Comment thread: tokenizers/Cargo.toml (Outdated)
Comment on lines +94 to +95
parquet = { version = "55", optional = true, default-features = false, features = ["arrow", "snap"] }
arrow = { version = "55", optional = true, default-features = false }
Collaborator

I don't think we want to add this many deps

Comment thread: train_tokenizer.py (Outdated)
Collaborator

this is in the wrong folder

Collaborator

I am not really sure this makes sense, word count is an abstract concept and this does not follow the standard API we have. We need to find another solution to this.

Parquet are natively supported as well afaik when you do tokenizer.encode/encode_batch.

Splits are determined with offset and encodings do return them. This is a big no

… I/O)

Addresses HF maintainer feedback on the original parity-aware-bpe submission:

1. Avoid growing core dependencies — gate the entire feature behind a new
   `parity-aware-bpe` cargo feature (off by default in the `tokenizers` crate,
   on by default in `tokenizers-python`). Drop the `parquet` and `arrow`
   dependencies and the standalone `parity_bpe_train` CLI binary entirely.
   The binding adds no new dependencies.

2. Conform to the standard trainer API — delete `parity_utils.rs` (the ad-hoc
   word-count helper that doubled as a parquet reader) and `parity_config.rs`
   (the JSON config schema). The Python binding's `PyParityBpeTrainer` exposes
   only `train_from_iterator(tokenizer, train_iterators, dev_iterators=, ratio=)`,
   the multi-corpus analogue of `Tokenizer.train_from_iterator`. File I/O,
   parquet decoding, and config parsing all move to user-side Python code.

3. The Rust trainer adds `feed_language_from_iter` /
   `feed_dev_language_from_iter` methods that mirror `BpeTrainer::feed`'s
   `<I, S, F>` generics including the `Send`/`Sync` bounds for parallel
   iteration. The pre-existing map-taking `feed_language` / `feed_dev_language`
   helpers are now `#[cfg(test)]`-only, matching the way `BpeTrainer`'s
   `words` field is private and only populated via `Trainer::feed`. There is
   no public path for "I already have aggregated word counts" — that
   abstraction is internal to the trainer, exactly as in `BpeTrainer`.

The binding's `train_from_iterator` releases the GIL via `py.detach` around
the actual training work, mirroring `PyTokenizer::train_from_iterator`.

Additional cleanup based on the same review pass:

- Validate `window_size > 0`, `alpha > 0`, and per-language ratio values in
  `do_train()` so misconfiguration fails loudly instead of producing garbage.
- Tighten `ParityBpeTrainer` field visibility from `pub` to `pub(crate)` to
  match `#[non_exhaustive]`.
- Promote "all languages exhausted" / "global-merge mode exhausted" log lines
  from `info!` to `warn!`, and add a post-loop warning when fewer merges were
  produced than the (`total_symbols`-adjusted) target.
- Add 7 Python tests covering the parity API (instantiation defaults and
  variants, train_from_iterator with and without dev / with ratio, pickle
  round-trip, length-mismatch error).
@cimeister
Author

Ok, I tried to address both of your comments and took care of some other stuff while at it. Current status:

On the deps / "isolate as an optional feature" point: there's now a parity-aware-bpe cargo feature, off by default in the core tokenizers crate so a basic cargo build should be identical to before. The Python crate keeps it on by default (via a default = ["ext-module", "parity-aware-bpe"] line), i.e., pip install tokenizers still ships the trainer. Users who want just the core binding can opt out though. The parquet and arrow deps are gone entirely, the standalone parity_bpe_train CLI binary is gone, and neither tokenizers/Cargo.toml nor bindings/python/Cargo.toml add any new dependencies, only the feature flag. The parity modules in bpe/mod.rs are gated. Just FYI, I didn't see any parquet support in tokenizer.encode/encode_batch. tokenizer.encode(input, …) and tokenizer.encode_batch(inputs, …) don't natively read parquet. They take strings (or pre-tokenized lists), not file paths. But I'm fine leaving that to the user if that's the preference/convention

On the "word count is an abstract concept and this doesn't follow the standard API" point: parity_utils.rs is gone now. I looked at BpeTrainer and saw its words: AHashMap<…> field is private and the only way to populate it is via Trainer::feed(iter, process). So I tried to match that pattern: feed_language and feed_dev_language (the map-taking methods that were public before) are now #[cfg(test)]-only test fixtures. The only public way to feed ParityBpeTrainer is feed_language_from_iter / feed_dev_language_from_iter, which mirror BpeTrainer::feed's <I: Iterator<Item = S> + Send, S: AsRef<str> + Send, F: Fn(&str) -> Result<Vec<String>> + Sync> signature exactly, just with an extra lang_idx: usize parameter. I hope I understood this the way you meant it...

On the parquet side: I ended up removing PyParityBpeTrainer::train() entirely, both the train_files=[...] and the config=... paths. The single Python entry point is now trainer.train_from_iterator(tokenizer, train_iterators, dev_iterators=None, ratio=None). Basically a multi-corpus analogue of Tokenizer.train_from_iterator. File I/O, parquet decoding, JSON config parsing, all of that lives in user-side Python now.

Stuff I changed proactively:

  • Added input validation in do_train: fails with a clear error if window_size == 0, if alpha is non-finite or non-positive, or if any ratio element is non-finite/non-positive or the ratio length doesn't match the number of languages.
  • Bumped two info! log lines about "languages exhausted" up to warn!, since "your training is going to produce a smaller vocab than you asked for" is something the user probably wants to see.
  • Added a post-loop warning when the total number of merges produced is less than what was targeted (with the total_symbols=true adjustment correctly applied; there was a false-positive bug here from the first pass).
  • Tightened all the pub fields on ParityBpeTrainer to pub(crate). They were pub + #[non_exhaustive] which is a contradictory pair anyway.
  • The Python train_from_iterator releases the GIL via py.detach(...) around the actual training work, mirroring how PyTokenizer::train_from_iterator does it. Iterator buffering still happens with the GIL held (PyBufferedIterator needs it), but feed + do_train + with_model + add_special_tokens all run GIL-released so other Python threads aren't blocked for the duration of training. I'm by no means an expert here (mostly stuff that Claude pointed out) so please lmk if this is the wrong protocol
  • Rebased the whole thing onto current HF main rather than onto the old fork tip, and squashed everything into one logical commit so you don't have to wade through 75 commits of intermediate edits (sorry about before with the ugly merge).
  • Rewrote the Python tests to match the new API: 7 tests now (test_instantiate_defaults, test_instantiate_variants, test_train_from_iterator, test_train_from_iterator_with_dev, test_train_from_iterator_with_ratio, test_can_pickle, test_train_iterators_dev_iterators_length_mismatch). If any of these are overkill, lmk and I can remove them

A couple of design choices that I'd like to flag explicitly so you can push back if you disagree:

  1. Why this is a separate trainer instead of an extension to BpeTrainer (the merge selection is the only thing that's actually different; the output is a vanilla BPE model and should be fully compatible with Tokenizer.from_file()). I considered adding a languages field to BpeTrainer and conditionally branching, but it would have meant changing BpeTrainer's public API and Trainer::feed's contract for what looks like a niche use case. Keeping it as a separate trainer feels lower-risk. Happy to revisit if you'd rather have it merged in.
  2. Why ParityBpeTrainer doesn't implement the Trainer trait. Trainer::feed<I, S, F> takes a single iterator. Parity-aware BPE needs Vec<AHashMap<…>> indexed by language. There's currently no way to express "this sequence belongs to language 2 of 30" through a single-iterator interface, which the algorithm needs. My solution was to instead have the trainer sit outside TrainerWrapper and expose feed_language_from_iter directly, with train_from_iterator in the binding handling the per-language orchestration. The shape of feed_language_from_iter is identical to Trainer::feed modulo the lang_idx parameter, so it's not a different convention
  3. Why the binding takes Vec rather than Iterator. The latter would let you stream languages in dynamically but it'd be a pain for the common case (you have N files / N pyarrow streams) and it'd require carrying language identity inside the items somehow. Vec matches how users actually structure multilingual data and matches how Tokenizer.train_from_iterator accepts a list of sources.

Stuff I deliberately left for follow-up PRs, in roughly descending order of how much I think it's worth doing:

  • progress_format parity with BpeTrainer. The Indicatif / JsonLines / Silent modes recently added to BpeTrainer aren't mirrored on ParityBpeTrainer.
  • find_best_pair_linear and replace_pair_dev are O(unique pairs) and O(unique dev words) per merge step respectively. This was the only way to ensure exact lexicographic tie-break parity with the Python reference implementation, as far as I could tell. On realistic dev sets this is tractable, but it would dominate training time on multi-million-word dev sets. We could build Pair → Vec<word_id> reverse indices (faster but more memory) though.
  • The 15 Rust unit tests use bare .unwrap(). Should perhaps switch to .expect("descriptive message") to give descriptive CI failures.
  • CI matrix entry: cargo test -p tokenizers under default features doesn't run the parity tests now (because they're behind the feature gate), so CI needs an explicit cargo test --features parity-aware-bpe run.

@cimeister
Author

Hey @ArthurZucker :) Just wanted to check in about whether you had time to look at the new changes. Sorry for the bother, we were just hoping to have this out before ACL releases conference papers
