Conversation
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

Will check it out, looks promising!
```toml
compact_str = "0.9"
parquet = { version = "55", default-features = false, features = ["arrow", "snap"] }
arrow = { version = "55", default-features = false }
```
ArthurZucker left a comment:
In general we want to avoid growing deps and growing code in the core.
It would be better if we can integrate parity in isolation, as an optional feature!
```toml
parquet = { version = "55", optional = true, default-features = false, features = ["arrow", "snap"] }
arrow = { version = "55", optional = true, default-features = false }
```
I don't think we want to add this many deps
this is in the wrong folder
I am not really sure this makes sense: word count is an abstract concept, and this does not follow the standard API we have. We need to find another solution to this.
Parquet is natively supported as well, afaik, when you do tokenizer.encode/encode_batch.
Splits are determined with offsets, and encodings do return them. This is a big no.
… I/O)

Addresses HF maintainer feedback on the original parity-aware-bpe submission:

1. Avoid growing core dependencies — gate the entire feature behind a new `parity-aware-bpe` cargo feature (off by default in the `tokenizers` crate, on by default in `tokenizers-python`). Drop the `parquet` and `arrow` dependencies and the standalone `parity_bpe_train` CLI binary entirely. The binding adds no new dependencies.
2. Conform to the standard trainer API — delete `parity_utils.rs` (the ad-hoc word-count helper that doubled as a parquet reader) and `parity_config.rs` (the JSON config schema). The Python binding's `PyParityBpeTrainer` exposes only `train_from_iterator(tokenizer, train_iterators, dev_iterators=, ratio=)`, the multi-corpus analogue of `Tokenizer.train_from_iterator`. File I/O, parquet decoding, and config parsing all move to user-side Python code.
3. The Rust trainer adds `feed_language_from_iter` / `feed_dev_language_from_iter` methods that mirror `BpeTrainer::feed`'s `<I, S, F>` generics, including the `Send`/`Sync` bounds for parallel iteration. The pre-existing map-taking `feed_language` / `feed_dev_language` helpers are now `#[cfg(test)]`-only, matching the way `BpeTrainer`'s `words` field is private and only populated via `Trainer::feed`. There is no public path for "I already have aggregated word counts" — that abstraction is internal to the trainer, exactly as in `BpeTrainer`. The binding's `train_from_iterator` releases the GIL via `py.detach` around the actual training work, mirroring `PyTokenizer::train_from_iterator`.

Additional cleanup based on the same review pass:

- Validate `window_size > 0`, `alpha > 0`, and per-language ratio values in `do_train()` so misconfiguration fails loudly instead of producing garbage.
- Tighten `ParityBpeTrainer` field visibility from `pub` to `pub(crate)` to match `#[non_exhaustive]`.
- Promote "all languages exhausted" / "global-merge mode exhausted" log lines from `info!` to `warn!`, and add a post-loop warning when fewer merges were produced than the (`total_symbols`-adjusted) target.
- Add 7 Python tests covering the parity API (instantiation defaults and variants, `train_from_iterator` with and without dev / with ratio, pickle round-trip, length-mismatch error).
Force-pushed from e21c998 to b5c453d.
Ok, I tried to address both of your comments and took care of some other stuff while at it. Current status:

On the deps / "isolate as an optional feature" point: there's now a `parity-aware-bpe` cargo feature, off by default in the core `tokenizers` crate, so a basic `cargo build` should be identical to before. The Python crate keeps it on by default (via a `default = ["ext-module", "parity-aware-bpe"]` line), i.e., `pip install tokenizers` still ships the trainer; users who want just the core binding can opt out, though. The `parquet` and `arrow` deps are gone entirely, the standalone `parity_bpe_train` CLI binary is gone, and neither `tokenizers/Cargo.toml` nor `bindings/python/Cargo.toml` adds any new dependencies, only the feature flag. The parity modules in `bpe/mod.rs` are gated. Just FYI, I didn't see any parquet support in

On the "word count is an abstract concept and this doesn't follow the standard API" point: `parity_utils.rs` is gone now. I looked at `BpeTrainer` and saw its `words: AHashMap<…>` field is private, and the only way to populate it is via `Trainer::feed(iter, process)`. So I tried to match that pattern: `feed_language` and `feed_dev_language` (the map-taking methods that were public before) are now `#[cfg(test)]`-only test fixtures. The only public way to feed `ParityBpeTrainer` is `feed_language_from_iter` / `feed_dev_language_from_iter`, which mirror `BpeTrainer::feed`'s `<I: Iterator<Item = S> + Send, S: AsRef<str> + Send, F: Fn(&str) -> Result<Vec<String>> + Sync>` signature exactly, just with an extra `lang_idx: usize` parameter. I hope I understood this the way you meant it...

On the parquet side: I ended up removing `PyParityBpeTrainer::train()` entirely, both the `train_files=[...]` and the `config=...` paths. The single Python entry point is now `trainer.train_from_iterator(tokenizer, train_iterators, dev_iterators=None, ratio=None)`, basically a multi-corpus analogue of `Tokenizer.train_from_iterator`. File I/O, parquet decoding, JSON config parsing: all of that lives in user-side Python now.
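The feature gating described above can be sketched roughly as follows (two Cargo.toml files shown as one fragment, with the second commented out to keep the snippet valid; surrounding entries elided):

```toml
# tokenizers/Cargo.toml (core crate): the feature exists, is off by
# default, and pulls in no new dependencies.
[features]
parity-aware-bpe = []

# bindings/python/Cargo.toml (Python crate): on by default, so
# `pip install tokenizers` still ships the trainer; users can opt out.
# [features]
# default = ["ext-module", "parity-aware-bpe"]
```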
Stuff I changed proactively:
A couple of design choices that I'd like to flag explicitly so you can push back if you disagree:
Stuff I deliberately left for follow-up PRs, in roughly descending order of how much I think it's worth doing:
Hey @ArthurZucker :) Just wanted to check in about whether you've had time to look at the new changes. Sorry for the bother; we were just hoping to have this out before ACL releases conference papers.
Add parity-aware BPE trainer
This PR adds a parity-aware BPE trainer, the algorithm proposed in Foroutan et al. (2025). Standard BPE tends to over-segment low-resource languages; parity-aware BPE addresses this by selecting merges that compress the least compressed language at each step, balancing compression across languages. This should improve cross-lingual fairness in tokenization (in terms of token counts per language).
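As a schematic illustration of that selection rule, here is a sketch of one merge step in plain Python. This is not the PR's Rust implementation: the function names and the tokens-per-character proxy for "least compressed" are my own, and a real trainer would also apply the chosen merge and update statistics.

```python
from collections import Counter

def least_compressed(dev_texts, dev_token_seqs):
    """Index of the language currently worst compressed under the
    current vocabulary: most tokens emitted per character of dev text."""
    def rate(i):
        n_tokens = sum(len(seq) for seq in dev_token_seqs[i])
        n_chars = sum(len(t) for t in dev_texts[i]) or 1
        return n_tokens / n_chars
    return max(range(len(dev_texts)), key=rate)

def best_pair(token_seqs):
    """Most frequent adjacent symbol pair in one language's corpus:
    the merge candidate parity-aware BPE takes for the chosen language."""
    pairs = Counter()
    for seq in token_seqs:
        pairs.update(zip(seq, seq[1:]))
    return pairs.most_common(1)[0][0] if pairs else None
```

Standard BPE would compute `best_pair` over the pooled corpus; the parity-aware variant restricts it to the output of `least_compressed`, which is what balances compression across languages.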
The trainer produces a standard BPE model; the parity-aware logic only affects training, not inference. A trained tokenizer is fully compatible with the existing `Tokenizer.from_file()` / `tokenizer.save()` workflow.

What's included
Core Rust implementation (`tokenizers/src/models/bpe/parity_trainer.rs`, gated behind the optional `parity-aware-bpe` cargo feature, off by default in the `tokenizers` crate):

- `feed_language_from_iter` / `feed_dev_language_from_iter` — per-language analogues of `BpeTrainer::feed` with the same `<I, S, F>` generics, including `Send`/`Sync` bounds for parallel iteration via `maybe_par_bridge`
- `parity-aware-bpe = []` feature flag

Python bindings (`bindings/python/src/trainers.rs`, on by default in the `tokenizers-python` crate):

- `ParityBpeTrainer` class with a single training entry point, the multi-corpus analogue of `Tokenizer.train_from_iterator`; file I/O, parquet decoding, and config parsing all happen in user-side Python code
- Releases the GIL via `py.detach` around the actual training work, mirroring `PyTokenizer::train_from_iterator`
- `__getstate__` / `__setstate__` support

Design decisions
**Why `ParityBpeTrainer` does not implement the `Trainer` trait.** The `Trainer::feed<I, S, F>` method assumes a single-corpus workflow: it takes one iterator of sequences and accumulates them into a single internal word count map. Parity-aware BPE fundamentally requires separate, labeled per-language corpora — the language-selection heuristic operates on independent `Vec<AHashMap<…>>` statistics, not a merged map. Rather than providing a silently-broken `feed()` that would collapse all languages into one, `ParityBpeTrainer` exposes a parallel iterator API (`feed_language_from_iter(lang_idx, iterator, process)`) that mirrors `Trainer::feed` exactly but operates per language, and the Python binding exposes a dedicated `train_from_iterator(tokenizer, train_iterators, …)` method that handles the per-language orchestration.

For the same reason, `ParityBpeTrainer` is not added to `TrainerWrapper` (which is keyed by the `Trainer` trait).

**No public "I already have word counts" API.** `BpeTrainer`'s `words: AHashMap<CompactString, u64>` field is private; the only way to populate it is via `Trainer::feed(iter, process)`. `ParityBpeTrainer` follows the same pattern: the per-language `Vec<AHashMap<CompactString, u64>>` is internal state, populated only by `feed_language_from_iter` / `feed_dev_language_from_iter`. Map-taking helpers exist as `#[cfg(test)]`-only fixtures so unit tests can construct populated trainers from literals; they are not part of the public API.

Known follow-ups
**Feature parity with `BpeTrainer.progress_format`.** The same line of work that extends `BpeTrainer` with `Indicatif`/`JsonLines`/`Silent` progress modes has not been mirrored on `ParityBpeTrainer`.

**Performance of `find_best_pair_linear` and `replace_pair_dev`.** These are currently O(unique-pairs) and O(unique-dev-words) per merge step, respectively. They trade speed for exact lexicographic tie-breaking parity with the Python reference implementation. Building `Pair → Vec<word_id>` reverse indices would make them faster.

**Rust test assertions.** The 13 tests in `tokenizers/src/models/bpe/parity_trainer.rs::tests` use bare `.unwrap()`. Switching to `.expect("descriptive message")` would give self-describing CI failures.

**CI matrix entry for the parity feature.** Because `parity_trainer` is gated behind `#[cfg(feature = "parity-aware-bpe")]`, `cargo test -p tokenizers` under default features does not run the 13 parity-specific tests. CI must explicitly run `cargo test --features parity-aware-bpe` to cover them.

Usage
Parquet input is handled via `pyarrow`; file I/O happens in Python, not in Rust.

Verification
Unit tests (21 parity-specific tests, all passing), plus the `PyParityBpeTrainer` Python tests described above.

End-to-end training on a 10-language setup:
Two training runs were compared:
We see that parity-aware BPE leads to much more even compression across languages than a standard BPE tokenizer (trained on the same data with the same pre-tokenization and vocab size). Per-language compression rates are nearly identical between the ratio and dev-set modes, confirming that ratio-based selection correctly approximates dev-set-driven selection. One language (Khmer) shows slightly lower compression than expected. This is attributable to pre-tokenization: the GPT-4o regex splits on whitespace boundaries that don't align with Khmer's writing system (no spaces between words), so the pre-tokenizer produces very long unsplit character sequences that compress less efficiently than languages with whitespace word boundaries.
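The per-language comparison above can be reproduced with a small helper. The characters-per-token metric here is my choice of compression proxy, not necessarily the exact metric used in the runs; `Encoding.ids` is the standard `tokenizers` accessor.

```python
def compression_rate(tokenizer, texts):
    """Average characters covered per emitted token; higher = better compression."""
    n_chars = sum(len(t) for t in texts)
    n_tokens = sum(len(tokenizer.encode(t).ids) for t in texts)
    return n_chars / n_tokens

def per_language_rates(tokenizer, corpora):
    """corpora: dict mapping language name -> list of held-out texts."""
    return {lang: compression_rate(tokenizer, texts)
            for lang, texts in corpora.items()}
```

Running this on the same held-out texts for both the parity-aware and the standard tokenizer gives the per-language numbers whose spread the paragraph above compares.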
Citation