Summary
`apr tokenize apply` produces `vocab.json` + `merges.txt` (raw BPE artifacts), but entrenar expects `tokenizer.json` (the complete HuggingFace format). A manual Python conversion step is currently needed.
Current State
- `apr tokenize apply` outputs `vocab.json` (32,768 entries) + `merges.txt` (32,518 merges)
- entrenar loads via `aprender::text::bpe::qwen2::load_from_json()`, which expects a HuggingFace `tokenizer.json` with `model.vocab` (HashMap) + `model.merges` (Vec)
- Manual conversion is currently done via the Python `tokenizers` library with a `Split(pattern=' ', behavior='removed')` pre-tokenizer and a `BPEDecoder(suffix='</w>')` decoder
- Merges must be in string format (`"i n"`), not array format (`["i", "n"]`)
Root Cause
aprender's BPE implementation uses:
- `split_whitespace()` pre-tokenizer (Rust's `str::split_whitespace`)
- `</w>` end-of-word suffix
- Pure Rust BPE (not the HuggingFace `tokenizers` crate)
But the save format is the raw training artifacts, not the HuggingFace interchange format.
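As a sketch, the `tokenizer.json` shape that `load_from_json()` is described as expecting would look roughly like this. Field names come from the issue itself; the exact serialization of the pre-tokenizer and decoder sections is an assumption, and real HuggingFace files carry additional fields (`version`, `added_tokens`, `normalizer`, etc.):

```python
import json

# Hypothetical minimal tokenizer.json payload (illustrative, not exhaustive).
tokenizer_json = {
    "model": {
        "type": "BPE",
        "vocab": {"i": 0, "n": 1, "in</w>": 2},  # model.vocab -> HashMap<String, u32>
        "merges": ["i n"],                        # model.merges -> Vec<String>, string format
        "end_of_word_suffix": "</w>",
    },
    "pre_tokenizer": {"type": "Split", "behavior": "Removed"},  # assumed serialization
    "decoder": {"type": "BPEDecoder", "suffix": "</w>"},
}
print(json.dumps(tokenizer_json)[:30])
```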
Options
A. Add `apr tokenize export --format hf` to produce `tokenizer.json` directly
B. Add entrenar support for loading from `vocab.json` + `merges.txt`
C. Add a converter to alimentar: `alimentar tokenizer convert vocab.json merges.txt -o tokenizer.json`
Option A is preferred — keep the tokenizer pipeline self-contained in apr/aprender.
Workaround
Python script to convert (currently used in albor):
```python
import json
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load the raw artifacts produced by `apr tokenize apply`
vocab = json.load(open('vocab.json'))
merges = [tuple(line.split()) for line in open('merges.txt')
          if line.strip() and not line.startswith('#')]

bpe = models.BPE(vocab=vocab, merges=merges, end_of_word_suffix='</w>')
tokenizer = Tokenizer(bpe)
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=' ', behavior='removed')
tokenizer.decoder = decoders.BPEDecoder(suffix='</w>')
tokenizer.save('tokenizer.json')

# Convert merges from array format (["i", "n"]) to string format ("i n")
# for aprender compatibility
data = json.load(open('tokenizer.json'))
data['model']['merges'] = [' '.join(m) if isinstance(m, list) else m
                           for m in data['model']['merges']]
json.dump(data, open('tokenizer.json', 'w'), ensure_ascii=False)
```

Additional Note
The whitespace-split pre-tokenizer normalizes all whitespace (newlines, indentation) to single spaces. This is a significant limitation for Python code models. Consider adding a ByteLevel or character-preserving pre-tokenizer option.
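The collapse can be reproduced in plain Python, since `str.split()` with no arguments follows the same whitespace rule as Rust's `str::split_whitespace`:

```python
# Whitespace splitting collapses newlines and indentation, so a round trip
# through the tokenizer cannot reconstruct Python source layout.
src = "def f():\n    return 1\n"
tokens = src.split()   # same segmentation rule as Rust's str::split_whitespace
print(tokens)          # ['def', 'f():', 'return', '1']
rejoined = ' '.join(tokens)
print(rejoined)        # def f(): return 1  -- indentation and newlines are gone
```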
Impact
Blocks: ALB-028 (Phase 2 training pipeline) until workaround applied
Severity: Medium (workaround exists)
Components: aprender, entrenar, apr