
ALB-033: apr tokenize → entrenar tokenizer format gap #31

Summary

apr tokenize apply produces vocab.json + merges.txt (raw BPE training artifacts), while entrenar expects tokenizer.json (the complete HuggingFace format). A manual Python conversion step is currently required to bridge the gap.

Current State

  • apr tokenize apply outputs: vocab.json (32,768 entries) + merges.txt (32,518 merges)
  • entrenar loads via aprender::text::bpe::qwen2::load_from_json(), which expects a HuggingFace tokenizer.json with model.vocab (HashMap) and model.merges (Vec); see the sketch after this list
  • Manual conversion is required via the Python tokenizers library, with:
    • a Split(pattern=' ', behavior='removed') pre-tokenizer
    • a BPEDecoder(suffix='</w>') decoder
    • merges in string format ("i n"), not array format (["i", "n"])
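
For reference, a minimal sketch of the tokenizer.json shape the entrenar loader expects, assuming the standard HuggingFace BPE layout (the vocab entries and merges shown are illustrative, not taken from the real artifacts):

{
  "model": {
    "type": "BPE",
    "end_of_word_suffix": "</w>",
    "vocab": { "i": 0, "n": 1, "in</w>": 2 },
    "merges": ["i n"]
  },
  "pre_tokenizer": { "type": "Split", "pattern": { "String": " " }, "behavior": "Removed" },
  "decoder": { "type": "BPEDecoder", "suffix": "</w>" }
}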

Root Cause

aprender's BPE implementation uses:

  • a split_whitespace() pre-tokenizer (Rust's str::split_whitespace)
  • a </w> end-of-word suffix
  • pure Rust BPE (not the HuggingFace tokenizers crate)

However, its save format consists of the raw training artifacts, not the HuggingFace interchange format.

Options

A. Add apr tokenize export --format hf to produce tokenizer.json directly
B. Add entrenar support for loading from vocab.json + merges.txt
C. Add a converter to alimentar: alimentar tokenizer convert vocab.json merges.txt -o tokenizer.json

Option A is preferred — keep the tokenizer pipeline self-contained in apr/aprender.

Workaround

Python conversion script (currently used in albor), shown here with the artifact loading made explicit:

import json
from tokenizers import Tokenizer, models, pre_tokenizers, decoders

# Load the raw artifacts produced by apr tokenize apply
with open('vocab.json') as f:
    vocab = json.load(f)
with open('merges.txt') as f:  # skip blank lines and any "#version" header
    merges = [tuple(line.split()) for line in f if line.strip() and not line.startswith('#')]
bpe = models.BPE(vocab=vocab, merges=merges, end_of_word_suffix='</w>')
tokenizer = Tokenizer(bpe)
tokenizer.pre_tokenizer = pre_tokenizers.Split(pattern=' ', behavior='removed')
tokenizer.decoder = decoders.BPEDecoder(suffix='</w>')
tokenizer.save('tokenizer.json')
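
Depending on the installed tokenizers version, the saved tokenizer.json may serialize model.merges as arrays (["i", "n"]) rather than the string form aprender expects. A minimal post-save rewrite, assuming that layout:

import json

with open('tokenizer.json') as f:
    tok = json.load(f)
# Rewrite array-form merges (["i", "n"]) into the string form ("i n") aprender expects
tok['model']['merges'] = [m if isinstance(m, str) else ' '.join(m) for m in tok['model']['merges']]
with open('tokenizer.json', 'w') as f:
    json.dump(tok, f, ensure_ascii=False)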

Additional Note

The whitespace-split pre-tokenizer normalizes all whitespace (newlines, indentation) to single spaces. This is a significant limitation for Python code models. Consider adding a ByteLevel or character-preserving pre-tokenizer option.
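
For illustration (hypothetical input, not from the albor corpus), splitting on whitespace discards exactly the structure Python source depends on:

# Whitespace splitting (the equivalent of Rust's str::split_whitespace)
src = "def f(x):\n    return x + 1\n"
print(src.split())
# ['def', 'f(x):', 'return', 'x', '+', '1']  -- newlines and indentation are gone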

Impact

Blocks: ALB-028 (Phase 2 training pipeline) until the workaround is applied
Severity: Medium (a workaround exists)
Components: aprender, entrenar, apr

Metadata


Labels

dogfooding — Issues found during albor dogfooding
gap — Upstream gap identified by Albor
