podarok commented Dec 26, 2025

Summary

Add a progress_format option to BpeTrainer that allows choosing between different progress output formats:

  • Indicatif (default): Interactive terminal progress bars (current behavior, unchanged)
  • JsonLines: Machine-readable JSON lines to stderr
  • Silent: No progress output

This enables programmatic consumption of training progress for web UIs, logging systems, and other non-TTY environments where indicatif progress bars are not visible.

Motivation

When running tokenizer training in a web application backend or logging environment, the indicatif progress bars:

  1. Don't display (auto-hidden when stderr is not a TTY)
  2. Can't be parsed programmatically even if captured

This PR adds an opt-in JSON output mode that emits structured progress data:

{"stage":"Tokenize words","current":1000,"total":5000000}
{"stage":"Count pairs","current":500,"total":5000000}
{"stage":"Compute merges","current":30000,"total":65536}
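
For illustration, a consumer of this output could look roughly like the following. This is a minimal sketch, not code from this PR: `ProgressEvent` and `parse_progress_lines` are hypothetical names, and it assumes every record carries exactly the `stage`, `current`, and `total` keys shown above. Any other stderr line is skipped.

```python
import json
from typing import Iterable, Iterator, NamedTuple


class ProgressEvent(NamedTuple):
    """One decoded progress record (hypothetical helper type)."""
    stage: str
    current: int
    total: int

    @property
    def fraction(self) -> float:
        # Guard against a zero total so callers never divide by zero.
        return self.current / self.total if self.total else 0.0


def parse_progress_lines(lines: Iterable[str]) -> Iterator[ProgressEvent]:
    """Yield a ProgressEvent for each well-formed JSON progress line."""
    for line in lines:
        line = line.strip()
        if not line.startswith("{"):
            continue  # not a JSON object, e.g. a warning or log message
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # truncated or otherwise malformed line
        if not isinstance(record, dict):
            continue
        try:
            yield ProgressEvent(
                stage=str(record["stage"]),
                current=int(record["current"]),
                total=int(record["total"]),
            )
        except (KeyError, TypeError, ValueError):
            continue  # JSON object without the expected fields
```

Skipping unparseable lines rather than raising is a deliberate choice here, since stderr may also carry warnings unrelated to progress.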

Changes

Rust Core

  • Add ProgressFormat enum to tokenizers/src/utils/progress.rs
  • Export ProgressFormat from tokenizers/src/utils/mod.rs and tokenizers/src/lib.rs
  • Add progress_format field and .progress_format() builder method to BpeTrainer
  • Modify setup_progress() to only create indicatif bar when format is Indicatif
  • Add emit_json_progress() helper that outputs JSON when format is JsonLines
  • Add get_word_count() method to BpeTrainer for progress estimation

Python Bindings

  • Add progress_format parameter to BpeTrainer constructor (accepts "indicatif", "json", "silent")
  • Add progress_format getter/setter properties
  • Add get_word_count() method

Usage

from tokenizers.trainers import BpeTrainer

# Current usage - unchanged
trainer = BpeTrainer(vocab_size=65536, show_progress=True)

# New: machine-readable JSON output
trainer = BpeTrainer(vocab_size=65536, progress_format="json")

# Or set after creation
trainer.progress_format = "json"
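
Because the JSON lines go to stderr, one way to consume them is to run training in a child process and read that stream. The following is a sketch under stated assumptions: `stream_progress` is a hypothetical helper, and `DEMO_CMD` is a stand-in child that writes a single sample record instead of running a real training script.

```python
import json
import subprocess
import sys
from typing import Dict, Iterator, List

# Stand-in for a real training script (assumption): writes one sample record to stderr.
SAMPLE_LINE = '{"stage":"Count pairs","current":500,"total":5000000}'
DEMO_CMD = [sys.executable, "-c", f"import sys; sys.stderr.write({SAMPLE_LINE!r} + '\\n')"]


def stream_progress(cmd: List[str]) -> Iterator[Dict]:
    """Run `cmd` and yield each JSON progress record found on its stderr."""
    proc = subprocess.Popen(cmd, stderr=subprocess.PIPE, text=True)
    try:
        for line in proc.stderr:
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                continue  # ordinary log output, not a progress record
            if isinstance(record, dict) and "stage" in record:
                yield record
    finally:
        # Always release the pipe and reap the child, even on early exit.
        proc.stderr.close()
        proc.wait()


if __name__ == "__main__":
    for record in stream_progress(DEMO_CMD):
        print(record["stage"], record["current"], record["total"])
```

In a real setup the command would point at a script that trains with `progress_format="json"`, and the records could be forwarded to a web UI or a logger.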

Backward Compatibility

  • Default behavior is Indicatif - identical to current behavior
  • Existing code works without any changes
  • New option is opt-in only

Test Plan

  • Rust core compiles
  • Python bindings build
  • Default format shows indicatif progress bars
  • JSON format outputs valid JSON lines to stderr
  • Silent format produces no output
  • get_word_count() returns correct count after feeding

Add ProgressFormat enum to control how training progress is reported:
- Indicatif (default): Interactive terminal progress bars
- JsonLines: Machine-readable JSON lines to stderr
- Silent: No progress output

Changes:
- Add ProgressFormat enum to tokenizers/src/utils/progress.rs
- Add progress_format field and builder method to BpeTrainer
- Modify setup_progress() to respect progress_format
- Add emit_json_progress() helper for JSON output
- Expose progress_format getter/setter in Python bindings
- Add get_word_count() method to BpeTrainer

JSON output format:
{"stage":"Tokenize words","current":1000,"total":5000000}

This enables programmatic consumption of training progress for web UIs,
logging systems, and other non-TTY environments where indicatif progress
bars are not visible.

podarok commented Dec 26, 2025

[image] Works great with this change.

podarok added a commit to podarok/datasets that referenced this pull request Dec 28, 2025
Similar to huggingface/tokenizers#1921, adds machine-readable JSON progress output.

- Add set_progress_format() and get_progress_format() functions
- Support 'tqdm' (default), 'json', and 'silent' formats
- Emit JSON progress every 5% when format='json'
- Export new functions from datasets.utils

Cross-reference: huggingface/tokenizers#1921

podarok added a commit to podarok/huggingface_hub that referenced this pull request Dec 30, 2025
Add set_progress_format() and get_progress_format() functions to control
progress output format:
- "tqdm" (default): Interactive progress bars
- "json": Machine-readable JSON lines to stderr
- "silent": No progress output

When format is "json", emits progress every 5% as:
{"stage":"Downloading file","current":1024,"total":4096,"percent":25.0}

Similar to huggingface/tokenizers#1921 and huggingface/datasets#7920
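
The "every 5%" emission described in the two commit messages above could be sketched as follows. This is not the code from those commits: `JsonProgressEmitter` and its parameters are hypothetical, and the sketch assumes one record is written each time progress crosses a new 5% boundary, in the field order of the sample line.

```python
import json
import sys
from typing import Optional, TextIO


class JsonProgressEmitter:
    """Write a JSON progress line whenever a new `step_percent` boundary is crossed."""

    def __init__(self, stage: str, total: int, step_percent: float = 5.0,
                 stream: Optional[TextIO] = None) -> None:
        if step_percent <= 0:
            raise ValueError("step_percent must be positive")
        self.stage = stage
        self.total = total
        self.step_percent = step_percent
        self.stream = stream if stream is not None else sys.stderr
        self._last_bucket = 0  # index of the last boundary already reported

    def update(self, current: int) -> bool:
        """Report `current`; return True if a line was written."""
        percent = 100.0 * current / self.total if self.total else 100.0
        bucket = int(percent // self.step_percent)
        if bucket <= self._last_bucket:
            return False  # still inside an already reported step
        self._last_bucket = bucket
        record = {
            "stage": self.stage,
            "current": current,
            "total": self.total,
            "percent": round(percent, 1),
        }
        # Compact separators keep each record on one line without spaces.
        self.stream.write(json.dumps(record, separators=(",", ":")) + "\n")
        self.stream.flush()
        return True
```

Throttling by percentage bucket rather than by call count keeps the output volume bounded regardless of how often `update` is called.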