feat: add progress_format option for machine-readable JSON output #1921
+111
−8

Summary
Add a `progress_format` option to `BpeTrainer` that allows choosing between different progress output formats:
- `Indicatif` (default): Interactive terminal progress bars (current behavior, unchanged)
- `JsonLines`: Machine-readable JSON lines to stderr
- `Silent`: No progress output

This enables programmatic consumption of training progress for web UIs, logging systems, and other non-TTY environments where indicatif progress bars are not visible.
Motivation
When running tokenizer training in a web application backend or logging environment, the indicatif progress bars are not visible and cannot be consumed programmatically.
This PR adds an opt-in JSON output mode that emits structured progress data:
    {"stage":"Tokenize words","current":1000,"total":5000000}
    {"stage":"Count pairs","current":500,"total":5000000}
    {"stage":"Compute merges","current":30000,"total":65536}

Changes
Rust Core
- Add `ProgressFormat` enum to `tokenizers/src/utils/progress.rs`
- Export `ProgressFormat` from `tokenizers/src/utils/mod.rs` and `tokenizers/src/lib.rs`
- Add `progress_format` field and `.progress_format()` builder method to `BpeTrainer`
- Update `setup_progress()` to only create indicatif bar when format is `Indicatif`
- Add `emit_json_progress()` helper that outputs JSON when format is `JsonLines`
- Add `get_word_count()` method to `BpeTrainer` for progress estimation

Python Bindings
- Add `progress_format` parameter to `BpeTrainer` constructor (accepts "indicatif", "json", "silent")
- Add `progress_format` getter/setter properties
- Add `get_word_count()` method

Usage
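On the consuming side, a backend would read the trainer's stderr and parse each line. The following is a minimal sketch of such a reader, not code from this PR: the `Progress` type and `iter_progress` function are hypothetical names, and the only thing taken from the PR is the line shape with `stage`, `current`, and `total` keys. It assumes stderr may also carry non-JSON lines, which are skipped.

```python
import io
import json
from typing import Iterator, NamedTuple, TextIO


class Progress(NamedTuple):
    """One progress event (hypothetical type, mirrors the JSON keys above)."""
    stage: str
    current: int
    total: int

    @property
    def fraction(self) -> float:
        # Guard against a zero total so an early event cannot divide by zero.
        return self.current / self.total if self.total else 0.0


def iter_progress(stream: TextIO) -> Iterator[Progress]:
    """Yield Progress records from a stream of JSON lines, skipping anything else."""
    for line in stream:
        line = line.strip()
        if not line.startswith("{"):
            continue  # stderr may also carry warnings or plain log lines
        try:
            event = json.loads(line)
            yield Progress(str(event["stage"]), int(event["current"]), int(event["total"]))
        except (ValueError, KeyError, TypeError):
            continue  # malformed or unrelated JSON


# Stand-in for the trainer's stderr; in practice this would be a pipe.
sample = io.StringIO(
    '{"stage":"Tokenize words","current":1000,"total":5000000}\n'
    "some unrelated warning\n"
    '{"stage":"Compute merges","current":30000,"total":65536}\n'
)
events = list(iter_progress(sample))
for p in events:
    print(f"{p.stage}: {p.current}/{p.total} ({p.fraction:.1%})")
```

Skipping unparseable lines rather than raising keeps the reader usable when other output is interleaved on stderr; a stricter consumer could log or count the skipped lines instead.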
Backward Compatibility
- `Indicatif` (default): identical to current behavior

Test Plan
- `get_word_count()` returns correct count after feeding