Tokenize logging #13821

koichog · 2025-05-27T07:46:52Z

Code Structure

Introduced tokenizer_config struct to encapsulate CLI options and state
Modularized CLI parsing into:
- process_command_line_args()
- parse_command_line_args()
Explicit handling of mutually exclusive prompt sources:
- --prompt
- --file
- --stdin
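A minimal sketch of how the mutually exclusive prompt sources might be validated. The field names and the `validate_prompt_source` helper are hypothetical and not taken from the PR diff; the sketch also assumes that exactly one source must be given.

```cpp
#include <string>

// Hypothetical shape of the config struct; field names are illustrative only.
struct tokenizer_config {
    std::string prompt;               // value of --prompt
    bool        prompt_set = false;   // true if --prompt was given
    std::string file_path;            // value of --file
    bool        file_set   = false;   // true if --file was given
    bool        from_stdin = false;   // true if --stdin was given
};

// Returns an empty string when exactly one prompt source is set,
// otherwise a human-readable error message.
static std::string validate_prompt_source(const tokenizer_config & cfg) {
    const int n_sources = (cfg.prompt_set ? 1 : 0)
                        + (cfg.file_set   ? 1 : 0)
                        + (cfg.from_stdin ? 1 : 0);
    if (n_sources == 0) {
        return "error: one of --prompt, --file or --stdin is required";
    }
    if (n_sources > 1) {
        return "error: --prompt, --file and --stdin are mutually exclusive";
    }
    return "";
}
```

Counting the sources in one place keeps the check independent of the order in which the flags are parsed.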

Functional Enhancements

Integrated log.h logging macros (LOG_DBG, LOG_INF, LOG_ERR) for color-coded and level-aware output
Added --log-disable flag to mute logs for scripting or debugging
Escape sequence control via --no-escape option
Optional --show-count flag for token count summary
--no-bos and --no-parse-special options toggle BOS token and special token parsing
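As an illustration of the escape sequence control, the sketch below assumes that backslash escapes in the prompt are expanded by default and that `--no-escape` leaves the prompt untouched. The helper names `process_escapes` and `prepare_prompt` are hypothetical, and only a small subset of escapes is handled.

```cpp
#include <cstddef>
#include <string>

// Illustrative helper: expands a few backslash escapes (\n, \t, \\, \").
// Unknown escapes and a trailing backslash are copied through unchanged.
static std::string process_escapes(const std::string & in) {
    std::string out;
    out.reserve(in.size());
    for (std::size_t i = 0; i < in.size(); ++i) {
        if (in[i] == '\\' && i + 1 < in.size()) {
            switch (in[i + 1]) {
                case 'n':  out += '\n'; ++i; continue;
                case 't':  out += '\t'; ++i; continue;
                case '\\': out += '\\'; ++i; continue;
                case '"':  out += '"';  ++i; continue;
                default:   break; // unknown escape: keep the backslash as-is
            }
        }
        out += in[i];
    }
    return out;
}

// Assumed behaviour: no_escape == true corresponds to passing --no-escape.
static std::string prepare_prompt(const std::string & raw, bool no_escape) {
    return no_escape ? raw : process_escapes(raw);
}
```

With this sketch, a prompt containing the two characters `\n` becomes a real newline unless `no_escape` is set.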

Cross-Platform Improvements

Replaced std::cout and printf with unified logging
Added proper UTF-8 console output handling for Windows (write_utf8_to_stdout)
CLI arguments handled safely across platforms (UTF-8 on Windows via CommandLineToArgvW)
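The Windows argument handling can be sketched roughly as follows. `get_utf8_args` is a hypothetical helper name; the Windows branch relies on the Win32 calls `GetCommandLineW`, `CommandLineToArgvW`, and `WideCharToMultiByte`, while other platforms are assumed to deliver `argv` as UTF-8 already.

```cpp
#include <string>
#include <vector>

#ifdef _WIN32
#define WIN32_LEAN_AND_MEAN
#include <windows.h>
#include <shellapi.h> // CommandLineToArgvW
#endif

// Returns the command-line arguments as UTF-8 strings.
static std::vector<std::string> get_utf8_args(int argc, char ** argv) {
#ifdef _WIN32
    (void) argc;
    (void) argv;
    std::vector<std::string> args;
    int wargc = 0;
    // Re-read the command line as UTF-16 instead of trusting the ANSI argv.
    LPWSTR * wargv = CommandLineToArgvW(GetCommandLineW(), &wargc);
    if (wargv == nullptr) {
        return args;
    }
    for (int i = 0; i < wargc; ++i) {
        // First call reports the required size, including the terminator.
        const int len = WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, nullptr, 0, nullptr, nullptr);
        std::string s(len > 0 ? len : 0, '\0');
        if (len > 0) {
            WideCharToMultiByte(CP_UTF8, 0, wargv[i], -1, &s[0], len, nullptr, nullptr);
            s.resize(len - 1); // drop the trailing '\0' written by the API
        }
        args.push_back(s);
    }
    LocalFree(wargv);
    return args;
#else
    // Assumption: argv is already UTF-8 on non-Windows platforms.
    return std::vector<std::string>(argv, argv + argc);
#endif
}
```

Calling `get_utf8_args(argc, argv)` at the top of `main` and parsing the returned vector keeps the rest of the argument handling platform-independent.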

UX Improvements

Rewritten help message with:
- Clear usage structure
- Grouped flags by category
- Realistic examples for quickstart

Summary of Benefits

Clean separation of logic and configuration
Easier to extend and maintain
Better developer experience with structured logs
Better user experience with error feedback and usage guidance

Example

./llama-tokenize -m model.gguf -p "Hello world" --show-count

CISC · 2025-05-27T07:58:18Z

This tool should probably be migrated to the common_params parser. @ggerganov ?

ggerganov · 2025-05-27T08:09:24Z

> This tool should probably be migrated to the common_params parser. @ggerganov ?

Yes, it already depends on libcommon, so it should reuse the argument parser. The alternative is to eliminate the dependency on libcommon altogether and keep the current argument parsing. Either way is OK.

koichog added 2 commits May 27, 2025 09:51

Logging and general improvements

0344598

Logging and general improvements

b58e42a

github-actions bot added the examples label May 27, 2025
