koichog commented May 27, 2025

Code Structure

  • Introduced tokenizer_config struct to encapsulate CLI options and state
  • Modularized CLI parsing into:
    • process_command_line_args()
    • parse_command_line_args()
  • Explicit handling of mutually exclusive prompt sources:
    • --prompt
    • --file
    • --stdin
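As a rough illustration of the structure described above, a configuration struct with an explicit check for mutually exclusive prompt sources might look like the sketch below. The field names and the helper `has_single_prompt_source` are hypothetical and are not taken from the PR; only the struct name `tokenizer_config` and the three flags come from the description.

```cpp
#include <string>

// Hypothetical sketch: the field names are illustrative, not the PR's actual layout.
struct tokenizer_config {
    std::string model_path;          // value of -m
    std::string prompt;              // value of --prompt
    std::string prompt_file;         // value of --file
    bool        prompt_set = false;  // --prompt was given
    bool        file_set   = false;  // --file was given
    bool        use_stdin  = false;  // --stdin was given
};

// Returns true only when exactly one of --prompt, --file, --stdin was selected.
static bool has_single_prompt_source(const tokenizer_config & cfg) {
    const int n_sources = int(cfg.prompt_set) + int(cfg.file_set) + int(cfg.use_stdin);
    return n_sources == 1;
}
```

Counting the selected sources keeps the check in one place, so a caller can reject both "no prompt source" and "more than one prompt source" with a single test.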

Functional Enhancements

  • Integrated log.h logging macros (LOG_DBG, LOG_INF, LOG_ERR) for color-coded and level-aware output
  • Added --log-disable flag to mute logs for scripting or debugging
  • Escape sequence control via --no-escape option
  • Optional --show-count flag for token count summary
  • --no-bos and --no-parse-special options toggle BOS token and special token parsing
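The boolean options listed above could be mapped onto a small set of switches along the following lines. This is a minimal sketch, not the PR's parser: the struct `tokenize_flags`, the function `apply_flags`, and the default values are assumptions made for illustration.

```cpp
#include <string>
#include <vector>

// Hypothetical flag set; the defaults shown here are assumptions.
struct tokenize_flags {
    bool add_bos       = true;   // cleared by --no-bos
    bool parse_special = true;   // cleared by --no-parse-special
    bool escape        = true;   // cleared by --no-escape
    bool show_count    = false;  // set by --show-count
    bool log_disable   = false;  // set by --log-disable
};

// Applies each boolean flag in order; returns false at the first unrecognized argument.
static bool apply_flags(const std::vector<std::string> & args, tokenize_flags & out) {
    for (const std::string & arg : args) {
        if      (arg == "--no-bos")           { out.add_bos       = false; }
        else if (arg == "--no-parse-special") { out.parse_special = false; }
        else if (arg == "--no-escape")        { out.escape        = false; }
        else if (arg == "--show-count")       { out.show_count    = true;  }
        else if (arg == "--log-disable")      { out.log_disable   = true;  }
        else                                  { return false; }
    }
    return true;
}
```

For example, passing `{"--no-bos", "--show-count"}` would leave `parse_special` and `escape` at their assumed defaults while disabling the BOS token and enabling the count summary.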

Cross-Platform Improvements

  • Replaced std::cout and printf with unified logging
  • Added proper UTF-8 console output handling for Windows (write_utf8_to_stdout)
  • CLI arguments handled safely across platforms (UTF-8 on Windows via CommandLineToArgvW)
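One common way to get correct UTF-8 output on a Windows console is to convert the text to UTF-16 and call WriteConsoleW, falling back to a plain byte write when stdout is redirected or on other platforms. The sketch below shows that approach under those assumptions; the helper name `write_utf8_sketch` is hypothetical, and the code is not the PR's write_utf8_to_stdout.

```cpp
#include <cstdio>
#include <string>
#ifdef _WIN32
#include <windows.h>
#endif

// Hypothetical helper: writes a UTF-8 string to stdout, returning true on success.
static bool write_utf8_sketch(const std::string & s) {
    if (s.empty()) {
        return true;  // nothing to write
    }
#ifdef _WIN32
    HANDLE h    = GetStdHandle(STD_OUTPUT_HANDLE);
    DWORD  mode = 0;
    // GetConsoleMode succeeds only when stdout is an actual console.
    if (h != INVALID_HANDLE_VALUE && GetConsoleMode(h, &mode)) {
        const int n = MultiByteToWideChar(CP_UTF8, 0, s.data(), (int) s.size(), nullptr, 0);
        if (n <= 0) {
            return false;  // conversion failed
        }
        std::wstring w((size_t) n, L'\0');
        MultiByteToWideChar(CP_UTF8, 0, s.data(), (int) s.size(), &w[0], n);
        DWORD written = 0;
        return WriteConsoleW(h, w.data(), (DWORD) w.size(), &written, nullptr) != 0;
    }
#endif
    // Redirected output or non-Windows platform: write the raw UTF-8 bytes.
    return std::fwrite(s.data(), 1, s.size(), stdout) == s.size();
}
```

Checking for a console first means piped or redirected output still receives unmodified UTF-8 bytes, which matters when the tool is used in scripts.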

UX Improvements

  • Rewritten help message with:
    • Clear usage structure
    • Grouped flags by category
    • Realistic examples for quickstart

Summary of Benefits

  • Clean separation of logic and configuration
  • Easier to extend and maintain
  • Better developer experience with structured logs
  • Better user experience with error feedback and usage guidance

Example

./llama-tokenize -m model.gguf -p "Hello world" --show-count

Collaborator

CISC commented May 27, 2025

This tool should probably be migrated to the common_params parser. @ggerganov ?

@ggerganov
Member

> This tool should probably be migrated to the common_params parser. @ggerganov ?

Yes, it already depends on libcommon so it should reuse the argument parser. The alternative is to eliminate the dependency on libcommon altogether and keep the current argument parsing. Either way is OK.
