
improved benchmark tool (hf-llm-trainer skill) #128

Open
evalstate wants to merge 1 commit into main from feat/improve-benchmark-tool

Conversation

@evalstate
Collaborator

Improved benchmark tool. Help text is below for reference:

usage: hf_benchmarks.py [-h] {search,leaderboard,model-results} ...

Search benchmark datasets and fetch leaderboard results from the Hugging Face Hub.

Workflow ideas:
  1) Discover candidate benchmarks:
       hf_benchmarks.py search --alias ocr
       hf_benchmarks.py search --alias coding
       hf_benchmarks.py search --task image-to-text --modality document

  2) Inspect a leaderboard:
       hf_benchmarks.py leaderboard allenai/olmOCR-bench --top 10

  3) Chain search -> leaderboard:
       hf_benchmarks.py search --alias coding --format ndjson \
         | hf_benchmarks.py leaderboard --stdin --top 5 --format table

  4) Fetch eval results for a list of models:
       printf '%s\n' Qwen/Qwen3.5-9B microsoft/Phi-3-medium-4k-instruct \
         | hf_benchmarks.py model-results --stdin --format ndjson

  5) Use hf CLI for model discovery, then enrich with this tool:
       hf models list --search 'Phi-3' --filter eval-results --limit 5 --format json \
         | jq -r '.[].id' \
         | hf_benchmarks.py model-results --stdin --format table

  6) Use hf CLI for dataset discovery, then fetch leaderboards:
       hf datasets list --search 'swe' --filter benchmark:official --limit 5 --format json \
         | jq -r '.[].id' \
         | hf_benchmarks.py leaderboard --stdin --top 5 --format table

positional arguments:
  {search,leaderboard,model-results}
    search              Search benchmark datasets by query, alias, task, and modality
    leaderboard         Fetch normalized leaderboard rows for one or more benchmark datasets
    model-results       Fetch normalized evalResults rows for one or more models

options:
  -h, --help            show this help message and exit


search command options:
  usage: hf_benchmarks.py search [-h] [--query QUERY] [--alias ALIAS] [--task TASK] [--modality MODALITY] [--limit LIMIT]
                                 [--format {table,json,ndjson}]

  options:
    -h, --help            show this help message and exit
    --query QUERY         Free-text query to match against benchmark dataset metadata. Repeatable.
    --alias ALIAS         Convenience alias for common benchmark domains. Known aliases: agents, asr, coding, math, ocr, retrieval.
                          Repeatable.
    --task TASK           Task to match, e.g. text-generation, image-to-text, question-answering. Repeatable.
    --modality MODALITY   Modality to match, e.g. text, image, document, audio. Repeatable.
    --limit LIMIT         Maximum number of rows to print (default: 20).
    --format {table,json,ndjson}
                          Output format (default: table).

leaderboard command options:
  usage: hf_benchmarks.py leaderboard [-h] [--stdin] [--task-id TASK_ID] [--top TOP] [--format {table,json,ndjson}] [datasets ...]

  Fetch normalized leaderboard rows for one or more benchmark datasets.

  This command is designed to pair well with `hf datasets list`, where
  `hf` handles benchmark dataset discovery and this tool handles
  leaderboard retrieval / flattening.

  positional arguments:
    datasets              Dataset repo ids (<namespace>/<repo>). Can also be supplied via stdin with --stdin.

  options:
    -h, --help            show this help message and exit
    --stdin               Read dataset ids from stdin. Accepts plain repo ids or NDJSON with dataset_id/id fields.
    --task-id TASK_ID     Optional leaderboard task_id query parameter.
    --top TOP             Only keep the top N results per leaderboard.
    --format {table,json,ndjson}
                          Output format (default: table).

  Examples:
    hf_benchmarks.py leaderboard allenai/olmOCR-bench --top 10

    printf '%s\n' openai/gsm8k SWE-bench/SWE-bench_Verified \
      | hf_benchmarks.py leaderboard --stdin --top 5 --format ndjson

    hf datasets list --search 'swe' --filter benchmark:official --limit 5 --format json \
      | jq -r '.[].id' \
      | hf_benchmarks.py leaderboard --stdin --top 5 --format table

model-results command options:
  usage: hf_benchmarks.py model-results [-h] [--stdin] [--dataset DATASET] [--task-id TASK_ID] [--top TOP] [--format {table,json,ndjson}]
                                        [models ...]

  Fetch normalized evalResults rows for one or more model repos.

  This command is designed to pair well with `hf models list`, where
  `hf` handles discovery and this tool handles flattening / filtering
  per-model benchmark results.

  positional arguments:
    models                Model repo ids (<namespace>/<repo>). Can also be supplied via stdin with --stdin.

  options:
    -h, --help            show this help message and exit
    --stdin               Read model ids from stdin. Accepts plain repo ids or NDJSON with model_id/id fields.
    --dataset DATASET     Only keep eval rows whose dataset_id matches one of these values. Repeatable.
    --task-id TASK_ID     Only keep eval rows whose task_id matches one of these values. Repeatable.
    --top TOP             Only keep the top N eval rows per model after filtering.
    --format {table,json,ndjson}
                          Output format (default: table).

  Examples:
    hf_benchmarks.py model-results Qwen/Qwen3.5-9B

    printf '%s\n' Qwen/Qwen3.5-9B microsoft/Phi-3-medium-4k-instruct \
      | hf_benchmarks.py model-results --stdin --format ndjson

    hf models list --search 'Phi-3' --filter eval-results --limit 5 --format json \
      | jq -r '.[].id' \
      | hf_benchmarks.py model-results --stdin --dataset openai/gsm8k --format table
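
The shell pipelines above can also be driven from Python. Below is a minimal sketch mirroring workflow idea 3 (search -> leaderboard); it assumes the script is invoked as `python hf_benchmarks.py` from the repo root, and that each search NDJSON row carries a `dataset_id` field, which is inferred from the `--stdin` help above rather than documented output schema:

    import json
    import subprocess

    # Discover OCR benchmarks, then fetch the top-5 leaderboard rows for each.
    # Equivalent to:
    #   hf_benchmarks.py search --alias ocr --format ndjson \
    #     | hf_benchmarks.py leaderboard --stdin --top 5 --format ndjson
    search = subprocess.run(
        ["python", "hf_benchmarks.py", "search", "--alias", "ocr", "--format", "ndjson"],
        capture_output=True, text=True, check=True,
    )
    # Assumption: each NDJSON row exposes a dataset_id field (per the --stdin help).
    ids = [json.loads(line)["dataset_id"] for line in search.stdout.splitlines() if line.strip()]

    board = subprocess.run(
        ["python", "hf_benchmarks.py", "leaderboard", "--stdin", "--top", "5", "--format", "ndjson"],
        input="\n".join(ids), capture_output=True, text=True, check=True,
    )
    print(board.stdout)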

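The NDJSON output also lends itself to downstream scripting. Here is a hedged sketch of a consumer that keeps the best result per (model, dataset) pair from `model-results --format ndjson`; the row field names (`model_id`, `dataset_id`, `value`) are assumptions, since the help text does not document the output row schema:

    import json
    import sys

    # Pipe in: ... | hf_benchmarks.py model-results --stdin --format ndjson | python this_script.py
    # Assumption: rows have model_id, dataset_id, and a numeric value field.
    best = {}
    for line in sys.stdin:
        if not line.strip():
            continue
        row = json.loads(line)
        key = (row.get("model_id"), row.get("dataset_id"))
        value = row.get("value")
        if isinstance(value, (int, float)) and (key not in best or value > best[key]):
            best[key] = value

    for (model, dataset), value in sorted(best.items(), key=str):
        print(f"{model}\t{dataset}\t{value:.4g}")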

@evalstate changed the title from "improved benchmark tool" to "improved benchmark tool (hf-llm-trainer skill)" on Apr 22, 2026