
improved benchmark tool (hf-llm-trainer skill) #128

Open
evalstate wants to merge 1 commit into main from feat/improve-benchmark-tool

Conversation

@evalstate
Collaborator

Improved benchmark tool. Help text is below for reference:

usage: hf_benchmarks.py [-h] {search,leaderboard,model-results} ...

Search benchmark datasets and fetch leaderboard results from the Hugging Face Hub.

Workflow ideas:
  1) Discover candidate benchmarks:
       hf_benchmarks.py search --alias ocr
       hf_benchmarks.py search --alias coding
       hf_benchmarks.py search --task image-to-text --modality document

  2) Inspect a leaderboard:
       hf_benchmarks.py leaderboard allenai/olmOCR-bench --top 10

  3) Chain search -> leaderboard:
       hf_benchmarks.py search --alias coding --format ndjson \
         | hf_benchmarks.py leaderboard --stdin --top 5 --format table

  4) Fetch eval results for a list of models:
       printf '%s\n' Qwen/Qwen3.5-9B microsoft/Phi-3-medium-4k-instruct \
         | hf_benchmarks.py model-results --stdin --format ndjson

  5) Use hf CLI for model discovery, then enrich with this tool:
       hf models list --search 'Phi-3' --filter eval-results --limit 5 --format json \
         | jq -r '.[].id' \
         | hf_benchmarks.py model-results --stdin --format table

  6) Use hf CLI for dataset discovery, then fetch leaderboards:
       hf datasets list --search 'swe' --filter benchmark:official --limit 5 --format json \
         | jq -r '.[].id' \
         | hf_benchmarks.py leaderboard --stdin --top 5 --format table

positional arguments:
  {search,leaderboard,model-results}
    search              Search benchmark datasets by query, alias, task, and modality
    leaderboard         Fetch normalized leaderboard rows for one or more benchmark datasets
    model-results       Fetch normalized evalResults rows for one or more models

options:
  -h, --help            show this help message and exit


search command options:
  usage: hf_benchmarks.py search [-h] [--query QUERY] [--alias ALIAS] [--task TASK] [--modality MODALITY] [--limit LIMIT]
                                 [--format {table,json,ndjson}]

  options:
    -h, --help            show this help message and exit
    --query QUERY         Free-text query to match against benchmark dataset metadata. Repeatable.
    --alias ALIAS         Convenience alias for common benchmark domains. Known aliases: agents, asr, coding, math, ocr, retrieval.
                          Repeatable.
    --task TASK           Task to match, e.g. text-generation, image-to-text, question-answering. Repeatable.
    --modality MODALITY   Modality to match, e.g. text, image, document, audio. Repeatable.
    --limit LIMIT         Maximum number of rows to print (default: 20).
    --format {table,json,ndjson}
                          Output format (default: table).

leaderboard command options:
  usage: hf_benchmarks.py leaderboard [-h] [--stdin] [--task-id TASK_ID] [--top TOP] [--format {table,json,ndjson}] [datasets ...]

  Fetch normalized leaderboard rows for one or more benchmark datasets.

  This command is designed to pair well with `hf datasets list`, where
  `hf` handles benchmark dataset discovery and this tool handles
  leaderboard retrieval / flattening.

  positional arguments:
    datasets              Dataset repo ids (<namespace>/<repo>). Can also be supplied via stdin with --stdin.

  options:
    -h, --help            show this help message and exit
    --stdin               Read dataset ids from stdin. Accepts plain repo ids or NDJSON with dataset_id/id fields.
    --task-id TASK_ID     Optional leaderboard task_id query parameter.
    --top TOP             Only keep the top N results per leaderboard.
    --format {table,json,ndjson}
                          Output format (default: table).

  Examples:
    hf_benchmarks.py leaderboard allenai/olmOCR-bench --top 10

    printf '%s\n' openai/gsm8k SWE-bench/SWE-bench_Verified \
      | hf_benchmarks.py leaderboard --stdin --top 5 --format ndjson

    hf datasets list --search 'swe' --filter benchmark:official --limit 5 --format json \
      | jq -r '.[].id' \
      | hf_benchmarks.py leaderboard --stdin --top 5 --format table

model-results command options:
  usage: hf_benchmarks.py model-results [-h] [--stdin] [--dataset DATASET] [--task-id TASK_ID] [--top TOP] [--format {table,json,ndjson}]
                                        [models ...]

  Fetch normalized evalResults rows for one or more model repos.

  This command is designed to pair well with `hf models list`, where
  `hf` handles discovery and this tool handles flattening / filtering
  per-model benchmark results.

  positional arguments:
    models                Model repo ids (<namespace>/<repo>). Can also be supplied via stdin with --stdin.

  options:
    -h, --help            show this help message and exit
    --stdin               Read model ids from stdin. Accepts plain repo ids or NDJSON with model_id/id fields.
    --dataset DATASET     Only keep eval rows whose dataset_id matches one of these values. Repeatable.
    --task-id TASK_ID     Only keep eval rows whose task_id matches one of these values. Repeatable.
    --top TOP             Only keep the top N eval rows per model after filtering.
    --format {table,json,ndjson}
                          Output format (default: table).

  Examples:
    hf_benchmarks.py model-results Qwen/Qwen3.5-9B

    printf '%s\n' Qwen/Qwen3.5-9B microsoft/Phi-3-medium-4k-instruct \
      | hf_benchmarks.py model-results --stdin --format ndjson

    hf models list --search 'Phi-3' --filter eval-results --limit 5 --format json \
      | jq -r '.[].id' \
      | hf_benchmarks.py model-results --stdin --dataset openai/gsm8k --format table
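
The shell pipelines above can also be driven from Python. Below is a minimal sketch mirroring workflow idea 3 (search -> leaderboard); it assumes the script is invoked as `python hf_benchmarks.py` from the repo root, and that each search NDJSON row carries a `dataset_id` field, which is inferred from the `--stdin` help above rather than documented output schema:

    import json
    import subprocess

    # Discover OCR benchmarks, then fetch the top-5 leaderboard rows for each.
    # Equivalent to:
    #   hf_benchmarks.py search --alias ocr --format ndjson \
    #     | hf_benchmarks.py leaderboard --stdin --top 5 --format ndjson
    search = subprocess.run(
        ["python", "hf_benchmarks.py", "search", "--alias", "ocr", "--format", "ndjson"],
        capture_output=True, text=True, check=True,
    )
    # Assumption: each NDJSON row exposes a dataset_id field (per the --stdin help).
    ids = [json.loads(line)["dataset_id"] for line in search.stdout.splitlines() if line.strip()]

    board = subprocess.run(
        ["python", "hf_benchmarks.py", "leaderboard", "--stdin", "--top", "5", "--format", "ndjson"],
        input="\n".join(ids), capture_output=True, text=True, check=True,
    )
    print(board.stdout)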

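The NDJSON output also lends itself to downstream scripting. Here is a hedged sketch of a consumer that keeps the best result per (model, dataset) pair from `model-results --format ndjson`; the row field names (`model_id`, `dataset_id`, `value`) are assumptions, since the help text does not document the output row schema:

    import json
    import sys

    # Pipe in: ... | hf_benchmarks.py model-results --stdin --format ndjson | python this_script.py
    # Assumption: rows have model_id, dataset_id, and a numeric value field.
    best = {}
    for line in sys.stdin:
        if not line.strip():
            continue
        row = json.loads(line)
        key = (row.get("model_id"), row.get("dataset_id"))
        value = row.get("value")
        if isinstance(value, (int, float)) and (key not in best or value > best[key]):
            best[key] = value

    for (model, dataset), value in sorted(best.items(), key=str):
        print(f"{model}\t{dataset}\t{value:.4g}")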

@evalstate changed the title from "improved benchmark tool" to "improved benchmark tool (hf-llm-trainer skill)" on Apr 22, 2026