Conversation

@thad0ctor thad0ctor commented Jun 15, 2025

Minor improvements to llama-bench

New Features

  1. Separate prompt/generation timing: reports prompt processing and token generation performance as separate metrics (a sketch of the proposed output follows this list).
  2. n_threads_batch: adds an n_threads_batch parameter to the available options.
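
A rough sketch of what the split output could look like (illustrative only: the separate pp t/s and tg t/s columns are this PR's proposal, the values are placeholders, and upstream llama-bench prints a single t/s column per test row):

| model         | backend | ngl | test        | pp t/s | tg t/s |
| ------------- | ------- | --- | ----------- | ------ | ------ |
| llama 7B Q4_0 | CUDA    |  99 | pp512+tg128 |    ... |    ... |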


thad0ctor commented Jun 15, 2025

Added options below (shown in bold in the original comment; the new flag is --n-threads-batch):

./llama-bench --help
ggml_cuda_init: GGML_CUDA_FORCE_MMQ: yes
ggml_cuda_init: GGML_CUDA_FORCE_CUBLAS: no
ggml_cuda_init: found 3 CUDA devices:
Device 0: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 1: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
Device 2: NVIDIA GeForce RTX 5090, compute capability 12.0, VMM: yes
usage: ./llama-bench [options]

options:
-h, --help
--numa <distribute|isolate|numactl> numa mode (default: disabled)
-r, --repetitions <n> number of times to repeat each test (default: 5)
--prio <-1|0|1|2|3> process/thread priority (default: 0)
--delay <0...N> (seconds) delay between each test (default: 0)
-o, --output <csv|json|jsonl|md|sql> output format printed to stdout (default: md)
-oe, --output-err <csv|json|jsonl|md|sql> output format printed to stderr (default: none)
-v, --verbose verbose output
--progress print test progress indicators

test parameters:
-m, --model <filename> (default: models/7B/ggml-model-q4_0.gguf)
-p, --n-prompt <n> (default: 512)
-n, --n-gen <n> (default: 128)
-pg <pp,tg> (default: )
-d, --n-depth <n> (default: 0)
-b, --batch-size <n> (default: 2048)
-ub, --ubatch-size <n> (default: 512)
-ctk, --cache-type-k <t> (default: f16)
-ctv, --cache-type-v <t> (default: f16)
-dt, --defrag-thold <f> (default: -1)
-t, --threads <n> (default: 24)
--n-threads-batch <n> (default: 24)
-C, --cpu-mask <hex,hex> (default: 0x0)
--cpu-strict <0|1> (default: 0)
--poll <0...100> (default: 50)
-ngl, --n-gpu-layers <n> (default: 99)
-sm, --split-mode <none|layer|row> (default: layer)
-mg, --main-gpu <i> (default: 0)
-nkvo, --no-kv-offload <0|1> (default: 0)
-fa, --flash-attn <0|1> (default: 0)
-mmp, --mmap <0|1> (default: 1)
-embd, --embeddings <0|1> (default: 0)
-ts, --tensor-split <ts0/ts1/..> (default: 0)
-ot --override-tensors <tensor name pattern>=<buffer type>;...
(default: disabled)
-nopo, --no-op-offload <0|1> (default: 0)
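
A usage sketch for the new flag (the flag itself is this PR's addition; the thread counts and model path below are placeholders):

./llama-bench -m models/7B/ggml-model-q4_0.gguf -t 8 --n-threads-batch 16
# -t sets the thread count for token generation; --n-threads-batch sets the
# thread count for prompt/batch processing, mirroring the n_threads_batch
# context parameter in llama.cpp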


thad0ctor commented Jun 15, 2025

OBE (overcome by events).


slaren commented Jun 16, 2025

llama-bench accepts ranges for the numeric parameters, e.g. to test pp from 128 to 256 in increments of 64, you can use llama-bench -p 128-256+64. How does this functionality differ?
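
For example (a minimal sketch; the model path is a placeholder):

./llama-bench -m models/7B/ggml-model-q4_0.gguf -p 128-256+64 -n 0
# runs prompt-processing tests at n_prompt = 128, 192, and 256,
# with generation tests disabled (-n 0)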


thad0ctor commented Jun 16, 2025

> llama-bench accepts ranges for the numeric parameters, e.g. to test pp from 128 to 256 in increments of 64, you can use llama-bench -p 128-256+64. How does this functionality differ?

Good catch.

I missed this detail in the documentation, so I can remove the added args. The addition of the pp and gen t/s columns in the console output is likely worth keeping, as is the n-threads-batch parameter, but the rest can go.

@thad0ctor thad0ctor marked this pull request as draft June 16, 2025 01:34
@thad0ctor thad0ctor closed this Jun 16, 2025