
Commit b315ef3

refactor: Cli (#3440)
modular subcommands (run/ls/validate), YAML config support (--config), new EvaluatorConfig dataclass.
1 parent 41f6bfa commit b315ef3

19 files changed: +2851 −754 lines

README.md

Lines changed: 16 additions & 3 deletions
@@ -5,6 +5,8 @@
 ---
 
 ## Latest News 📣
+- [2025/12] **CLI refactored** with subcommands (`run`, `ls`, `validate`) and YAML config file support via `--config`. See the [CLI Reference](./docs/interface.md) and [Configuration Guide](./docs/config_files.md).
+- [2025/12] **Lighter install**: Base package no longer includes `transformers`/`torch`. Install model backends separately: `pip install lm_eval[hf]`, `lm_eval[vllm]`, etc.
 - [2025/07] Added `think_end_token` arg to `hf` (token/str), `vllm` and `sglang` (str) for stripping CoT reasoning traces from models that support it.
 - [2025/03] Added support for steering HF models!
 - [2025/02] Added [SGLang](https://docs.sglang.ai/) support!
@@ -95,11 +97,22 @@ A detailed table of all optional extras is available at the end of this document
 
 ## Basic Usage
 
-### User Guide
+### Documentation
 
-A user guide detailing the full list of supported arguments is provided [here](./docs/interface.md), and on the terminal by calling `lm_eval -h`. Alternatively, you can use `lm-eval` instead of `lm_eval`.
+| Guide | Description |
+|-------|-------------|
+| [CLI Reference](./docs/interface.md) | Command-line arguments and subcommands |
+| [Configuration Guide](./docs/config_files.md) | YAML config file format and examples |
+| [Python API](./docs/python-api.md) | Programmatic usage with `simple_evaluate()` |
+| [Task Guide](./lm_eval/tasks/README.md) | Available tasks and task configuration |
 
-A list of supported tasks (or groupings of tasks) can be viewed with `lm-eval --tasks list`. Task descriptions and links to corresponding subfolders are provided [here](./lm_eval/tasks/README.md).
+Use `lm-eval -h` to see available options, or `lm-eval run -h` for evaluation options.
+
+List available tasks with:
+
+```bash
+lm-eval ls tasks
+```
 
 ### Hugging Face `transformers`
 
docs/config_files.md

Lines changed: 121 additions & 0 deletions
@@ -0,0 +1,121 @@
# Configuration Guide

This guide explains how to use YAML configuration files with `lm-eval` to define reusable evaluation settings.

## Overview

Instead of passing many CLI arguments, you can define evaluation parameters in a YAML configuration file:

```bash
# Instead of:
lm-eval run --model hf --model_args pretrained=gpt2,dtype=float32 --tasks hellaswag arc_easy --num_fewshot 5 --batch_size 8 --device cuda:0

# Use:
lm-eval run --config eval_config.yaml
```

CLI arguments override config file values, so you can set defaults in a config file and override specific settings on the command line:

```bash
lm-eval run --config eval_config.yaml --tasks mmlu --limit 100
```

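The precedence rule — an explicit CLI value wins, an unset one falls back to the config file — can be sketched as a simple dict merge. This is a toy illustration of the behavior described above, not lm-eval's actual implementation:

```python
# Hypothetical sketch of config precedence: CLI values that were actually
# set (non-None) override the config file; unset options fall through.
def merge_config(file_cfg: dict, cli_args: dict) -> dict:
    merged = dict(file_cfg)
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged

file_cfg = {"tasks": ["hellaswag", "arc_easy"], "num_fewshot": 5, "limit": None}
cli_args = {"tasks": ["mmlu"], "num_fewshot": None, "limit": 100}
print(merge_config(file_cfg, cli_args))
# → {'tasks': ['mmlu'], 'num_fewshot': 5, 'limit': 100}
```

Here `--tasks mmlu --limit 100` replace the file's values, while `num_fewshot` keeps its config-file default of 5.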
## Quick Reference

All configuration keys correspond directly to CLI arguments. See the [CLI Reference](interface.md#lm-eval-run) for detailed descriptions of each option.

## Config Schema

| Field | Type | Default | Description |
|-------|------|---------|-------------|
| `model` | string | `"hf"` | Model type/provider |
| `model_args` | dict | `{}` | Model constructor arguments |
| `tasks` | list/string | required | Tasks to evaluate |
| `num_fewshot` | int/null | `null` | Few-shot example count |
| `batch_size` | int/string | `1` | Batch size or "auto" |
| `max_batch_size` | int/null | `null` | Max batch size for auto |
| `device` | string/null | `"cuda:0"` | Device to use |
| `limit` | float/null | `null` | Example limit per task |
| `samples` | dict/null | `null` | Specific sample indices |
| `use_cache` | string/null | `null` | Response cache path |
| `cache_requests` | string/dict | `{}` | Request cache settings |
| `output_path` | string/null | `null` | Results output path |
| `log_samples` | bool | `false` | Save model I/O |
| `predict_only` | bool | `false` | Skip metrics |
| `apply_chat_template` | bool/string | `false` | Chat template |
| `system_instruction` | string/null | `null` | System prompt |
| `fewshot_as_multiturn` | bool/null | `null` | Multi-turn few-shot |
| `include_path` | string/null | `null` | External tasks path |
| `gen_kwargs` | dict | `{}` | Generation arguments |
| `wandb_args` | dict | `{}` | W&B init arguments |
| `hf_hub_log_args` | dict | `{}` | HF Hub logging |
| `seed` | list/int | `[0,1234,1234,1234]` | Random seeds |
| `trust_remote_code` | bool | `false` | Trust remote code |
| `metadata` | dict | `{}` | Task metadata |

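Some of the field types in the table are easier to see than to describe. An illustrative YAML fragment (values are examples only, not recommendations; see the CLI Reference for exact semantics):

```yaml
limit: 0.1                    # float: treated as a fraction of each task's examples
seed: [0, 1234, 1234, 1234]   # list: one seed per RNG
gen_kwargs:                   # dict: forwarded to the model's generate call
  temperature: 0.0
apply_chat_template: true     # bool, or a string naming a specific template
```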
---

## Example

```yaml
# basic_eval.yaml
model: hf
model_args:
  pretrained: gpt2
  dtype: float32

tasks:
  - hellaswag
  - arc_easy

num_fewshot: 0
batch_size: auto
device: cuda:0

output_path: ./results/gpt2/
log_samples: true

wandb_args:
  project: llm-evals
  name: gpt2-baseline
  tags:
    - gpt2
    - baseline

hf_hub_log_args:
  hub_results_org: my-org
  results_repo_name: llm-eval-results
  push_results_to_hub: true
  public_repo: false
```

---

## Programmatic Usage

For loading config files in Python, see the [Python API Guide](python-api.md#using-evaluatorconfig).

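The commit message mentions a new `EvaluatorConfig` dataclass. Its real definition lives in lm-eval and is documented in the Python API guide; as a rough, hypothetical sketch of the load-from-dict pattern (field names here merely mirror the schema table above and may not match the actual class):

```python
# Hypothetical sketch only: a dataclass mirroring a few schema fields,
# with a from_dict constructor that ignores unknown keys so older or
# hand-edited configs still load.
from dataclasses import dataclass, field, fields
from typing import Optional, Union

@dataclass
class EvaluatorConfig:
    model: str = "hf"
    model_args: dict = field(default_factory=dict)
    tasks: Union[list, str, None] = None
    num_fewshot: Optional[int] = None
    batch_size: Union[int, str] = 1
    device: Optional[str] = "cuda:0"
    limit: Optional[float] = None

    @classmethod
    def from_dict(cls, raw: dict) -> "EvaluatorConfig":
        known = {f.name for f in fields(cls)}
        return cls(**{k: v for k, v in raw.items() if k in known})

cfg = EvaluatorConfig.from_dict(
    {"model": "hf", "tasks": ["hellaswag"], "batch_size": "auto"}
)
print(cfg.batch_size)  # → auto
```

A YAML file parsed with `yaml.safe_load` yields exactly such a dict, so the same constructor serves both CLI and programmatic entry points.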
---

## Validation

Validate your configuration before running:

```bash
# Check that tasks exist
lm-eval validate --tasks hellaswag,arc_easy

# With external tasks
lm-eval validate --tasks my_task --include_path /path/to/tasks
```

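Conceptually, validation just checks each requested task name against the registry of known tasks before any model is loaded. A toy sketch of that idea (not lm-eval's actual code; the function and registry here are invented for illustration):

```python
# Toy sketch: report unknown task names up front, before loading a model.
def validate_tasks(requested: list, registry: set) -> bool:
    unknown = sorted(set(requested) - registry)
    if unknown:
        raise ValueError(f"Unknown tasks: {', '.join(unknown)}")
    return True

registry = {"hellaswag", "arc_easy", "mmlu"}
print(validate_tasks(["hellaswag", "arc_easy"], registry))  # → True
```

Failing fast here is the point of the `validate` subcommand: a typo in a task name surfaces in seconds rather than after minutes of model loading.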
---

## Tips

1. **Start simple**: Begin with a minimal config and add options as needed
2. **Use CLI overrides**: Set defaults in the config file, override with CLI flags for experiments
3. **Separate concerns**: Create different configs for different model families or task sets
4. **Version control**: Commit config files alongside results for reproducibility
5. **Use comments**: YAML supports `#` comments to document your choices
