
Commit 28ddc54

Merge remote-tracking branch 'origin/main' into v0.12-release
2 parents: b77c6b2 + 31433cc

325 files changed (+381 / -1382 lines)


README.md

Lines changed: 2 additions & 2 deletions
@@ -127,7 +127,7 @@ Did not find what you need ? You can always make your custom model API by follow
Here's a **quick command** to evaluate using the *Accelerate backend*:

```shell
-lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
```

Or use the **Python API** to run a model *already loaded in memory*!

@@ -141,7 +141,7 @@ from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters


MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
-BENCHMARKS = "lighteval|gsm8k|0"
+BENCHMARKS = "gsm8k"

evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(
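To put this change in context: the README's Python API now takes the bare task name. Below is a minimal sketch of the surrounding snippet, assuming the `EvaluationTracker` import path, the `AutoModelForCausalLM` loading step, the launcher type, and the final pipeline calls; only the constants and the two tracker/parameter lines actually appear in this diff.

```python
# A sketch assembled around the diff context above; everything not shown in
# the hunk (model loading, launcher type, the evaluate/show calls) is assumed.
from transformers import AutoModelForCausalLM

from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

MODEL_NAME = "meta-llama/Meta-Llama-3-8B-Instruct"
BENCHMARKS = "gsm8k"  # short form replacing the old "lighteval|gsm8k|0" spec

evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(launcher_type=ParallelismManager.ACCELERATE)

# Run a model that is already loaded in memory.
model = AutoModelForCausalLM.from_pretrained(MODEL_NAME)

pipeline = Pipeline(
    tasks=BENCHMARKS,
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model=model,
)
pipeline.evaluate()
pipeline.show_results()
```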

docs/source/adding-a-custom-task.mdx

Lines changed: 3 additions & 11 deletions
@@ -78,7 +78,6 @@ from lighteval.tasks.lighteval_task import LightevalTaskConfig
task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,  # Must be defined in the file or imported
-    suite=["community"],
    hf_repo="your_dataset_repo_on_hf",
    hf_subset="default",
    hf_avail_splits=["train", "test"],

@@ -115,7 +114,6 @@ class CustomSubsetTask(LightevalTaskConfig):
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random_sampling_from_train",
-    suite=["lighteval"],
    generation_size=256,
    stop_sequence=["\n", "Question:"],
)

@@ -149,22 +147,16 @@ Once your file is created, you can run the evaluation with the following command
```bash
lighteval accelerate \
"model_name=HuggingFaceH4/zephyr-7b-beta" \
-"lighteval|{task}|{fewshots}" \
+{task} \
--custom-tasks {path_to_your_custom_task_file}
```

### Example Usage

```bash
-# Run a custom task with zero-shot evaluation
+# Run a custom task with 3-shot evaluation
lighteval accelerate \
"model_name=openai-community/gpt2" \
-"lighteval|myothertask|0" \
---custom-tasks community_tasks/my_custom_task.py
-
-# Run a custom task with few-shot evaluation
-lighteval accelerate \
-"model_name=openai-community/gpt2" \
-"lighteval|myothertask|3" \
+"myothertask|3" \
--custom-tasks community_tasks/my_custom_task.py
```
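The net effect of this file's changes: custom task configs drop the `suite` field, and the CLI takes `"myothertask|3"` directly. A minimal sketch of a matching custom task file, built only from the fields visible in these hunks; the prompt function body, the dataset column names, and the `TASKS_TABLE` registration are assumptions based on the usual custom-task layout:

```python
# my_custom_task.py -- a sketch mirroring the config fields shown in the diff
# above. NOTE: the real file also sets fields not visible in these hunks
# (e.g. the task's metrics); the Doc mapping below is a placeholder.
from lighteval.tasks.lighteval_task import LightevalTaskConfig
from lighteval.tasks.requests import Doc


def prompt_fn(line: dict, task_name: str = None) -> Doc:
    # Map one dataset row to a Doc; the column names depend on your dataset.
    return Doc(
        task_name=task_name,
        query=line["question"],
        choices=[line["answer"]],
        gold_index=0,
    )


task = LightevalTaskConfig(
    name="myothertask",
    prompt_function=prompt_fn,  # must be defined in the file or imported
    hf_repo="your_dataset_repo_on_hf",
    hf_subset="default",
    hf_avail_splits=["train", "test"],
    evaluation_splits=["test"],
    few_shots_split="train",
    few_shots_select="random_sampling_from_train",
    generation_size=256,
    stop_sequence=["\n", "Question:"],
)

# Custom task files expose their tasks through a TASKS_TABLE list.
TASKS_TABLE = [task]
```

Such a file would then be passed to the `--custom-tasks` flag exactly as in the example usage above.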

docs/source/adding-a-new-metric.mdx

Lines changed: 1 addition & 1 deletion
@@ -146,7 +146,7 @@ path_to_your_file` when launching it after adding it to the task config.
```bash
lighteval accelerate \
"model_name=openai-community/gpt2" \
-"leaderboard|truthfulqa:mc|0" \
+"truthfulqa:mc" \
--custom-tasks path_to_your_metric_file.py
```
152152

docs/source/available-tasks.mdx

Lines changed: 1 addition & 1 deletion
@@ -26,5 +26,5 @@ lighteval tasks inspect <task_name>

Example:
```bash
-lighteval tasks inspect "lighteval|truthfulqa:mc|0"
+lighteval tasks inspect truthfulqa:mc
```

docs/source/evaluating-a-custom-model.mdx

Lines changed: 2 additions & 2 deletions
@@ -59,7 +59,7 @@ You can evaluate your custom model using either the command-line interface or th
lighteval custom \
"google-translate" \
"examples/custom_models/google_translate_model.py" \
-"lighteval|wmt20:fr-de|0" \
+"wmt20:fr-de" \
--max-samples 10
```

@@ -94,7 +94,7 @@ model_config = CustomModelConfig(

# Create and run the pipeline
pipeline = Pipeline(
-    tasks="leaderboard|truthfulqa:mc|0",
+    tasks="truthfulqa:mc",
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config
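For the Python half of this file, a sketch of the full pipeline around the changed `tasks=` line; the import paths and the launcher type are assumptions, while the config values mirror the CLI example above:

```python
# A sketch of the Python API shown in this diff; import paths and
# PipelineParameters arguments are assumptions, the rest mirrors the hunks.
from lighteval.logging.evaluation_tracker import EvaluationTracker
from lighteval.models.custom.custom_model import CustomModelConfig
from lighteval.pipeline import ParallelismManager, Pipeline, PipelineParameters

evaluation_tracker = EvaluationTracker(output_dir="./results")
pipeline_params = PipelineParameters(launcher_type=ParallelismManager.CUSTOM)

model_config = CustomModelConfig(
    model_name="google-translate",
    model_definition_file_path="examples/custom_models/google_translate_model.py",
)

# Create and run the pipeline
pipeline = Pipeline(
    tasks="truthfulqa:mc",  # short form of the old "leaderboard|truthfulqa:mc|0"
    pipeline_parameters=pipeline_params,
    evaluation_tracker=evaluation_tracker,
    model_config=model_config,
)
pipeline.evaluate()
pipeline.save_and_push_results()
```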

docs/source/index.mdx

Lines changed: 1 addition & 1 deletion
@@ -59,7 +59,7 @@ pip install lighteval

```bash
lighteval eval "hf-inference-providers/openai/gpt-oss-20b" \
-"lighteval|gpqa:diamond|0" \
+gpqa:diamond \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals
```

docs/source/inspect-ai.mdx

Lines changed: 17 additions & 10 deletions
@@ -21,13 +21,13 @@ Once you've chosen a benchmark, run it with `lighteval eval`. Below are examples
1. Evaluate a model via Hugging Face Inference Providers.

```bash
-lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0"
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond
```

2. Run multiple evals at the same time.

```bash
-lighteval eval "hf-inference-providers/openai/gpt-oss-20b" "lighteval|gpqa:diamond|0,lighteval|aime25|0"
+lighteval eval "hf-inference-providers/openai/gpt-oss-20b" gpqa:diamond,aime25
```

3. Compare providers for the same model.
@@ -37,25 +37,32 @@ lighteval eval \
hf-inference-providers/openai/gpt-oss-20b:fireworks-ai \
hf-inference-providers/openai/gpt-oss-20b:together \
hf-inference-providers/openai/gpt-oss-20b:nebius \
+gpqa:diamond
+```
+
+You can also compare every provider serving a model in one line:
+
+```bash
+lighteval eval \
+hf-inference-providers/openai/gpt-oss-20b:all \
"lighteval|gpqa:diamond|0"
```

4. Evaluate a vLLM or SGLang model.

```bash
-lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct "lighteval|gpqa:diamond|0"
+lighteval eval vllm/HuggingFaceTB/SmolLM-135M-Instruct gpqa:diamond
```

5. See the impact of few-shot on your model.

```bash
-lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0,lighteval|gsm8k|5"
+lighteval eval hf-inference-providers/openai/gpt-oss-20b "gsm8k|0,gsm8k|5"
```

6. Optimize custom server connections.

```bash
-lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
+lighteval eval hf-inference-providers/openai/gpt-oss-20b gsm8k \
--max-connections 50 \
--timeout 30 \
--retry-on-error 1 \
@@ -66,13 +73,13 @@ lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|gsm8k|0" \
7. Use multiple epochs for more reliable results.

```bash
-lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --epochs 16 --epochs-reducer "pass_at_4"
+lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --epochs 16 --epochs-reducer "pass_at_4"
```

8. Push to the Hub to share results.

```bash
-lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|hle|0" \
+lighteval eval hf-inference-providers/openai/gpt-oss-20b hle \
--bundle-dir gpt-oss-bundle \
--repo-id OpenEvals/evals \
--max-samples 100
@@ -92,17 +99,17 @@ Resulting Space:
9. You can use any argument defined in inspect-ai's API.

```bash
-lighteval eval hf-inference-providers/openai/gpt-oss-20b "lighteval|aime25|0" --temperature 0.1
+lighteval eval hf-inference-providers/openai/gpt-oss-20b aime25 --temperature 0.1
```

10. Use `--model-args` to pass any inference-provider-specific argument.

```bash
-lighteval eval google/gemini-2.5-pro "lighteval|aime25|0" --model-args location=us-east5
+lighteval eval google/gemini-2.5-pro aime25 --model-args location=us-east5
```

```bash
-lighteval eval openai/gpt-4o "lighteval|gpqa:diamond|0" --model-args service_tier=flex,client_timeout=1200
+lighteval eval openai/gpt-4o gpqa:diamond --model-args service_tier=flex,client_timeout=1200
```

docs/source/quicktour.mdx

Lines changed: 4 additions & 5 deletions
@@ -21,6 +21,7 @@ Lighteval can be used with several different commands, each optimized for differ

### Evaluation Backends

+- `lighteval eval`: Use [inspect-ai](https://inspect.aisi.org.uk/) as the backend to evaluate and inspect your models! (preferred way)
- `lighteval accelerate`: Evaluate models on CPU or one or more GPUs using [🤗
Accelerate](https://github.com/huggingface/accelerate)
- `lighteval nanotron`: Evaluate models in distributed settings using [⚡️

@@ -54,11 +55,9 @@ To evaluate `GPT-2` on the Truthful QA benchmark with [🤗
```bash
lighteval accelerate \
"model_name=openai-community/gpt2" \
-"leaderboard|truthfulqa:mc|0"
+truthfulqa:mc
```

-Here, we first choose a backend (either `accelerate`, `nanotron`, `endpoint`, or `vllm`), and then specify the model and task(s) to run.
-
### Task Specification

The syntax for the task specification might be a bit hard to grasp at first. The format is as follows:

@@ -96,7 +95,7 @@ When specifying a path to a file, it should start with `./`.
lighteval accelerate \
"model_name=openai-community/gpt2" \
./path/to/lighteval/examples/tasks/recommended_set.txt
-# or, e.g., "leaderboard|truthfulqa:mc|0,leaderboard|gsm8k|3"
+# or, e.g., "truthfulqa:mc|0,gsm8k|3"
```

## Backend Configuration

@@ -120,7 +119,7 @@ thinking tokens:
```bash
lighteval vllm \
"model_name=mistralai/Magistral-Small-2507,dtype=float16,data_parallel_size=4" \
-"lighteval|aime24|0" \
+aime24 \
--remove-reasoning-tags \
--reasoning-tags="[('[THINK]','[/THINK]')]"
```

docs/source/saving-and-reading-results.mdx

Lines changed: 5 additions & 5 deletions
@@ -69,7 +69,7 @@ import glob
output_dir = "evals_doc"
model_name = "HuggingFaceH4/zephyr-7b-beta"
timestamp = "latest"
-task = "lighteval|gsm8k|0"
+task = "gsm8k"

if timestamp == "latest":
    path = f"{output_dir}/details/{model_name}/*/"

@@ -94,7 +94,7 @@ from datasets import load_dataset
results_org = "SaylorTwift"
model_name = "HuggingFaceH4/zephyr-7b-beta"
sanitized_model_name = model_name.replace("/", "__")
-task = "lighteval|gsm8k|0"
+task = "gsm8k"
public_run = False

dataset_path = f"{results_org}/details_{sanitized_model_name}{'_private' if not public_run else ''}"

@@ -192,7 +192,7 @@ The main results file contains several sections:
    "model_size": "476.2 MB"
  },
  "results": {
-    "lighteval|gsm8k|0": {
+    "gsm8k|0": {
      "em": 0.0,
      "em_stderr": 0.0,
      "maj@8": 0.0,

@@ -206,7 +206,7 @@ The main results file contains several sections:
    }
  },
  "versions": {
-    "lighteval|gsm8k|0": 0
+    "gsm8k|0": 0
  },
  "config_tasks": {
    "lighteval|gsm8k": {

@@ -257,7 +257,7 @@ The main results file contains several sections:
    }
  },
  "summary_tasks": {
-    "lighteval|gsm8k|0": {
+    "gsm8k|0": {
      "hashes": {
        "hash_examples": "8517d5bf7e880086",
        "hash_full_prompts": "8517d5bf7e880086",

docs/source/use-huggingface-inference-endpoints-or-tgi-as-backend.mdx

Lines changed: 2 additions & 2 deletions
@@ -98,15 +98,15 @@ model_parameters:
```bash
lighteval endpoint inference-endpoint \
"configs/endpoint_model.yaml" \
-"lighteval|gsm8k|0"
+gsm8k
```

### Using an Existing TGI Server

```bash
lighteval endpoint tgi \
"configs/tgi_server.yaml" \
-"lighteval|gsm8k|0"
+gsm8k
```

### Reusing an Existing Endpoint
