Commit d7beacb

Added post processing (for reasoning tokens) to pipeline (#882)
1 parent: 994e9e0
20 files changed: +823 −135 lines

docs/source/quicktour.mdx

Lines changed: 22 additions & 3 deletions
@@ -30,7 +30,7 @@ lighteval accelerate \
     "leaderboard|truthfulqa:mc|0|0"
 ```
 
-Here, we first choose a backend (either `accelerate`, `nanotron`, or `vllm`), and then specify the model and task(s) to run.
+Here, we first choose a backend (either `accelerate`, `nanotron`, `endpoint`, or `vllm`), and then specify the model and task(s) to run.
 
 The syntax for the model arguments is `key1=value1,key2=value2,etc`.
 Valid key-value pairs correspond with the backend configuration, and are detailed [below](#Model Arguments).
@@ -104,13 +104,32 @@ GPUs.
 
 ## Backend configuration
 
+#### General information
+
 The `model-args` argument takes a string representing a list of model
 argument. The arguments allowed vary depending on the backend you use and
 correspond to the fields of the model configs.
 
-The model config can be found [here](./package_reference/models).
+The model configurations can be found [here](./package_reference/models).
+
+All models allow you to post-process your reasoning model predictions,
+removing the thinking tokens from the trace used to compute the metrics:
+use `--remove-reasoning-tags` to enable it and `--reasoning-tags` to
+specify which reasoning tags to remove (defaults to `<think>` and `</think>`).
+
+Here's an example with `mistralai/Magistral-Small-2507`, which outputs custom
+think tokens:
+
+```bash
+lighteval vllm \
+    "model_name=mistralai/Magistral-Small-2507,dtype=float16,data_parallel_size=4" \
+    "lighteval|aime24|0|0" \
+    --remove-reasoning-tags \
+    --reasoning-tags="[('[THINK]','[/THINK]')]"
+```
+
 
-## Nanotron
+#### Nanotron
 
 To evaluate a model trained with nanotron on a single gpu.
 
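For readers skimming the commit, here is a minimal sketch of what this post-processing amounts to, assuming the tags arrive as the literal string shown above. The helper name and sample strings are illustrative, not lighteval's actual internals:

```python
import ast
import re


def strip_reasoning_tags(text: str, tag_pairs: list[tuple[str, str]]) -> str:
    """Remove every (start, end) reasoning span, tags included."""
    for start, end in tag_pairs:
        # DOTALL lets a trace span multiple lines; the non-greedy ".*?"
        # keeps two separate traces from being merged into one match.
        text = re.sub(re.escape(start) + r".*?" + re.escape(end), "", text, flags=re.DOTALL)
    return text.strip()


# The CLI value is a Python-literal string, e.g. --reasoning-tags="[('[THINK]','[/THINK]')]".
tags = ast.literal_eval("[('[THINK]','[/THINK]')]")
raw = "[THINK]12 x 12 = 144.[/THINK]The answer is 144."
print(strip_reasoning_tags(raw, tags))  # -> The answer is 144.
```

Stripping happens before metric computation, so exact-match style metrics see only the final answer rather than the reasoning trace.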

src/lighteval/logging/info_loggers.py

Lines changed: 4 additions & 21 deletions
@@ -170,27 +170,10 @@ class Detail:
     """Experiment details of one single example of one task.
 
     Attributes:
-        example (str): Current task example query
-        instruction (str): Instruction prepended to the example and few shots.
-            For example "In this task, you are given information of type x. You need to predict y."
-        full_prompt (str): Expanded full prompt (instruction if present, then prompt)
-        num_effective_few_shots (int): Number of actual few shots used for the example.
-            This depends on the model context length and few-shots samples size: when using effective few-shots,
-            only `num_effective_few_shots` few-shot samples are kept, allowing
-            1) each of the used few-shot examples and the prompt to not be truncated
-            2) this context still allows the model to predict up to the requested max numbers of tokens within its remaining context size.
-        num_asked_few_shots (int): Initially asked number of few-shot samples.
-        predictions (list): List of the actual model predictions
-        input_tokens (list): List of the input tokens given to the model
-        cont_tokens (list): List of the continuation tokens predicted by the model
-        truncated (list): Size of the truncations (if it was needed to fit the prompt in the model context length)
-        padded (list): Size of the padding (if it was needed for the current example)
-        gold (list): Example gold targets (for generative evaluations)
-        pred_logits (list): List of the actual model predicted logits
-        choices (list): List of the possible choices (for multichoice/loglikelihood evaluations)
-        gold_index (list): Indices of the gold targets among the [`choices`]
-        metrics (dict): Metric name to current example score
+        doc (Doc): The [`Doc`] object containing the current example information.
+        model_response (ModelResponse): The [`ModelResponse`] object containing the model response for the current example.
+        metric (dict): The metric scores for the current example.
+            Example: {"accuracy": 0.5, "f1": 0.7, "exact_match": 0.6}
     """
 
     doc: Doc
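Read as code, the slimmed-down record would look roughly like this; a sketch inferred from the docstring, with assumed import paths for `Doc` and `ModelResponse`:

```python
from dataclasses import dataclass, field

from lighteval.models.model_output import ModelResponse  # assumed import path
from lighteval.tasks.requests import Doc  # assumed import path


@dataclass
class Detail:
    """One evaluated example: its input, the model's output, and its scores (sketch)."""

    doc: Doc  # query, choices, gold targets, few-shot context
    model_response: ModelResponse  # raw and post-processed generations
    metric: dict = field(default_factory=dict)  # e.g. {"accuracy": 0.5, "f1": 0.7}
```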

src/lighteval/main_accelerate.py

Lines changed: 12 additions & 0 deletions
@@ -60,6 +60,16 @@ def accelerate( # noqa C901
     load_responses_from_details_date_id: Annotated[
         Optional[str], Option(help="Load responses from details directory.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = None,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === saving ===
     output_dir: Annotated[
         str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANEL_NAME_2)
@@ -131,6 +141,8 @@ def accelerate( # noqa C901
         custom_tasks_directory=custom_tasks,
         num_fewshot_seeds=num_fewshot_seeds,
         max_samples=max_samples,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
         load_responses_from_details_date_id=load_responses_from_details_date_id,
     )
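Since typer hands `reasoning_tags` to the pipeline as a plain string (or `None`), it has to be parsed into pairs before use. A hedged sketch of one safe way to do that; `parse_reasoning_tags` and `DEFAULT_REASONING_TAGS` are hypothetical names, not lighteval's actual helpers:

```python
import ast

DEFAULT_REASONING_TAGS = [("<think>", "</think>")]


def parse_reasoning_tags(raw: str | None) -> list[tuple[str, str]]:
    """Turn a CLI string like "[('<think>', '</think>')]" into (start, end) pairs."""
    if raw is None:
        return DEFAULT_REASONING_TAGS
    tags = ast.literal_eval(raw)  # literal parsing only; no arbitrary eval
    if not all(isinstance(pair, tuple) and len(pair) == 2 for pair in tags):
        raise ValueError(f"Expected a list of (start, end) tag pairs, got: {raw!r}")
    return tags


print(parse_reasoning_tags(None))                        # [('<think>', '</think>')]
print(parse_reasoning_tags("[('[THINK]','[/THINK]')]"))  # [('[THINK]', '[/THINK]')]
```

Using `ast.literal_eval` rather than `eval` keeps the CLI from executing arbitrary code while still accepting Python-literal syntax.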

src/lighteval/main_custom.py

Lines changed: 28 additions & 16 deletions
@@ -31,10 +31,10 @@
 app = typer.Typer()
 
 
-HELP_PANNEL_NAME_1 = "Common Parameters"
-HELP_PANNEL_NAME_2 = "Logging Parameters"
-HELP_PANNEL_NAME_3 = "Debug Parameters"
-HELP_PANNEL_NAME_4 = "Modeling Parameters"
+HELP_PANEL_NAME_1 = "Common Parameters"
+HELP_PANEL_NAME_2 = "Logging Parameters"
+HELP_PANEL_NAME_3 = "Debug Parameters"
+HELP_PANEL_NAME_4 = "Modeling Parameters"
 
 
 @app.command(rich_help_panel="Evaluation Backends")
@@ -45,46 +45,56 @@ def custom(
     tasks: Annotated[str, Argument(help="Comma-separated list of tasks to evaluate on.")],
     # === Common parameters ===
     dataset_loading_processes: Annotated[
-        int, Option(help="Number of processes to use for dataset loading.", rich_help_panel=HELP_PANNEL_NAME_1)
+        int, Option(help="Number of processes to use for dataset loading.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = 1,
     custom_tasks: Annotated[
-        Optional[str], Option(help="Path to custom tasks directory.", rich_help_panel=HELP_PANNEL_NAME_1)
+        Optional[str], Option(help="Path to custom tasks directory.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = None,
     num_fewshot_seeds: Annotated[
-        int, Option(help="Number of seeds to use for few-shot evaluation.", rich_help_panel=HELP_PANNEL_NAME_1)
+        int, Option(help="Number of seeds to use for few-shot evaluation.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = 1,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === saving ===
     output_dir: Annotated[
-        str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANNEL_NAME_2)
+        str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANEL_NAME_2)
     ] = "results",
     results_path_template: Annotated[
         str | None,
         Option(
             help="Template path for where to save the results, you have access to 3 variables, `output_dir`, `org` and `model`. for example a template can be `'{output_dir}/1234/{org}+{model}'`",
-            rich_help_panel=HELP_PANNEL_NAME_2,
+            rich_help_panel=HELP_PANEL_NAME_2,
         ),
     ] = None,
     push_to_hub: Annotated[
-        bool, Option(help="Push results to the huggingface hub.", rich_help_panel=HELP_PANNEL_NAME_2)
+        bool, Option(help="Push results to the huggingface hub.", rich_help_panel=HELP_PANEL_NAME_2)
     ] = False,
     push_to_tensorboard: Annotated[
-        bool, Option(help="Push results to tensorboard.", rich_help_panel=HELP_PANNEL_NAME_2)
+        bool, Option(help="Push results to tensorboard.", rich_help_panel=HELP_PANEL_NAME_2)
     ] = False,
     public_run: Annotated[
-        bool, Option(help="Push results and details to a public repo.", rich_help_panel=HELP_PANNEL_NAME_2)
+        bool, Option(help="Push results and details to a public repo.", rich_help_panel=HELP_PANEL_NAME_2)
     ] = False,
     results_org: Annotated[
-        Optional[str], Option(help="Organization to push results to.", rich_help_panel=HELP_PANNEL_NAME_2)
+        Optional[str], Option(help="Organization to push results to.", rich_help_panel=HELP_PANEL_NAME_2)
     ] = None,
     save_details: Annotated[
-        bool, Option(help="Save detailed, sample per sample, results.", rich_help_panel=HELP_PANNEL_NAME_2)
+        bool, Option(help="Save detailed, sample per sample, results.", rich_help_panel=HELP_PANEL_NAME_2)
     ] = False,
     # === debug ===
     max_samples: Annotated[
-        Optional[int], Option(help="Maximum number of samples to evaluate on.", rich_help_panel=HELP_PANNEL_NAME_3)
+        Optional[int], Option(help="Maximum number of samples to evaluate on.", rich_help_panel=HELP_PANEL_NAME_3)
     ] = None,
     job_id: Annotated[
-        int, Option(help="Optional job id for future refenrence.", rich_help_panel=HELP_PANNEL_NAME_3)
+        int, Option(help="Optional job id for future refenrence.", rich_help_panel=HELP_PANEL_NAME_3)
     ] = 0,
 ):
     """
@@ -113,6 +123,8 @@ def custom(
         custom_tasks_directory=custom_tasks,
         num_fewshot_seeds=num_fewshot_seeds,
         max_samples=max_samples,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
     pipeline = Pipeline(
         tasks=tasks,

src/lighteval/main_endpoint.py

Lines changed: 48 additions & 0 deletions
@@ -62,6 +62,16 @@ def inference_endpoint(
     load_responses_from_details_date_id: Annotated[
         Optional[str], Option(help="Load responses from details directory.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = None,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === saving ===
     output_dir: Annotated[
         str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANEL_NAME_2)
@@ -136,6 +146,8 @@ def inference_endpoint(
         num_fewshot_seeds=num_fewshot_seeds,
         max_samples=max_samples,
         load_responses_from_details_date_id=load_responses_from_details_date_id,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
     pipeline = Pipeline(
         tasks=tasks,
@@ -175,6 +187,16 @@ def tgi(
     load_responses_from_details_date_id: Annotated[
         Optional[str], Option(help="Load responses from details directory.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = None,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === saving ===
     output_dir: Annotated[
         str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANEL_NAME_2)
@@ -253,6 +275,8 @@ def tgi(
         num_fewshot_seeds=num_fewshot_seeds,
         max_samples=max_samples,
         load_responses_from_details_date_id=load_responses_from_details_date_id,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
     pipeline = Pipeline(
         tasks=tasks,
@@ -295,6 +319,16 @@ def litellm(
     load_responses_from_details_date_id: Annotated[
         Optional[str], Option(help="Load responses from details directory.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = None,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === saving ===
     output_dir: Annotated[
         str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANEL_NAME_2)
@@ -376,6 +410,8 @@ def litellm(
         num_fewshot_seeds=num_fewshot_seeds,
         max_samples=max_samples,
         load_responses_from_details_date_id=load_responses_from_details_date_id,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
     pipeline = Pipeline(
         tasks=tasks,
@@ -449,6 +485,16 @@ def inference_providers(
             rich_help_panel=HELP_PANEL_NAME_2,
         ),
     ] = False,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === debug ===
     max_samples: Annotated[
         Optional[int], Option(help="Maximum number of samples to evaluate on.", rich_help_panel=HELP_PANEL_NAME_3)
@@ -493,6 +539,8 @@ def inference_providers(
         num_fewshot_seeds=num_fewshot_seeds,
         max_samples=max_samples,
         load_responses_from_details_date_id=None,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
     pipeline = Pipeline(
         tasks=tasks,

src/lighteval/main_nanotron.py

Lines changed: 12 additions & 0 deletions
@@ -43,6 +43,16 @@ def nanotron(
         str, Option(help="Path to the nanotron checkpoint YAML or python config file, potentially on s3.")
     ],
     lighteval_config_path: Annotated[str, Option(help="Path to a YAML config to be used for the evaluation.")],
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
 ):
     """
     Evaluate models using nanotron as backend.
@@ -101,6 +111,8 @@ def nanotron(
         custom_tasks_directory=lighteval_config.tasks.custom_tasks,
         num_fewshot_seeds=1,
         max_samples=lighteval_config.tasks.max_samples,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
 
     pipeline = Pipeline(

src/lighteval/main_sglang.py

Lines changed: 12 additions & 0 deletions
@@ -53,6 +53,16 @@ def sglang(
     load_responses_from_details_date_id: Annotated[
         Optional[str], Option(help="Load responses from details directory.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = None,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = True,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === saving ===
     output_dir: Annotated[
         str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANEL_NAME_2)
@@ -122,6 +132,8 @@ def sglang(
         num_fewshot_seeds=num_fewshot_seeds,
         max_samples=max_samples,
         load_responses_from_details_date_id=load_responses_from_details_date_id,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
 
     if model_args.endswith(".yaml"):

src/lighteval/main_vllm.py

Lines changed: 12 additions & 0 deletions
@@ -56,6 +56,16 @@ def vllm(
     load_responses_from_details_date_id: Annotated[
         Optional[str], Option(help="Load responses from details directory.", rich_help_panel=HELP_PANEL_NAME_1)
     ] = None,
+    remove_reasoning_tags: Annotated[
+        bool, Option(help="Remove reasoning tags from responses.", rich_help_panel=HELP_PANEL_NAME_1)
+    ] = False,
+    reasoning_tags: Annotated[
+        str | None,
+        Option(
+            help="List of reasoning tags (provided as pairs) to remove from responses. Default is [('<think>', '</think>')].",
+            rich_help_panel=HELP_PANEL_NAME_1,
+        ),
+    ] = None,
     # === saving ===
     output_dir: Annotated[
         str, Option(help="Output directory for evaluation results.", rich_help_panel=HELP_PANEL_NAME_2)
@@ -126,6 +136,8 @@ def vllm(
         max_samples=max_samples,
         cot_prompt=cot_prompt,
         load_responses_from_details_date_id=load_responses_from_details_date_id,
+        remove_reasoning_tags=remove_reasoning_tags,
+        reasoning_tags=reasoning_tags,
     )
 
     if model_args.endswith(".yaml"):

src/lighteval/metrics/dynamic_metrics.py

Lines changed: 1 addition & 1 deletion
@@ -236,7 +236,7 @@ def add_to_specifics_with_timeout(
 
     def sample_level_fn(doc: Doc, model_response: ModelResponse) -> float:
         golds = doc.get_golds()
-        predictions = model_response.text
+        predictions = model_response.final_text
 
         gold_extraction_regexes = get_extraction_regexes(doc, gold_extraction_target, language)
         pred_extraction_regexes = get_extraction_regexes(doc, pred_extraction_target, language)
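This one-line switch is what ties the metrics to the new post-processing: extraction now runs on `model_response.final_text` instead of the raw `model_response.text`. A hypothetical sketch of how the two could relate; the real `ModelResponse` is defined elsewhere in lighteval and may differ:

```python
import re
from dataclasses import dataclass, field


@dataclass
class ModelResponse:
    """Sketch: raw generations plus a cleaned view for metric computation."""

    text: list[str] = field(default_factory=list)  # raw generations, tags included
    reasoning_tags: list[tuple[str, str]] = field(default_factory=lambda: [("<think>", "</think>")])

    @property
    def final_text(self) -> list[str]:
        # The view metrics score: each generation with its reasoning
        # span (tags and everything between) removed.
        cleaned = []
        for generation in self.text:
            for start, end in self.reasoning_tags:
                pattern = re.escape(start) + r".*?" + re.escape(end)
                generation = re.sub(pattern, "", generation, flags=re.DOTALL)
            cleaned.append(generation.strip())
        return cleaned


resp = ModelResponse(text=["<think>2 + 2 = 4</think>The answer is 4."])
print(resp.final_text)  # ['The answer is 4.']
```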
