
Commit acffc1a

clefourrier, NathanHB, and lewtun authored

Adding custom metric system + IFEval as an example (#48)

Co-authored-by: Nathan Habib <[email protected]>
Co-authored-by: lewtun <[email protected]>
1 parent 3785d85 commit acffc1a

File tree

8 files changed (+3578 −25 lines)


README.md

Lines changed: 50 additions & 23 deletions
@@ -163,6 +163,29 @@ python run_evals_accelerate.py \
     --output_dir output_dir
 ```
 
+### Evaluate a model on community-submitted or custom tasks
+
+You can use `lighteval` to evaluate models on custom or community-submitted tasks. Select your task of interest (which might have its own requirements to install first), and run:
+
+```shell
+python run_evals_accelerate.py \
+    --model_args="pretrained=<path to model on the hub>" \
+    --tasks <task parameters> \
+    --custom_tasks <path to the main file containing the custom task> \
+    --output_dir output_dir
+```
+
+For example, to launch `lighteval` on `ifeval` for `HuggingFaceH4/zephyr-7b-beta`, run (the `--use_chat_template` flag is optional and evaluates the model with its chat template):
+```shell
+python run_evals_accelerate.py \
+    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
+    --use_chat_template \
+    --tasks "custom|ifeval|0|0" \
+    --custom_tasks "tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py" \
+    --output_dir output_dir
+```
+
+
 ## Deep thanks
 `lighteval` was originally built on top of the great [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which is powering the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). We also took a lot of inspiration from the amazing [HELM](https://crfm.stanford.edu/helm/latest/), notably for metrics.
 
@@ -184,29 +207,7 @@ However, we are very grateful to the Harness and HELM teams for their continued
 - [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, that we run at each PR to prevent regressions in metrics/prompts/tasks, for a subset of important tasks.
 
 ## Customisation
-### Adding a new metric
-First check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`.
-
-If not, you can use the custom_task system to register your new metric:
-- create a new python file which should contain the full logic of your metric.
-- the file also needs to start with these imports
-```python
-from aenum import extend_enum
-from lighteval.metrics import Metrics
-
-# And any other class you might need to redefine your specific metric, depending on whether it's a sample or corpus metric.
-```
-
-- and to end with the following, so that it adds your metric to our metrics list when loaded as a module.
-
-```python
-# Adds the metric to the metric list!
-extend_enum(Metrics, "ifeval_metric", ifeval_metrics)
-if __name__ == "__main__":
-    print("Imported metric")
-```
-
-You can then give your custom metric to lighteval by using `--custom-tasks path_to_your_file` when launching it.
+If your new task or metric has requirements, add a specific `requirements.txt` file with your evaluation.
 
 ### Adding a new task
 To add a new task, first either open an issue, to determine whether it will be integrated in the core evaluations of lighteval, or in the community tasks, and **add its dataset** on the hub.
@@ -244,6 +245,32 @@ Copy the `community_tasks/_template.yml` to `community_tasks/yourevalname.py` an
 
 Make sure you can launch your model with your new task using `--tasks community|yournewtask|2|0 --custom_tasks community_tasks/yourevalname.py`.
 
+### Adding a new metric
+First check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`.
+
+If not, you can use the custom_task system to register your new metric:
+- create a new Python file which should contain the full logic of your metric.
+- the file also needs to start with these imports:
+```python
+from aenum import extend_enum
+from lighteval.metrics import Metrics
+
+# And any other class you might need to redefine your specific metric, depending on whether it's a sample or corpus metric.
+```
+
+- and it should end with the following, so that your metric is added to our metrics list when the file is loaded as a module.
+
+```python
+# Adds the metric to the metric list!
+extend_enum(Metrics, "metric_name", metric_function)
+if __name__ == "__main__":
+    print("Imported metric")
+```
+
+You can then give your custom metric to lighteval by using `--custom_tasks path_to_your_file` when launching it.
+
+To see an example of a custom metric added along with a custom task, look at `tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py`.
+
 ## Available metrics
 ### Metrics for multiple choice tasks
 These metrics use log-likelihood of the different possible targets.
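
As a compact companion to the README section reconstructed above, here is a minimal sketch of what such a custom-metric file can look like. The metric name `word_count_metric` and its scoring logic are hypothetical placeholders; the lighteval constructs it uses (`SampleLevelMetricGrouping`, `MetricCategory`, `MetricUseCase`, `extend_enum(Metrics, ...)`) are the same ones that appear in `ifeval.py` below.

```python
import numpy as np
from aenum import extend_enum

from lighteval.metrics import Metrics
from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetricGrouping
from lighteval.tasks.requests import Doc


# Hypothetical sample-level metric: scores each generation by its word count.
def word_count_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    return {"word_count": len(predictions[0].split())}


word_count_grouping = SampleLevelMetricGrouping(
    metric=["word_count"],
    higher_is_better={"word_count": True},
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.ACCURACY,
    sample_level_fn=word_count_metric,
    corpus_level_fn={"word_count": np.mean},  # average the per-sample counts
)

# Adds the metric to the metric list when the file is passed via --custom_tasks
extend_enum(Metrics, "word_count_metric", word_count_grouping)

if __name__ == "__main__":
    print("Imported metric")
```

Loading such a file via `--custom_tasks` then adds the metric to the metrics list, as the README section describes.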

pyproject.toml

Lines changed: 0 additions & 1 deletion
@@ -60,7 +60,6 @@ dependencies = [
   "termcolor==2.3.0",
   "pytablewriter",
   "colorama",
-
   # Extension of metrics
   "aenum==3.1.15",
   # Base metrics

src/lighteval/metrics/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ def apply_generative_metric(results: list[ModelReturn], formatted_doc: Doc, metr
     # Extracting gold
     try:
         golds = formatted_doc.get_golds()
-    except KeyError:
+    except (KeyError, IndexError):
         golds = None
 
     # Specific process for HELM like evals # hrm
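
The widened `except` clause above tolerates documents whose gold answer cannot be extracted at all, not just documents missing the field. Below is a standalone sketch of the failure mode it guards against, assuming (hypothetically) that gold extraction indexes into a doc's choices; the helper is a stand-in, not the lighteval implementation of `get_golds`.

```python
# Hypothetical stand-in for gold extraction: index into a doc's choices.
def get_golds(choices: list[str], gold_index: list[int]) -> list[str]:
    return [choices[i] for i in gold_index]


try:
    golds = get_golds(choices=[], gold_index=[0])  # doc with no usable reference answer
except (KeyError, IndexError):  # catching only KeyError would let the IndexError escape
    golds = None

print(golds)  # -> None, so generative scoring proceeds without a gold reference
```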

tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
+import numpy as np
+from aenum import extend_enum
+
+import tasks_examples.custom_tasks_with_custom_metrics.ifeval.instructions_registry as instructions_registry
+from lighteval.metrics import Metrics
+from lighteval.metrics.utils import (
+    MetricCategory,
+    MetricUseCase,
+    SampleLevelMetricGrouping,
+)
+from lighteval.tasks.lighteval_task import LightevalTaskConfig
+from lighteval.tasks.requests import Doc
+
+
+# We create the task config
+ifeval = LightevalTaskConfig(
+    name="ifeval",
+    prompt_function="ifeval_prompt",
+    suite=["custom"],
+    hf_repo="wis-k/instruction-following-eval",
+    hf_subset="default",
+    metric=["ifeval_metric"],
+    hf_avail_splits=["train"],
+    evaluation_splits=["train"],
+    few_shots_split="train",
+    few_shots_select="random_sampling",
+    generation_size=1280,
+    stop_sequence=[],  # no stop sequence, will use eot token
+)
+
+
+# very specific task where there are no precise outputs but instead we test if the format obeys rules
+def ifeval_prompt(line, task_name: str = None):
+    return Doc(
+        task_name=task_name,
+        query=line["prompt"],
+        choices=[""],
+        gold_index=0,
+        instruction="",
+        specific={"instructions_id_list": line["instruction_id_list"], "kwargs": line["kwargs"]},
+    )
+
+
+submetric_names = [
+    "prompt_level_strict_acc",
+    "inst_level_strict_acc",
+    "prompt_level_loose_acc",
+    "inst_level_loose_acc",
+]
+
+
+def ifeval_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
+    response = predictions[0]
+
+    # Strict instructions
+    instruction_list = formatted_doc.specific["instructions_id_list"]
+    all_kwargs = formatted_doc.specific["kwargs"]
+    prompt = formatted_doc.query
+
+    # Loose instructions
+    r = response.split("\n")
+    response_remove_first = "\n".join(r[1:]).strip()
+    response_remove_last = "\n".join(r[:-1]).strip()
+    response_remove_both = "\n".join(r[1:-1]).strip()
+    revised_response = response.replace("*", "")
+    revised_response_remove_first = response_remove_first.replace("*", "")
+    revised_response_remove_last = response_remove_last.replace("*", "")
+    revised_response_remove_both = response_remove_both.replace("*", "")
+    all_responses = [
+        response,
+        revised_response,
+        response_remove_first,
+        response_remove_last,
+        response_remove_both,
+        revised_response_remove_first,
+        revised_response_remove_last,
+        revised_response_remove_both,
+    ]
+
+    is_following_list_strict = []
+    is_following_list_loose = []
+
+    for index, instruction_id in enumerate(instruction_list):
+        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
+        instruction = instruction_cls(instruction_id)
+
+        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
+        task_kwargs = {k: v for k, v in all_kwargs[index].items() if v}
+        instruction.build_description(**task_kwargs)
+        args = instruction.get_instruction_args()
+        if args and "prompt" in args:
+            instruction.build_description(prompt=prompt)
+
+        # Strict
+        if response.strip() and instruction.check_following(response):
+            is_following_list_strict.append(True)
+        else:
+            is_following_list_strict.append(False)
+
+        # Loose
+        is_following = False
+        for r in all_responses:
+            if r.strip() and instruction.check_following(r):
+                is_following = True
+                break
+
+        is_following_list_loose.append(is_following)
+
+    return {
+        "prompt_level_strict_acc": int(all(is_following_list_strict)),
+        "inst_level_strict_acc": is_following_list_strict,
+        "prompt_level_loose_acc": int(all(is_following_list_loose)),
+        "inst_level_loose_acc": is_following_list_loose,
+    }
+
+
+def agg_inst_level_acc(items):
+    flat_items = [item for sublist in items for item in sublist]
+    inst_level_acc = sum(flat_items) / len(flat_items)
+    return inst_level_acc
+
+
+ifeval_metrics = SampleLevelMetricGrouping(
+    metric=submetric_names,
+    higher_is_better={n: True for n in submetric_names},
+    category=MetricCategory.GENERATIVE,
+    use_case=MetricUseCase.ACCURACY,
+    sample_level_fn=ifeval_metric,
+    corpus_level_fn={
+        "prompt_level_strict_acc": np.mean,
+        "inst_level_strict_acc": agg_inst_level_acc,
+        "prompt_level_loose_acc": np.mean,
+        "inst_level_loose_acc": agg_inst_level_acc,
+    },
+)
+
+
+_TASKS = [ifeval]
+
+# Convert to dict for lighteval
+TASKS_TABLE = [task.as_dict() for task in _TASKS]
+# Adds the metric to the metric list when this file is loaded as a custom task module
+extend_enum(Metrics, "ifeval_metric", ifeval_metrics)
+
+if __name__ == "__main__":
+    print([t["name"] for t in TASKS_TABLE])
+    print(len(TASKS_TABLE))
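
To make the prompt-level versus instruction-level aggregation above concrete, here is a standalone sanity check in plain Python; the follow/not-follow values are made up and no lighteval imports are needed.

```python
# Two hypothetical samples: the first follows both of its instructions,
# the second follows only one of two.
per_sample_inst_level = [[True, True], [True, False]]


def agg_inst_level_acc(items):
    # Same flattening as in ifeval.py: average over every individual instruction.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


# Instruction-level accuracy: 3 followed out of 4 instructions -> 0.75
print(agg_inst_level_acc(per_sample_inst_level))

# Prompt-level accuracy (np.mean over int(all(...)) per sample): only the fully
# followed prompt counts -> 0.5
print(sum(int(all(sample)) for sample in per_sample_inst_level) / len(per_sample_inst_level))
```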
