
Commit acffc1a

clefourrier, NathanHB, and lewtun authored

Adding custom metric system + IFEval as an example (#48)

Co-authored-by: Nathan Habib <[email protected]>
Co-authored-by: lewtun <[email protected]>
1 parent 3785d85 commit acffc1a

File tree

8 files changed (+3578 −25 lines)


README.md

Lines changed: 50 additions & 23 deletions
@@ -163,6 +163,29 @@ python run_evals_accelerate.py \
     --output_dir output_dir
 ```
 
+### Evaluate a model on community-submitted or custom tasks
+
+You can use `lighteval` to evaluate models on custom or community-submitted tasks. Select your task of interest (which might have its own requirements to install first), and run:
+
+```shell
+python run_evals_accelerate.py \
+    --model_args="pretrained=<path to model on the hub>" \
+    --tasks <task parameters> \
+    --custom_tasks <path to the main file containing the custom task> \
+    --output_dir output_dir
+```
+
+For example, to launch `lighteval` on `ifeval` for `HuggingFaceH4/zephyr-7b-beta`, run (the `--use_chat_template` flag is optional and evaluates the model with its chat template):
+```shell
+python run_evals_accelerate.py \
+    --model_args "pretrained=HuggingFaceH4/zephyr-7b-beta" \
+    --use_chat_template \
+    --tasks "custom|ifeval|0|0" \
+    --custom_tasks "tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py" \
+    --output_dir output_dir
+```
+
+
 ## Deep thanks
 `lighteval` was originally built on top of the great [Eleuther AI Harness](https://github.com/EleutherAI/lm-evaluation-harness) (which is powering the [Open LLM Leaderboard](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)). We also took a lot of inspiration from the amazing [HELM](https://crfm.stanford.edu/helm/latest/), notably for metrics.
 
@@ -184,29 +207,7 @@ However, we are very grateful to the Harness and HELM teams for their continued
 - [tests](https://github.com/huggingface/lighteval/tree/main/tests) contains our test suite, that we run at each PR to prevent regressions in metrics/prompts/tasks, for a subset of important tasks.
 
 ## Customisation
-### Adding a new metric
-First check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`.
-
-If not, you can use the custom_task system to register your new metric:
-- create a new python file which should contain the full logic of your metric.
-- the file also needs to start with these imports
-```python
-from aenum import extend_enum
-from lighteval.metrics import Metrics
-
-# And any other class you might need to redefine your specific metric, depending on whether it's a sample or corpus metric.
-```
-
-- and to end with the following, so that it adds your metric to our metrics list when loaded as a module.
-
-```python
-# Adds the metric to the metric list!
-extend_enum(Metrics, "ifeval_metric", ifeval_metrics)
-if __name__ == "__main__":
-    print("Imported metric")
-```
-
-You can then give your custom metric to lighteval by using `--custom-tasks path_to_your_file` when launching it.
+If your new task or metric has requirements, add a specific `requirements.txt` file with your evaluation.
 
 ### Adding a new task
 To add a new task, first either open an issue, to determine whether it will be integrated in the core evaluations of lighteval, or in the community tasks, and **add its dataset** on the hub.
@@ -244,6 +245,32 @@ Copy the `community_tasks/_template.yml` to `community_tasks/yourevalname.py` an
 
 Make sure you can launch your model with your new task using `--tasks community|yournewtask|2|0 --custom_tasks community_tasks/yourevalname.py`.
 
+### Adding a new metric
+First check if you can use one of the parametrized functions in `src.lighteval.metrics.metrics_corpus` or `src.lighteval.metrics.metrics_sample`.
+
+If not, you can use the custom_task system to register your new metric:
+- create a new Python file which should contain the full logic of your metric.
+- the file also needs to start with these imports:
+```python
+from aenum import extend_enum
+from lighteval.metrics import Metrics
+
+# And any other class you might need to redefine your specific metric, depending on whether it's a sample or corpus metric.
+```
+
+- and it should end with the following, so that your metric is added to our metrics list when the file is loaded as a module.
+
+```python
+# Adds the metric to the metric list!
+extend_enum(Metrics, "metric_name", metric_function)
+if __name__ == "__main__":
+    print("Imported metric")
+```
+
+You can then give your custom metric to lighteval by using `--custom_tasks path_to_your_file` when launching it.
+
+To see an example of a custom metric added along with a custom task, look at `tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py`.
+
 ## Available metrics
 ### Metrics for multiple choice tasks
 These metrics use log-likelihood of the different possible targets.
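
As a compact companion to the README section reconstructed above, here is a minimal sketch of what such a custom-metric file can look like. The metric name `word_count_metric` and its scoring logic are hypothetical placeholders; the lighteval constructs it uses (`SampleLevelMetricGrouping`, `MetricCategory`, `MetricUseCase`, `extend_enum(Metrics, ...)`) are the same ones that appear in `ifeval.py` below.

```python
import numpy as np
from aenum import extend_enum

from lighteval.metrics import Metrics
from lighteval.metrics.utils import MetricCategory, MetricUseCase, SampleLevelMetricGrouping
from lighteval.tasks.requests import Doc


# Hypothetical sample-level metric: scores each generation by its word count.
def word_count_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
    return {"word_count": len(predictions[0].split())}


word_count_grouping = SampleLevelMetricGrouping(
    metric=["word_count"],
    higher_is_better={"word_count": True},
    category=MetricCategory.GENERATIVE,
    use_case=MetricUseCase.ACCURACY,
    sample_level_fn=word_count_metric,
    corpus_level_fn={"word_count": np.mean},  # average the per-sample counts
)

# Adds the metric to the metric list when the file is passed via --custom_tasks
extend_enum(Metrics, "word_count_metric", word_count_grouping)

if __name__ == "__main__":
    print("Imported metric")
```

Loading such a file via `--custom_tasks` then adds the metric to the metrics list, as the README section describes.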

pyproject.toml

Lines changed: 0 additions & 1 deletion
@@ -60,7 +60,6 @@ dependencies = [
   "termcolor==2.3.0",
   "pytablewriter",
   "colorama",
-
   # Extension of metrics
   "aenum==3.1.15",
   # Base metrics

src/lighteval/metrics/__init__.py

Lines changed: 1 addition & 1 deletion
@@ -58,7 +58,7 @@ def apply_generative_metric(results: list[ModelReturn], formatted_doc: Doc, metr
     # Extracting gold
     try:
         golds = formatted_doc.get_golds()
-    except KeyError:
+    except (KeyError, IndexError):
         golds = None
 
     # Specific process for HELM like evals # hrm
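
The widened `except` clause above tolerates documents whose gold answer cannot be extracted at all, not just documents missing the field. Below is a standalone sketch of the failure mode it guards against, assuming (hypothetically) that gold extraction indexes into a doc's choices; the helper is a stand-in, not the lighteval implementation of `get_golds`.

```python
# Hypothetical stand-in for gold extraction: index into a doc's choices.
def get_golds(choices: list[str], gold_index: list[int]) -> list[str]:
    return [choices[i] for i in gold_index]


try:
    golds = get_golds(choices=[], gold_index=[0])  # doc with no usable reference answer
except (KeyError, IndexError):  # catching only KeyError would let the IndexError escape
    golds = None

print(golds)  # -> None, so generative scoring proceeds without a gold reference
```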

tasks_examples/custom_tasks_with_custom_metrics/ifeval/ifeval.py

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
+import numpy as np
+from aenum import extend_enum
+
+import tasks_examples.custom_tasks_with_custom_metrics.ifeval.instructions_registry as instructions_registry
+from lighteval.metrics import Metrics
+from lighteval.metrics.utils import (
+    MetricCategory,
+    MetricUseCase,
+    SampleLevelMetricGrouping,
+)
+from lighteval.tasks.lighteval_task import LightevalTaskConfig
+from lighteval.tasks.requests import Doc
+
+
+# We create the task config
+ifeval = LightevalTaskConfig(
+    name="ifeval",
+    prompt_function="ifeval_prompt",
+    suite=["custom"],
+    hf_repo="wis-k/instruction-following-eval",
+    hf_subset="default",
+    metric=["ifeval_metric"],
+    hf_avail_splits=["train"],
+    evaluation_splits=["train"],
+    few_shots_split="train",
+    few_shots_select="random_sampling",
+    generation_size=1280,
+    stop_sequence=[],  # no stop sequence, will use eot token
+)
+
+
+# very specific task where there are no precise outputs but instead we test if the format obeys rules
+def ifeval_prompt(line, task_name: str = None):
+    return Doc(
+        task_name=task_name,
+        query=line["prompt"],
+        choices=[""],
+        gold_index=0,
+        instruction="",
+        specific={"instructions_id_list": line["instruction_id_list"], "kwargs": line["kwargs"]},
+    )
+
+
+submetric_names = [
+    "prompt_level_strict_acc",
+    "inst_level_strict_acc",
+    "prompt_level_loose_acc",
+    "inst_level_loose_acc",
+]
+
+
+def ifeval_metric(predictions: list[str], formatted_doc: Doc, **kwargs) -> dict:
+    response = predictions[0]
+
+    # Strict instructions
+    instruction_list = formatted_doc.specific["instructions_id_list"]
+    all_kwargs = formatted_doc.specific["kwargs"]
+    prompt = formatted_doc.query
+
+    # Loose instructions
+    r = response.split("\n")
+    response_remove_first = "\n".join(r[1:]).strip()
+    response_remove_last = "\n".join(r[:-1]).strip()
+    response_remove_both = "\n".join(r[1:-1]).strip()
+    revised_response = response.replace("*", "")
+    revised_response_remove_first = response_remove_first.replace("*", "")
+    revised_response_remove_last = response_remove_last.replace("*", "")
+    revised_response_remove_both = response_remove_both.replace("*", "")
+    all_responses = [
+        response,
+        revised_response,
+        response_remove_first,
+        response_remove_last,
+        response_remove_both,
+        revised_response_remove_first,
+        revised_response_remove_last,
+        revised_response_remove_both,
+    ]
+
+    is_following_list_strict = []
+    is_following_list_loose = []
+
+    for index, instruction_id in enumerate(instruction_list):
+        instruction_cls = instructions_registry.INSTRUCTION_DICT[instruction_id]
+        instruction = instruction_cls(instruction_id)
+
+        # Remove None values from kwargs to avoid unexpected keyword argument errors in build_description method.
+        task_kwargs = {k: v for k, v in all_kwargs[index].items() if v}
+        instruction.build_description(**task_kwargs)
+        args = instruction.get_instruction_args()
+        if args and "prompt" in args:
+            instruction.build_description(prompt=prompt)
+
+        # Strict
+        if response.strip() and instruction.check_following(response):
+            is_following_list_strict.append(True)
+        else:
+            is_following_list_strict.append(False)
+
+        # Loose
+        is_following = False
+        for r in all_responses:
+            if r.strip() and instruction.check_following(r):
+                is_following = True
+                break
+
+        is_following_list_loose.append(is_following)
+
+    return {
+        "prompt_level_strict_acc": int(all(is_following_list_strict)),
+        "inst_level_strict_acc": is_following_list_strict,
+        "prompt_level_loose_acc": int(all(is_following_list_loose)),
+        "inst_level_loose_acc": is_following_list_loose,
+    }
+
+
+def agg_inst_level_acc(items):
+    flat_items = [item for sublist in items for item in sublist]
+    inst_level_acc = sum(flat_items) / len(flat_items)
+    return inst_level_acc
+
+
+ifeval_metrics = SampleLevelMetricGrouping(
+    metric=submetric_names,
+    higher_is_better={n: True for n in submetric_names},
+    category=MetricCategory.GENERATIVE,
+    use_case=MetricUseCase.ACCURACY,
+    sample_level_fn=ifeval_metric,
+    corpus_level_fn={
+        "prompt_level_strict_acc": np.mean,
+        "inst_level_strict_acc": agg_inst_level_acc,
+        "prompt_level_loose_acc": np.mean,
+        "inst_level_loose_acc": agg_inst_level_acc,
+    },
+)
+
+
+_TASKS = [ifeval]
+
+# Convert to dict for lighteval
+TASKS_TABLE = [task.as_dict() for task in _TASKS]
+# Adds the metric to the metric list when this file is loaded as a custom task module
+extend_enum(Metrics, "ifeval_metric", ifeval_metrics)
+
+if __name__ == "__main__":
+    print([t["name"] for t in TASKS_TABLE])
+    print(len(TASKS_TABLE))
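
To make the prompt-level versus instruction-level aggregation above concrete, here is a standalone sanity check in plain Python; the follow/not-follow values are made up and no lighteval imports are needed.

```python
# Two hypothetical samples: the first follows both of its instructions,
# the second follows only one of two.
per_sample_inst_level = [[True, True], [True, False]]


def agg_inst_level_acc(items):
    # Same flattening as in ifeval.py: average over every individual instruction.
    flat_items = [item for sublist in items for item in sublist]
    return sum(flat_items) / len(flat_items)


# Instruction-level accuracy: 3 followed out of 4 instructions -> 0.75
print(agg_inst_level_acc(per_sample_inst_level))

# Prompt-level accuracy (np.mean over int(all(...)) per sample): only the fully
# followed prompt counts -> 0.5
print(sum(int(all(sample)) for sample in per_sample_inst_level) / len(per_sample_inst_level))
```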
