Adds inspectai #1022
Conversation
Pull Request Overview
This PR adds integration with the Inspect AI framework to enable evaluation using Inspect AI's data model and scorers. It introduces support for new task configuration fields (solver, scorer, sample_fields, sample_to_fewshot, filter) and implements Inspect AI-compatible scorers for math and multiple-choice evaluations.
Key Changes
- Added Inspect AI-compatible scorers (`math_scorer`, `multichoice_scorer`) and custom task scorers for IFEval and IFBench
- Introduced new task configuration fields to support Inspect AI's Sample-based evaluation flow (see the sketch after this list)
- Modified the `get_extraction_regexes` function signature to accept `len_choices` as a parameter instead of extracting it from a `Doc` object
- Added an `InspectAIModelConfig` class to support Inspect AI model configuration
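
To make the Sample-based flow concrete, here is a minimal sketch of what the conversion hooks could look like for a simple QA task. The function names and record keys are illustrative assumptions, not code from this PR; only `inspect_ai.dataset.Sample` is the real Inspect AI type.

```python
from inspect_ai.dataset import Sample


def record_to_sample(record: dict) -> Sample:
    # Map a raw dataset row onto Inspect AI's Sample data model
    # (presumably the kind of callable the new sample_fields field takes).
    return Sample(input=record["question"], target=record["answer"])


def sample_to_fewshot(sample: Sample) -> str:
    # Render an already-converted Sample as few-shot context text.
    return f"Question: {sample.input}\nAnswer: {sample.target}"
```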
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/lighteval/tasks/tasks/ifeval/main.py | Adds Inspect AI scorer and sample conversion for IFEval task evaluation |
| src/lighteval/tasks/tasks/ifbench/main.py | Adds Inspect AI scorer and sample conversion for IFBench task evaluation |
| src/lighteval/tasks/tasks/hle/main.py | Adds Inspect AI sample conversion and model-graded fact checker for HLE task |
| src/lighteval/tasks/tasks/gsm_plus.py | Adds math scorer with prompt template and sample conversion for GSM Plus task |
| src/lighteval/tasks/tasks/gsm8k.py | Adds math scorer with prompt template and sample conversion for GSM8K task |
| src/lighteval/tasks/tasks/gpqa.py | Adds multiple-choice solver and choice scorer with random answer shuffling for GPQA task |
| src/lighteval/tasks/tasks/aime.py | Adds math scorer with prompt template and sample conversion for AIME task |
| src/lighteval/tasks/tasks/agieval.py | Adds multiple-choice solver and choice scorer with sample conversion for AGIEval task |
| src/lighteval/tasks/lighteval_task.py | Adds Inspect AI compatible configuration fields to LightevalTaskConfig |
| src/lighteval/models/abstract_model.py | Adds InspectAIModelConfig class for Inspect AI model configuration |
| src/lighteval/metrics/utils/extractive_match_utils.py | Refactors get_extraction_regexes to accept len_choices parameter instead of Doc object |
| src/lighteval/metrics/metrics.py | Implements math_scorer and multichoice_scorer for Inspect AI integration |
| src/lighteval/main.py | Adds Inspect AI evaluation backend command |
Comments suppressed due to low confidence (1)
src/lighteval/metrics/metrics.py:108: This statement is unreachable (`return Score(value=1)`).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
do you need a review at this stage?
lgtm, a couple of nits atm, but cool work, looking forward to how the code base will simplify with inspect!
You're also missing a doc update for the whole new feature.

    def multichoice_scorer():
        language = Language.ENGLISH
        gold_extraction_target = (
            IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
        )
        pred_extraction_target = (
            IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
        )
        fallback_mode = "first_match"
        extraction_mode = "first_match"
        timeout_seconds = 5

        gold_extraction_regexes = get_extraction_regexes_inspect(gold_extraction_target, language)
        pred_extraction_regexes = get_extraction_regexes_inspect(pred_extraction_target, language)

this pattern of nested functions behaving as classes is really meh for legibility, customizability and maintainability
definitely could be better! but that's how inspect is expecting it. Will work on a better format once we start having more metrics compatible with it.
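
For readers unfamiliar with Inspect AI, the nested-function shape comes from its `@scorer` decorator protocol: a decorated factory returns an async `score` function. A minimal sketch of that shape, assuming a simple exact-match check rather than this PR's extraction logic:

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def exact_match_scorer():
    # The factory holds configuration; the inner coroutine does the scoring.
    async def score(state: TaskState, target: Target) -> Score:
        correct = state.output.completion.strip() == target.text.strip()
        return Score(value=1 if correct else 0)

    return score
```

The locals in the quoted `multichoice_scorer` snippet play the role that class attributes otherwise would, which is what the comment above is reacting to.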

    max_tokens: int | None = None
    system_message: str | None = None
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    frequence_penalty: float | None = None
    presence_penalty: float | None = None
    seed: int | None = None
    stop_seqs: list[str] | None = None
    num_choices: int | None = None
    best_of: int | None = None
    log_probs: bool | None = None
    top_logprobs: int | None = None
    cache_prompt: bool | None = None
    reasoning_effort: int | None = None
    reasoning_tokens: int | None = None
    reasoning_history: bool | None = None
    response_format: str | None = None
    parallel_tool_calls: bool | None = None
    max_tool_output: int | None = None
    internal_tools: bool | None = None

could we factorize with the other model classes?
we could but I feel like the configs are not really related and it would add complexity. I would prefer having a bit of repetition for better clarity.
we'll move everything to inspect anyway no?
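
For context, the fields quoted above closely mirror Inspect AI's `GenerateConfig`. A hedged sketch of how a populated `InspectAIModelConfig` might be translated; the `to_generate_config` helper and the field subset are illustrative, not the PR's code:

```python
from inspect_ai.model import GenerateConfig


def to_generate_config(cfg) -> GenerateConfig:
    # Forward a handful of generation parameters; anything left as None
    # falls back to Inspect AI's own defaults.
    return GenerateConfig(
        max_tokens=cfg.max_tokens,
        temperature=cfg.temperature,
        top_p=cfg.top_p,
        top_k=cfg.top_k,
        seed=cfg.seed,
        stop_seqs=cfg.stop_seqs,
    )
```

Keeping the lighteval-side config flat, as in this PR, makes that translation essentially field-by-field.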

    # Inspect AI compatible parameters
    solver: None = None
    scorer: None = None

Would make sense to factorize to avoid having two different ways to launch evals; it will mess up the source of truth otherwise.
factorize with current metrics? it would be quite messy I think. Having it separate for now seems better, as we will eventually only use the scorer mechanic!

    def results_to_markdown_table(
        results_per_model_per_task,
        metric: str = "accuracy",
        stderr_metric: str = "stderr",
        max_total_columns: int | None = None,
        means_only_task_threshold: int = 10,
    ) -> str:
        cols = _collect_columns(results_per_model_per_task, means_only_task_threshold, max_total_columns)

        writer = MarkdownTableWriter()
        writer.headers = ["Model"] + cols

        rows = []
        for model in sorted(results_per_model_per_task.keys()):
            row = [model]
            data = results_per_model_per_task[model]
            for col in cols:
                row.append(_format_metric_cell(data, col, metric, stderr_metric))
            rows.append(row)

        writer.value_matrix = rows
        return writer.dumps()

could you reuse the output functions we already have?
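
For illustration, a hypothetical call to the quoted helper. The nested shape of `results_per_model_per_task` and the numbers are made up for the example, inferred from how the function indexes the dict:

```python
results_per_model_per_task = {
    "llama-3.1-8b-instruct": {
        "gsm8k": {"accuracy": 0.82, "stderr": 0.011},
        "aime25": {"accuracy": 0.12, "stderr": 0.032},
    },
}

# Renders a Markdown table with one row per model and one column per task.
print(results_to_markdown_table(results_per_model_per_task))
```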
Adds inspect-ai as a backend for lighteval! This offloads backend implementation and maintenance.

Tasks compatible with inspect-ai (eventually, all tasks will be compatible):

Run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`. Result:

Compare few-shot diff on gsm8k
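
The new backend presumably builds an `inspect_ai` `Task` from each lighteval task config and hands it to `inspect_ai.eval`. A rough sketch of that idea; the dataset, solver, scorer, and model string below are placeholders, not what lighteval actually wires up:

```python
from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

# A toy one-sample task standing in for a converted lighteval task.
task = Task(
    dataset=[Sample(input="What is 2 + 2?", target="4")],
    solver=[generate()],
    scorer=match(),
)

# Model strings follow Inspect AI's provider/model convention.
eval(task, model="openai/gpt-4o-mini")
```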