Adds inspectai #1022
Conversation
Pull Request Overview
This PR adds integration with the Inspect AI framework to enable evaluation using Inspect AI's data model and scorers. It introduces support for new task configuration fields (solver, scorer, sample_fields, sample_to_fewshot, filter) and implements Inspect AI-compatible scorers for math and multiple-choice evaluations.
Key Changes
- Added Inspect AI-compatible scorers (`math_scorer`, `multichoice_scorer`) and custom task scorers for IFEval and IFBench
- Introduced new task configuration fields to support Inspect AI's Sample-based evaluation flow (see the sketch after this list)
- Modified the `get_extraction_regexes` function signature to accept `len_choices` as a parameter instead of extracting it from a `Doc` object
- Added an `InspectAIModelConfig` class to support Inspect AI model configuration
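
To make the Sample-based flow concrete, here is a minimal sketch of what the conversion hooks could look like for a simple QA task. The function names and record keys are illustrative assumptions, not code from this PR; only `inspect_ai.dataset.Sample` is the real Inspect AI type.

```python
from inspect_ai.dataset import Sample


def record_to_sample(record: dict) -> Sample:
    # Map a raw dataset row onto Inspect AI's Sample data model
    # (presumably the kind of callable the new sample_fields field takes).
    return Sample(input=record["question"], target=record["answer"])


def sample_to_fewshot(sample: Sample) -> str:
    # Render an already-converted Sample as few-shot context text.
    return f"Question: {sample.input}\nAnswer: {sample.target}"
```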
Reviewed Changes
Copilot reviewed 13 out of 13 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| src/lighteval/tasks/tasks/ifeval/main.py | Adds Inspect AI scorer and sample conversion for IFEval task evaluation |
| src/lighteval/tasks/tasks/ifbench/main.py | Adds Inspect AI scorer and sample conversion for IFBench task evaluation |
| src/lighteval/tasks/tasks/hle/main.py | Adds Inspect AI sample conversion and model-graded fact checker for HLE task |
| src/lighteval/tasks/tasks/gsm_plus.py | Adds math scorer with prompt template and sample conversion for GSM Plus task |
| src/lighteval/tasks/tasks/gsm8k.py | Adds math scorer with prompt template and sample conversion for GSM8K task |
| src/lighteval/tasks/tasks/gpqa.py | Adds multiple-choice solver and choice scorer with random answer shuffling for GPQA task |
| src/lighteval/tasks/tasks/aime.py | Adds math scorer with prompt template and sample conversion for AIME task |
| src/lighteval/tasks/tasks/agieval.py | Adds multiple-choice solver and choice scorer with sample conversion for AGIEval task |
| src/lighteval/tasks/lighteval_task.py | Adds Inspect AI compatible configuration fields to LightevalTaskConfig |
| src/lighteval/models/abstract_model.py | Adds InspectAIModelConfig class for Inspect AI model configuration |
| src/lighteval/metrics/utils/extractive_match_utils.py | Refactors get_extraction_regexes to accept len_choices parameter instead of Doc object |
| src/lighteval/metrics/metrics.py | Implements math_scorer and multichoice_scorer for Inspect AI integration |
| src/lighteval/main.py | Adds Inspect AI evaluation backend command |
Comments suppressed due to low confidence (1)
src/lighteval/metrics/metrics.py:108: This statement is unreachable (`return Score(value=1)`).
The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.
do you need a review at this stage?
lgtm, a couple of nits atm, but cool work, looking forward to how the code base will simplify with inspect!
You're also missing a doc update for the whole new feature.

    def multichoice_scorer():
        language = Language.ENGLISH
        gold_extraction_target = (
            IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
        )
        pred_extraction_target = (
            IndicesExtractionConfig(prefix_for_extraction="NativeLetters", try_extract_without_anchor=True),
        )
        fallback_mode = "first_match"
        extraction_mode = "first_match"
        timeout_seconds = 5

        gold_extraction_regexes = get_extraction_regexes_inspect(gold_extraction_target, language)
        pred_extraction_regexes = get_extraction_regexes_inspect(pred_extraction_target, language)

this pattern of nested functions behaving as classes is really meh for legibility, customizability and maintainability
definitely could be better! but that's how inspect is expecting it. Will work on a better format once we start having more metrics compatible with it.
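
For readers unfamiliar with Inspect AI, the nested-function shape comes from its `@scorer` decorator protocol: a decorated factory returns an async `score` function. A minimal sketch of that shape, assuming a simple exact-match check rather than this PR's extraction logic:

```python
from inspect_ai.scorer import Score, Target, accuracy, scorer
from inspect_ai.solver import TaskState


@scorer(metrics=[accuracy()])
def exact_match_scorer():
    # The factory holds configuration; the inner coroutine does the scoring.
    async def score(state: TaskState, target: Target) -> Score:
        correct = state.output.completion.strip() == target.text.strip()
        return Score(value=1 if correct else 0)

    return score
```

The locals in the quoted `multichoice_scorer` snippet play the role that class attributes otherwise would, which is what the comment above is reacting to.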

    max_tokens: int | None = None
    system_message: str | None = None
    temperature: float | None = None
    top_p: float | None = None
    top_k: int | None = None
    frequence_penalty: float | None = None
    presence_penalty: float | None = None
    seed: int | None = None
    stop_seqs: list[str] | None = None
    num_choices: int | None = None
    best_of: int | None = None
    log_probs: bool | None = None
    top_logprobs: int | None = None
    cache_prompt: bool | None = None
    reasoning_effort: int | None = None
    reasoning_tokens: int | None = None
    reasoning_history: bool | None = None
    response_format: str | None = None
    parallel_tool_calls: bool | None = None
    max_tool_output: int | None = None
    internal_tools: bool | None = None

could we factorize with the other model classes?
we could but I feel like the configs are not really related and it would add complexity. I would prefer having a bit of repetition for better clarity.
we'll move everything to inspect anyway no?
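
For context, the fields quoted above closely mirror Inspect AI's `GenerateConfig`. A hedged sketch of how a populated `InspectAIModelConfig` might be translated; the `to_generate_config` helper and the field subset are illustrative, not the PR's code:

```python
from inspect_ai.model import GenerateConfig


def to_generate_config(cfg) -> GenerateConfig:
    # Forward a handful of generation parameters; anything left as None
    # falls back to Inspect AI's own defaults.
    return GenerateConfig(
        max_tokens=cfg.max_tokens,
        temperature=cfg.temperature,
        top_p=cfg.top_p,
        top_k=cfg.top_k,
        seed=cfg.seed,
        stop_seqs=cfg.stop_seqs,
    )
```

Keeping the lighteval-side config flat, as in this PR, makes that translation essentially field-by-field.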

    # Inspect AI compatible parameters
    solver: None = None
    scorer: None = None

Would make sense to factorize to avoid having two different ways to launch evals; it will mess up the source of truth otherwise.
factorize with current metrics? it would be quite messy I think. Having it separate for now seems better, as we will eventually only use the scorer mechanic!

    def results_to_markdown_table(
        results_per_model_per_task,
        metric: str = "accuracy",
        stderr_metric: str = "stderr",
        max_total_columns: int | None = None,
        means_only_task_threshold: int = 10,
    ) -> str:
        cols = _collect_columns(results_per_model_per_task, means_only_task_threshold, max_total_columns)

        writer = MarkdownTableWriter()
        writer.headers = ["Model"] + cols

        rows = []
        for model in sorted(results_per_model_per_task.keys()):
            row = [model]
            data = results_per_model_per_task[model]
            for col in cols:
                row.append(_format_metric_cell(data, col, metric, stderr_metric))
            rows.append(row)

        writer.value_matrix = rows
        return writer.dumps()

could you reuse the output functions we already have?
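
For illustration, a hypothetical call to the quoted helper. The nested shape of `results_per_model_per_task` and the numbers are made up for the example, inferred from how the function indexes the dict:

```python
results_per_model_per_task = {
    "llama-3.1-8b-instruct": {
        "gsm8k": {"accuracy": 0.82, "stderr": 0.011},
        "aime25": {"accuracy": 0.12, "stderr": 0.032},
    },
}

# Renders a Markdown table with one row per model and one column per task.
print(results_to_markdown_table(results_per_model_per_task))
```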
Adds inspect-ai as a backend for lighteval! This offloads backend implementation and maintenance.

Tasks compatible with inspect-ai (eventually, all tasks will be compatible):

Run llama3.1-8b using all providers on `hf-inference-providers` on `gpqa`, `agieval` and `aime25`. Result:

Compare few-shot diff on gsm8k
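
The new backend presumably builds an `inspect_ai` `Task` from each lighteval task config and hands it to `inspect_ai.eval`. A rough sketch of that idea; the dataset, solver, scorer, and model string below are placeholders, not what lighteval actually wires up:

```python
from inspect_ai import Task, eval
from inspect_ai.dataset import Sample
from inspect_ai.scorer import match
from inspect_ai.solver import generate

# A toy one-sample task standing in for a converted lighteval task.
task = Task(
    dataset=[Sample(input="What is 2 + 2?", target="4")],
    solver=[generate()],
    scorer=match(),
)

# Model strings follow Inspect AI's provider/model convention.
eval(task, model="openai/gpt-4o-mini")
```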