compares sample to sample when doing slow tests#965

Closed
NathanHB wants to merge 509 commits into main from nathan-add-integration-tests

Conversation

@NathanHB
Member

No description provided.

clefourrier and others added 30 commits December 12, 2024 13:07
* removed unused params

* fix issue with task function
* Delete wrong instruction in custom task docs

* Delete wrong code for custom tasks

* Delete wrong code for extended tasks

* Delete wrong code for community tasks

* Delete unnecessary code for community tasks
* Fix paramater -> parameter

* Fix pannel -> panel

* Fix refenrence -> reference
Implement TGI model config from path:
```python
TGIModelConfig.from_path(model_config_path)
```

Follow-up to:
- #434 

Related to:
- #439
* recommended to use concrete version instead of latest

* ruff style

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* init

* adding serverless endpoints back

* updated tests
* Fix definition of public in docstring

* Fix push_to_tensorboard param name in docstring

* Fix docstring style

* Add EvaluationTracker to docs

* Fix docstring style

* Move docstring to class header

* Add attributes to docstring

* Fix style

* Fix style

* Fix style

* Fix style

* Fix style

* Fix style

* Fix internal links in docstring
* Remove unnecessary deepcopy in evaluation_tracker

* Fix style
Fix the alghafa prompt function by explicitly determining the list of choices based on task_name.
(Not all subsets of AlGhafa Native share the same columns.)

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
…istry (#466)

* Fix precedence of default tasks over custom ones in registry

* Revert "Fix precedence of default tasks over custom ones in registry"

This reverts commit 8125ea2.

* Fix comment/warning about precedence of custom over default tasks
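
The precedence rule these commits settle on (custom tasks override default ones) can be sketched as below. This is a minimal illustration under stated assumptions: `DEFAULT_TASKS`, `build_task_table`, and the string placeholders are hypothetical and do not reflect lighteval's actual registry code.

```python
# Hypothetical sketch: names and data are illustrative only,
# not lighteval's actual registry implementation.
import logging

logger = logging.getLogger(__name__)

DEFAULT_TASKS = {"gsm8k": "default gsm8k config", "arc": "default arc config"}


def build_task_table(default_tasks, custom_tasks):
    """Merge two task tables so that custom entries override default ones."""
    overridden = sorted(set(default_tasks) & set(custom_tasks))
    if overridden:
        # Warn so that an accidental name clash does not go unnoticed.
        logger.warning("Custom tasks override default tasks: %s", overridden)
    # In a dict merge the later mapping wins, so custom tasks take precedence.
    return {**default_tasks, **custom_tasks}


custom = {"gsm8k": "custom gsm8k config"}
table = build_task_table(DEFAULT_TASKS, custom)
```

Here `table["gsm8k"]` resolves to the custom entry, while `table["arc"]` keeps its default.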
* fix: LightevalTaskConfig.stop_sequence attribute

* fix: linter

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
This PR enables running inference with any model provider supported by litellm, and using litellm for LLM-as-a-judge.

---------

Co-authored-by: Egor Lebedev <egor.lebe@inbox.ru>
Co-authored-by: Kryvich <44714498+Kryuski@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Nazim Ali <nazimali@gmail.com>
Co-authored-by: vsabolcec <60775189+vsabolcec@users.noreply.github.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.co>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
This PR does 3 things:

* Provides a homogenized API for specifying model generation parameters in model configs. These parameters are passed to every backend that accepts them (vllm, OpenAI, tgi, transformers, ...)
* Renames BaseModel to TransformersModel
* Allows TransformersModel instances to use a transformers.GenerationConfig object directly when created programmatically

I would put system_prompt, fewshot_seeds, and use_chat_template in the GenerationParameters too since they are generation parameters logically, but it can be another PR
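
One way to picture the homogenized parameter API described above is a single dataclass that each backend translates into its own keyword names. The sketch below is an assumption-laden illustration: `GenerationParams` and its two conversion methods are hypothetical names, and the real lighteval class may expose different fields and methods.

```python
# Hypothetical sketch: class and method names are assumptions,
# not the API introduced by this PR.
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class GenerationParams:
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_new_tokens: Optional[int] = None

    def _set_fields(self):
        # Only forward parameters the user actually set.
        return {k: v for k, v in asdict(self).items() if v is not None}

    def to_transformers_style(self):
        # transformers generation uses these names as-is.
        return self._set_fields()

    def to_openai_style(self):
        # OpenAI-style APIs call the token limit "max_tokens".
        renames = {"max_new_tokens": "max_tokens"}
        return {renames.get(k, k): v for k, v in self._set_fields().items()}


params = GenerationParams(temperature=0.0, max_new_tokens=256)
openai_kwargs = params.to_openai_style()
transformers_kwargs = params.to_transformers_style()
```

Keeping the renames in one place means a model config states each parameter once, and every backend receives it under the name it expects.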

---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
---------

Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* Made litellm judge backend more robust.

* Added failed flag to ModelResponse.

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* Fix T_co import bug

* Fix styling
* extract matching

* better docstring

* lazy imports

* bump up math

* Update src/lighteval/metrics/dynamic_metrics.py

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

* fix pr comments

* Apply suggestions from code review

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

* rename comparisson -> comparison

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* extract matching

* better docstring

* lazy imports

* bump up math

* Update src/lighteval/metrics/dynamic_metrics.py

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

* fix pr comments

* Apply suggestions from code review

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>

* rename comparisson -> comparison

* fix expr numbers extraction with currency or units

* add test for correct extraction of failed answer

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
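
To illustrate the kind of problem the "currency or units" fix addresses, here is a small regex-based sketch that pulls a number out of an answer even when it is wrapped in a currency symbol or followed by a unit. The pattern and the `extract_number` helper are hypothetical and far simpler than whatever the extraction code in lighteval actually does.

```python
# Illustrative only: the pattern and helper name are assumptions,
# not the extraction logic used in lighteval.
import re

NUMBER_RE = re.compile(
    r"[$€£]?\s*"  # optional leading currency symbol
    r"(-?\d{1,3}(?:,\d{3})+(?:\.\d+)?|-?\d+(?:\.\d+)?)"  # number, optional thousands separators
    r"\s*(?:%|[a-zA-Z]+)?"  # optional trailing percent sign or unit
)


def extract_number(text):
    """Return the first number in text as a float, or None if there is none."""
    match = NUMBER_RE.search(text)
    if match is None:
        return None
    # Strip thousands separators before converting.
    return float(match.group(1).replace(",", ""))
```

With this sketch, `extract_number("$1,250.50")` and `extract_number("42 km")` both yield plain floats, and a string without digits yields `None`.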
* Made litellm judge backend more robust.

* Added failed flag to ModelResponse.

* Fixed wrong model response.

* Removed model response and replaced with string.

---------

Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
@NathanHB NathanHB self-assigned this Sep 17, 2025
@NathanHB NathanHB requested a review from Copilot September 19, 2025 12:04
Contributor

Copilot AI left a comment

Pull Request Overview

This PR extends the slow tests to compare individual sample details against reference outputs, rather than checking only high-level aggregate metrics.

  • Introduces a new sample comparison module that compares model responses, metrics, and document information at the individual sample level
  • Modifies test execution to enable detailed logging and return both results and sample details
  • Updates pipeline modules to return detailed sample information alongside standard results
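
A sample-level comparison of the kind summarized above might look roughly like the following. This is a sketch under assumptions: the `compare_samples` function and the `"response"` / `"metrics"` field names are hypothetical and are not taken from `tests/slow_tests/sample_comparison.py`.

```python
# Hypothetical sketch: function and field names are assumptions,
# not the contents of tests/slow_tests/sample_comparison.py.
import math


def compare_samples(actual, reference, tol=1e-6):
    """Return a list of human-readable differences between two sample lists."""
    diffs = []
    if len(actual) != len(reference):
        diffs.append(f"sample count: got {len(actual)}, expected {len(reference)}")
    for i, (got, ref) in enumerate(zip(actual, reference)):
        # Model responses are compared exactly.
        if got["response"] != ref["response"]:
            diffs.append(f"sample {i}: response differs")
        # Metrics are compared within a small absolute tolerance.
        for name, ref_value in ref["metrics"].items():
            got_value = got["metrics"].get(name)
            if got_value is None or not math.isclose(got_value, ref_value, abs_tol=tol):
                diffs.append(
                    f"sample {i}: metric {name!r} got {got_value}, expected {ref_value}"
                )
    return diffs


reference = [
    {"response": "4", "metrics": {"em": 1.0}},
    {"response": "7", "metrics": {"em": 0.0}},
]
actual = [
    {"response": "4", "metrics": {"em": 1.0}},
    {"response": "8", "metrics": {"em": 0.0}},
]
differences = compare_samples(actual, reference)
```

Returning a list of differences instead of failing on the first mismatch lets a test report every diverging sample in one run.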

Reviewed Changes

Copilot reviewed 14 out of 68 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| tests/slow_tests/sample_comparison.py | New module implementing sample-by-sample comparison logic and formatting |
| tests/slow_tests/test_vllm_model.py | Enhanced to use sample comparison with detailed logging enabled |
| tests/slow_tests/test_accelerate_vlm_model.py | Enhanced to use sample comparison with detailed logging enabled |
| tests/slow_tests/test_accelerate_model.py | Enhanced to use sample comparison with detailed logging enabled |
| src/lighteval/pipeline.py | Added method to retrieve detailed sample information |
| src/lighteval/main_vllm.py | Modified to return both results and details |
| src/lighteval/main_accelerate.py | Modified to return both results and details |
| src/lighteval/main_sglang.py | Modified to return both results and details |
| src/lighteval/tasks/default_tasks.py | Fixed HuggingFace repository path for GSM8K dataset |
| examples/model_configs/vllm_model_config.yaml | Changed temperature from 0.1 to 0.0 for deterministic results |
| .gitattributes | Added LFS tracking for reference detail parquet files |
| tests/reference_details/ | Added reference detail parquet files for vision model tests |

NathanHB and others added 10 commits September 19, 2025 14:07
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…/lighteval into nathan-add-integration-tests
@NathanHB NathanHB closed this Sep 19, 2025
@NathanHB NathanHB force-pushed the nathan-add-integration-tests branch from 37e0098 to 5eba9f3 Compare September 19, 2025 13:32