Compares sample to sample when doing slow tests #965
Closed
Conversation
* removed unused params
* fix issue with task function

* Delete wrong instruction in custom task docs
* Delete wrong code for custom tasks
* Delete wrong code for extended tasks
* Delete wrong code for community tasks
* Delete unnecessary code for community tasks

* Fix paramater -> parameter
* Fix pannel -> panel
* Fix refenrence -> reference

* recommended to use concrete version instead of latest
* ruff style
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

* init
* adding serverless endpoints back
* updated tests

* Fix definition of public in docstring
* Fix push_to_tensorboard param name in docstring
* Fix docstring style
* Add EvaluationTracker to docs
* Move docstring to class header
* Add attributes to docstring
* Fix style
* Fix internal links in docstring

* Remove unnecessary deepcopy in evaluation_tracker
* Fix style
Fix alghafa prompt function by explicitly determining the list of choices based on task_name (not all subsets of AlGhafa Native share the same columns).
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* fix: LightevalTaskConfig.stop_sequence attribute
* fix: linter
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
This PR enables running inference using any model provider supported by litellm, as well as using litellm for LLM-as-a-judge.
---------
Co-authored-by: Egor Lebedev <egor.lebe@inbox.ru>
Co-authored-by: Kryvich <44714498+Kryuski@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Nazim Ali <nazimali@gmail.com>
Co-authored-by: vsabolcec <60775189+vsabolcec@users.noreply.github.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.co>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
This PR does three things:

* Provides a homogenized API for supplying model generation parameters in model configs. Those parameters are passed to every model that can take them (vllm, open ai, tgi, transformers, ...).
* Renames BaseModel to TransformersModel.
* Allows TransformersModel to use a transformers.GenerationConfig object directly when created programmatically.

I would also put system_prompt, fewshot_seeds, and use_chat_template in the GenerationParameters, since they are logically generation parameters, but that can be another PR.
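A homogenized generation-parameters API along these lines can be sketched as follows. This is a minimal illustration only: the field names, `from_dict`, and `to_backend_kwargs` are assumptions for the sketch, not lighteval's actual implementation.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class GenerationParameters:
    """One set of generation knobs shared by every backend (hypothetical sketch)."""
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_new_tokens: Optional[int] = None
    seed: Optional[int] = None

    @classmethod
    def from_dict(cls, config: dict) -> "GenerationParameters":
        # Read the nested "generation" section of a model config, ignoring unknown keys.
        section = config.get("generation", {})
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in section.items() if k in known})

    def to_backend_kwargs(self) -> dict:
        # Drop unset fields so each backend's own defaults still apply.
        return {k: v for k, v in asdict(self).items() if v is not None}


params = GenerationParameters.from_dict(
    {"generation": {"temperature": 0.0, "max_new_tokens": 256, "unknown_key": 1}}
)
print(params.to_backend_kwargs())  # {'temperature': 0.0, 'max_new_tokens': 256}
```

The point of the shared dataclass is that every backend (vllm, tgi, transformers, ...) consumes the same validated structure instead of each parsing the raw config dict itself.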
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* Made litellm judge backend more robust.
* Added failed flag to ModelResponse.
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* extract matching
* better docstring
* lazy imports
* bump up math
* Update src/lighteval/metrics/dynamic_metrics.py
* fix PR comments
* Apply suggestions from code review
* rename comparisson -> comparison
* fix expr numbers extraction with currency or units
* add test for correct extraction of failed answer
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
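A stripped-down version of extracting a numeric answer while tolerating currency symbols, thousands separators, and trailing units might look like this. The regex and the `extract_number` helper are illustrative assumptions, not the code from `dynamic_metrics.py`.

```python
import re
from typing import Optional

# Matches an optionally signed number with optional comma separators and decimals.
NUMBER_RE = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")


def extract_number(text: str) -> Optional[float]:
    """Pull the last number out of a string, ignoring currency symbols and unit words."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    # Take the last match (final answers usually come last) and drop thousands separators.
    return float(matches[-1].replace(",", ""))


print(extract_number("The total cost is $1,234.50 per year"))  # 1234.5
print(extract_number("no digits here"))  # None
```

Returning `None` instead of raising on a missing number lets the caller score the sample as a failed extraction, which is what the "test for correct extraction of failed answer" item above checks for.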
* Made litellm judge backend more robust.
* Added failed flag to ModelResponse.
* Fixed wrong model response.
* Removed model response and replaced with string.
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
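The robustness change above could be imagined as a retry wrapper that flags a response as failed instead of crashing the run. `JudgeResponse` and `call_judge` here are hypothetical names for the sketch, not litellm's or lighteval's real API.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResponse:
    text: str
    failed: bool = False


def call_judge(
    send_request: Callable[[str], str],
    prompt: str,
    max_retries: int = 3,
    backoff: float = 0.5,
) -> JudgeResponse:
    """Retry the judge call; on persistent failure return a flagged empty response."""
    for attempt in range(max_retries):
        try:
            return JudgeResponse(text=send_request(prompt))
        except Exception:
            # Exponential backoff between attempts.
            time.sleep(backoff * (2 ** attempt))
    return JudgeResponse(text="", failed=True)


def flaky_backend(prompt: str) -> str:
    raise RuntimeError("backend down")


# A backend that always errors ends up flagged rather than raising.
print(call_judge(flaky_backend, "Rate this answer", backoff=0.0).failed)  # True
```

Carrying the `failed` flag on the response lets downstream metric code skip or penalize the sample explicitly instead of silently treating a transport error as a judgment.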
…/lighteval into nathan-add-integration-tests
Contributor
Pull Request Overview
This PR enhances slow tests to enable sample-by-sample comparison by comparing individual sample details against reference outputs, going beyond just checking high-level aggregate metrics.
- Introduces a new sample comparison module that compares model responses, metrics, and document information at the individual sample level
- Modifies test execution to enable detailed logging and return both results and sample details
- Updates pipeline modules to return detailed sample information alongside standard results
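The per-sample comparison described above could be sketched roughly as follows. The `compare_samples` name and record layout (`prediction`, `metrics`) are illustrative assumptions, not the module's real API.

```python
def compare_samples(actual: list, reference: list, atol: float = 1e-6) -> list:
    """Compare each run sample against its reference row and collect readable diffs."""
    diffs = []
    if len(actual) != len(reference):
        diffs.append(f"sample count mismatch: {len(actual)} vs {len(reference)}")
    for i, (a, r) in enumerate(zip(actual, reference)):
        # Compare the raw model output first, then each per-sample metric value.
        if a.get("prediction") != r.get("prediction"):
            diffs.append(f"sample {i}: prediction differs")
        for metric, ref_value in r.get("metrics", {}).items():
            run_value = a.get("metrics", {}).get(metric)
            if run_value is None or abs(run_value - ref_value) > atol:
                diffs.append(f"sample {i}: metric {metric!r} {run_value} != {ref_value}")
    return diffs


run = [{"prediction": "4", "metrics": {"em": 1.0}}]
ref = [{"prediction": "4", "metrics": {"em": 0.0}}]
print(compare_samples(run, ref))  # ["sample 0: metric 'em' 1.0 != 0.0"]
```

Diffing individual samples this way pinpoints which document regressed, whereas the old aggregate-metric check could only say that the overall score moved.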
Reviewed Changes
Copilot reviewed 14 out of 68 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/slow_tests/sample_comparison.py | New module implementing sample-by-sample comparison logic and formatting |
| tests/slow_tests/test_vllm_model.py | Enhanced to use sample comparison with detailed logging enabled |
| tests/slow_tests/test_accelerate_vlm_model.py | Enhanced to use sample comparison with detailed logging enabled |
| tests/slow_tests/test_accelerate_model.py | Enhanced to use sample comparison with detailed logging enabled |
| src/lighteval/pipeline.py | Added method to retrieve detailed sample information |
| src/lighteval/main_vllm.py | Modified to return both results and details |
| src/lighteval/main_accelerate.py | Modified to return both results and details |
| src/lighteval/main_sglang.py | Modified to return both results and details |
| src/lighteval/tasks/default_tasks.py | Fixed HuggingFace repository path for GSM8K dataset |
| examples/model_configs/vllm_model_config.yaml | Changed temperature from 0.1 to 0.0 for deterministic results |
| .gitattributes | Added LFS tracking for reference detail parquet files |
| tests/reference_details/ | Added reference detail parquet files for vision model tests |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…/lighteval into nathan-add-integration-tests
37e0098 to 5eba9f3
No description provided.