Compares sample to sample when doing slow tests #965
Closed
Conversation
* removed unused params
* fix issue with task function

* Delete wrong instruction in custom task docs
* Delete wrong code for custom tasks
* Delete wrong code for extended tasks
* Delete wrong code for community tasks
* Delete unnecessary code for community tasks

* Fix paramater -> parameter
* Fix pannel -> panel
* Fix refenrence -> reference

* recommended to use concrete version instead of latest
* ruff style
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>

* init
* adding serverless endpoints back
* updated tests

* Fix definition of public in docstring
* Fix push_to_tensorboard param name in docstring
* Fix docstring style
* Add EvaluationTracker to docs
* Move docstring to class header
* Add attributes to docstring
* Fix style
* Fix internal links in docstring

* Remove unnecessary deepcopy in evaluation_tracker
* Fix style
Fix alghafa prompt function by explicitly determining the list of choices based on task_name (not all subsets of AlGhafa Native share the same columns).
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
* fix: LightevalTaskConfig.stop_sequence attribute
* fix: linter
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
This PR enables running inference using any model provider supported by litellm, as well as using litellm for LLM-as-a-judge.
---------
Co-authored-by: Egor Lebedev <egor.lebe@inbox.ru>
Co-authored-by: Kryvich <44714498+Kryuski@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
Co-authored-by: Nazim Ali <nazimali@gmail.com>
Co-authored-by: vsabolcec <60775189+vsabolcec@users.noreply.github.com>
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Nathan Habib <nathan.habib@huggingface.co>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
This PR does three things:

* Provides a homogenized API for supplying model generation parameters in model configs. Those parameters are passed to every model that can take them (vllm, open ai, tgi, transformers, ...).
* Renames BaseModel to TransformersModel.
* Allows TransformersModel to use a transformers.GenerationConfig object directly when created programmatically.

I would also put system_prompt, fewshot_seeds, and use_chat_template in the GenerationParameters, since they are logically generation parameters, but that can be another PR.
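A homogenized generation-parameters API along these lines can be sketched as follows. This is a minimal illustration only: the field names, `from_dict`, and `to_backend_kwargs` are assumptions for the sketch, not lighteval's actual implementation.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class GenerationParameters:
    """One set of generation knobs shared by every backend (hypothetical sketch)."""
    temperature: Optional[float] = None
    top_p: Optional[float] = None
    max_new_tokens: Optional[int] = None
    seed: Optional[int] = None

    @classmethod
    def from_dict(cls, config: dict) -> "GenerationParameters":
        # Read the nested "generation" section of a model config, ignoring unknown keys.
        section = config.get("generation", {})
        known = set(cls.__dataclass_fields__)
        return cls(**{k: v for k, v in section.items() if k in known})

    def to_backend_kwargs(self) -> dict:
        # Drop unset fields so each backend's own defaults still apply.
        return {k: v for k, v in asdict(self).items() if v is not None}


params = GenerationParameters.from_dict(
    {"generation": {"temperature": 0.0, "max_new_tokens": 256, "unknown_key": 1}}
)
print(params.to_backend_kwargs())  # {'temperature': 0.0, 'max_new_tokens': 256}
```

The point of the shared dataclass is that every backend (vllm, tgi, transformers, ...) consumes the same validated structure instead of each parsing the raw config dict itself.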
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Albert Villanova del Moral <8515462+albertvillanova@users.noreply.github.com>
---------
Co-authored-by: Nathan Habib <30601243+NathanHB@users.noreply.github.com>
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* Made litellm judge backend more robust.
* Added failed flag to ModelResponse.
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
* extract matching
* better docstring
* lazy imports
* bump up math
* Update src/lighteval/metrics/dynamic_metrics.py
* fix PR comments
* Apply suggestions from code review
* rename comparisson -> comparison
* fix expr numbers extraction with currency or units
* add test for correct extraction of failed answer
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
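A stripped-down version of extracting a numeric answer while tolerating currency symbols, thousands separators, and trailing units might look like this. The regex and the `extract_number` helper are illustrative assumptions, not the code from `dynamic_metrics.py`.

```python
import re
from typing import Optional

# Matches an optionally signed number with optional comma separators and decimals.
NUMBER_RE = re.compile(r"[-+]?\d[\d,]*(?:\.\d+)?")


def extract_number(text: str) -> Optional[float]:
    """Pull the last number out of a string, ignoring currency symbols and unit words."""
    matches = NUMBER_RE.findall(text)
    if not matches:
        return None
    # Take the last match (final answers usually come last) and drop thousands separators.
    return float(matches[-1].replace(",", ""))


print(extract_number("The total cost is $1,234.50 per year"))  # 1234.5
print(extract_number("no digits here"))  # None
```

Returning `None` instead of raising on a missing number lets the caller score the sample as a failed extraction, which is what the "test for correct extraction of failed answer" item above checks for.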
* Made litellm judge backend more robust.
* Added failed flag to ModelResponse.
* Fixed wrong model response.
* Removed model response and replaced with string.
---------
Co-authored-by: Clémentine Fourrier <22726840+clefourrier@users.noreply.github.com>
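The robustness change above could be imagined as a retry wrapper that flags a response as failed instead of crashing the run. `JudgeResponse` and `call_judge` here are hypothetical names for the sketch, not litellm's or lighteval's real API.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass
class JudgeResponse:
    text: str
    failed: bool = False


def call_judge(
    send_request: Callable[[str], str],
    prompt: str,
    max_retries: int = 3,
    backoff: float = 0.5,
) -> JudgeResponse:
    """Retry the judge call; on persistent failure return a flagged empty response."""
    for attempt in range(max_retries):
        try:
            return JudgeResponse(text=send_request(prompt))
        except Exception:
            # Exponential backoff between attempts.
            time.sleep(backoff * (2 ** attempt))
    return JudgeResponse(text="", failed=True)


def flaky_backend(prompt: str) -> str:
    raise RuntimeError("backend down")


# A backend that always errors ends up flagged rather than raising.
print(call_judge(flaky_backend, "Rate this answer", backoff=0.0).failed)  # True
```

Carrying the `failed` flag on the response lets downstream metric code skip or penalize the sample explicitly instead of silently treating a transport error as a judgment.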
…/lighteval into nathan-add-integration-tests
Contributor
Pull Request Overview
This PR enhances slow tests to enable sample-by-sample comparison by comparing individual sample details against reference outputs, going beyond just checking high-level aggregate metrics.
- Introduces a new sample comparison module that compares model responses, metrics, and document information at the individual sample level
- Modifies test execution to enable detailed logging and return both results and sample details
- Updates pipeline modules to return detailed sample information alongside standard results
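The per-sample comparison described above could be sketched roughly as follows. The `compare_samples` name and record layout (`prediction`, `metrics`) are illustrative assumptions, not the module's real API.

```python
def compare_samples(actual: list, reference: list, atol: float = 1e-6) -> list:
    """Compare each run sample against its reference row and collect readable diffs."""
    diffs = []
    if len(actual) != len(reference):
        diffs.append(f"sample count mismatch: {len(actual)} vs {len(reference)}")
    for i, (a, r) in enumerate(zip(actual, reference)):
        # Compare the raw model output first, then each per-sample metric value.
        if a.get("prediction") != r.get("prediction"):
            diffs.append(f"sample {i}: prediction differs")
        for metric, ref_value in r.get("metrics", {}).items():
            run_value = a.get("metrics", {}).get(metric)
            if run_value is None or abs(run_value - ref_value) > atol:
                diffs.append(f"sample {i}: metric {metric!r} {run_value} != {ref_value}")
    return diffs


run = [{"prediction": "4", "metrics": {"em": 1.0}}]
ref = [{"prediction": "4", "metrics": {"em": 0.0}}]
print(compare_samples(run, ref))  # ["sample 0: metric 'em' 1.0 != 0.0"]
```

Diffing individual samples this way pinpoints which document regressed, whereas the old aggregate-metric check could only say that the overall score moved.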
Reviewed Changes
Copilot reviewed 14 out of 68 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| tests/slow_tests/sample_comparison.py | New module implementing sample-by-sample comparison logic and formatting |
| tests/slow_tests/test_vllm_model.py | Enhanced to use sample comparison with detailed logging enabled |
| tests/slow_tests/test_accelerate_vlm_model.py | Enhanced to use sample comparison with detailed logging enabled |
| tests/slow_tests/test_accelerate_model.py | Enhanced to use sample comparison with detailed logging enabled |
| src/lighteval/pipeline.py | Added method to retrieve detailed sample information |
| src/lighteval/main_vllm.py | Modified to return both results and details |
| src/lighteval/main_accelerate.py | Modified to return both results and details |
| src/lighteval/main_sglang.py | Modified to return both results and details |
| src/lighteval/tasks/default_tasks.py | Fixed HuggingFace repository path for GSM8K dataset |
| examples/model_configs/vllm_model_config.yaml | Changed temperature from 0.1 to 0.0 for deterministic results |
| .gitattributes | Added LFS tracking for reference detail parquet files |
| tests/reference_details/ | Added reference detail parquet files for vision model tests |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
…/lighteval into nathan-add-integration-tests
37e0098 to 5eba9f3
No description provided.