Releases: huggingface/lighteval
v0.12.0
An exciting release in which we pivot to using inspect-ai as our backend and make tasks much easier to find and add, thanks to a task finder Space.
New Features 🎉
- Registry refactorisation by @clefourrier in #937
- Multilingual extractiveness by @rolshoven in #956
- Added `backend_options` parameter to LLM judges by @rolshoven in #963
- Add automatic tests for metrics by @NathanHB in #939
- Support local GGUF in VLLM and use HF tokenizer #943 by @JIElite in #972
- [RFC] Rework the dependencies to be more versatile by @LysandreJik in #951
- Sample to sample compare for integration tests by @NathanHB in #977
- Move tasks to individual files by @NathanHB in #1016
- Adds inspectai by @NathanHB in #1022
New Tasks
- GSM-PLUS by @NathanHB in #780
- TUMLU-mini by @ceferisbarov in #811
- Filipino Benchmark by @ljvmiranda921 in #852
- MMLU Redux by @clefourrier in #883
- IFBench by @clefourrier in #944
- SLR-Bench by @Ahmad21Omar in #983
- MMLU pro by @NathanHB in #1031
Enhancement ⚙️
- adds `enable_prefix_caching` option to VLLMModelConfig by @GAD-cell in #945
- Added litellm model config options and improved `_prepare_max_new_tokens` by @rolshoven in #967
- always provide parameters in the metric name to allow using several combinations by @clefourrier in #1017
Documentation 📚
- Add org_to_bill parameter to documentation by @tfrere in #781
- Update docs and enforces google's docstring style by @NathanHB in #941
- Fix broken link by @JoelNiklaus in #1014
- Update huggingface-cli login to use newer hf auth login by @Xceron in #1034
Task and Metrics changes 🛠️
- Add Bulgarian and Macedonian literals by @dianaonutu in #769
- Add TranslationLiterals for Language.DANISH by @spyysalo in #770
- Update translation_literals.py with icelandic by @joenaess in #775
- Complete TranslationLiterals for Language.ESTONIAN by @spyysalo in #779
- Update translation_literals.py by @dianaonutu in #923
- Fixing naming for sample evals + adding reqs in aime24 by @clefourrier in #989
- add translation literals for various Indic languages (Bengali, Gujarati, Punjabi, Tamil) by @rpm000 in #1015
Bug Fixes 🐛
- [#794] Fix: Assign SummaCZS instance to `self.summac` in Faithfulness metric by @sahilds1 in #795
- Catch ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 in #812
- Fix GPQA and index extractive metric by @clefourrier in #829
- Update extractive_match_utils.py for words where `:` is preceded by a space by @clefourrier in #831
- fixes from_model function and adds tests by @NathanHB in #921
- fix tasks list by @alielfilali01 in #906
- set upper bound on vllm version by @NathanHB in #964
- Fixed bug that prevented the metrics from being mixed (batched/not batched) by @rolshoven in #958
- Fix inference providers calls by @clefourrier in #1012
- Fixing mixeval by @clefourrier in #1006
- Fix typo in attribute name: CONCURENT_CALLS -> CONCURRENT_CALLS by @muupan in #884
- Added ability to configure concurrent_requests in litellm_model.py by @dameikle in #911
- Added fallback for incomplete configs for vlm models launched as llms by @clefourrier in #828
New Contributors
- @pratyushmaini made their first contribution in #697
- @DeVikingMark made their first contribution in #782
- @sahilds1 made their first contribution in #795
- @dianaonutu made their first contribution in #769
- @tfrere made their first contribution in #781
- @mcleish7 made their first contribution in #812
- @leopardracer made their first contribution in #810
- @spyysalo made their first contribution in #770
- @ceferisbarov made their first contribution in #811
- @joenaess made their first contribution in #775
- @ryantzr1 made their first contribution in #784
- @dtung8068 made their first contribution in #862
- @muupan made their first contribution in #884
- @NouamaneTazi made their first contribution in #841
- @uralik made their first contribution in #887
- @dameikle made their first contribution in #911
- @ljvmiranda921 made their first contribution in #852
- @cpcdoy made their first contribution in #502
- @rolshoven made their first contribution in #958
- @JIElite made their first contribution in #972
- @LysandreJik made their first contribution in #951
- @GAD-cell made their first contribution in #945
- @amstu2 made their first contribution in #986
- @Ahmad21Omar made their first contribution in #983
- @cmpatino made their first contribution in #998
- @rpm000 made their first contribution in #1015
- @Xceron made their first contribution in #1034
Full Changelog: v0.10.0...v0.12.0
v0.11.0
Lighteval v0.11.0
This release introduces major improvements and changes, across usability, stability, performance and documentation.
Highlights include a large refactor to simplify the architecture, automated metric tests, a dependency rework, improved documentation, and new tasks/benchmarks.
Highlights
- Automated tests for metrics and stronger dependency checks
- Continuous batching, caching, and faster CLI with reduced redundancy
- Upgrade to datasets 4.0 and Trackio integration
- Automatic chat template inference and reasoning trace support
- New tasks: GSM-PLUS, TUMLU-mini, IFBench, Filipino benchmarks, MMLU Redux
- Added Bulgarian, Macedonian, Danish, Icelandic, and Estonian literals
- Documentation improvements (Google docstring style, README updates)
What's Changed
New Features
- Automatic inference of chat template usage (no kwargs needed) by @clefourrier (#885)
- More versatile dependency rework by @LysandreJik (#951)
- Automatic tests for metrics by @NathanHB (#939)
- Sample-to-sample comparisons for integration tests by @NathanHB (#977)
- Continuous batching support by @NathanHB (#850) (arthur)
- Refactored code and removed unused parts by @NathanHB (#709)
- Post-processing for reasoning tokens in pipeline by @clefourrier (#882)
- logging of system prompt by @clefourrier (#907)
- Adds Caching of samples by @clefourrier (#909)
- Upgrade to `datasets` 4.0 by @NathanHB (#924)
- Trackio integration when available by @NathanHB (#930)
- Parameterization of sampling evals from CLI by @clefourrier (#926)
- Local GGUF support in VLLM with HF tokenizer by @JIElite (#972)
Enhancement
- `bootstrap_iters` as an argument by @pratyushmaini (#697)
- Load tasks before models by @clefourrier (#931)
- Save `reasoning_content` from litellm as details by @muupan (#929)
- Fix for TGI endpoint inference and JSON grammar generation by @cpcdoy (#502)
- Reduced redundancy in CLI arguments by @NathanHB (#932)
- Registry refactor by @clefourrier (#937)
- Multilingual extractiveness support by @rolshoven (#956)
- Added `backend_options` parameter to LLM judges by @rolshoven (#963)
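The `bootstrap_iters` argument exposed above controls how many bootstrap resamples are used when estimating a metric's standard error. As a minimal, hypothetical sketch of the idea (not lighteval's actual implementation):

```python
import random
import statistics

def bootstrap_stderr(scores, bootstrap_iters=1000, seed=0):
    """Estimate the standard error of the mean score by bootstrap resampling.

    Hypothetical sketch: resample the per-sample scores with replacement
    `bootstrap_iters` times and take the spread of the resampled means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = [
        statistics.fmean(rng.choices(scores, k=n))  # one bootstrap resample
        for _ in range(bootstrap_iters)
    ]
    return statistics.stdev(means)
```

Raising `bootstrap_iters` tightens the estimate at the cost of compute, which is why exposing it as an argument is useful.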
Documentation
- Added `org_to_bill` parameter by @tfrere (#781)
- Updated docs with Google docstring style by @NathanHB (#941)
- Updated README by @NathanHB (#961)
New Tasks
- Added GSM-PLUS by @NathanHB (#780)
- Added TUMLU-mini benchmark, fixed #577 by @ceferisbarov (#811)
- Added Filipino benchmark community tasks by @ljvmiranda921 (#852)
- MMLU Redux and caching fix by @clefourrier (#883)
- Added IFBench by @clefourrier (#944)
Task and Metrics Changes
- Added Bulgarian and Macedonian literals by @dianaonutu (#769)
- Added Danish translation literals by @spyysalo (#770)
- Added Icelandic translation literals by @joenaess (#775)
- Completed Estonian translation literals by @spyysalo (#779)
- Updated `translation_literals.py` by @dianaonutu (#923)
Bug Fixes
- Fixed [#794]: assigned `SummaCZS` instance in Faithfulness metric by @sahilds1 (#795)
- Caught ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 (#812)
- Fixed GPQA and index extractive metric by @clefourrier (#829)
- Updated `extractive_match_utils.py` for cases with `:` by @clefourrier (#831)
- Fixed `from_model` function and added tests by @NathanHB (#921)
- Fixed tasks list by @alielfilali01 (#906)
- Set upper bound on VLLM version by @NathanHB (#964)
- Fixed batching bug in metrics by @rolshoven (#958)
Other Changes
- Fixed typo in attribute name (`CONCURENT_CALLS` → `CONCURRENT_CALLS`) by @muupan (#884)
- Added ability to configure `concurrent_requests` in `litellm_model.py` by @dameikle (#911)
New Contributors
We’re excited to welcome new contributors in this release:
@pratyushmaini, @DeVikingMark, @sahilds1, @dianaonutu, @tfrere, @mcleish7, @leopardracer, @spyysalo, @ceferisbarov, @joenaess, @ryantzr1, @dtung8068, @muupan, @NouamaneTazi, @uralik, @dameikle, @ljvmiranda921, @cpcdoy, @rolshoven, @JIElite, @LysandreJik
Full Changelog: v0.10.0...v0.11.0
v0.10.0
We now support VLMs when using the transformers backend 🥳
What's Changed
New Features 🎉
- Added support for quantization in vLLM backend by @SulRash in #690
- Adds multimodal support and MMMU pro by @NathanHB in #675
- Allow for model kwargs when loading transformers from pretrained by @NathanHB in #754
- Adds template for custom path saving results by @NathanHB in #755
- Nanotron, Multilingual tasks update + misc by @hynky1999 in #756
- Async vllm by @clefourrier in #693
New Tasks
- Adds More Generative tasks by @hynky1999 in #694
- Added Flores by @clefourrier in #717
Task and Metrics changes 🛠️
- Nanotron, Multilingual tasks update + misc by @hynky1999 in #756
- add livecodebench v6 by @Cppowboy in #712
- Add MCQ support to Yourbench evaluation by @alozowski in #734
Other Changes
- Bump ruff version by @NathanHB in #774
- Fix revision arg for vLLM tokenizer by @lewtun in #721
- Update README.md by @clefourrier in #733
- Fix litellm by @NathanHB in #736
New Contributors
- @Cppowboy made their first contribution in #712
- @SulRash made their first contribution in #690
- @Abelgurung made their first contribution in #743
Full Changelog: v0.9.2...v0.10.0
v0.9.2
What's Changed
New Features 🎉
- enable together models and reasoning models as judges. by @JoelNiklaus in #537
- Propagate vLLM batch size controls by @alvin319 in #588
- Integrate huggingface_hub inference support for LLM as Judge by @alozowski in #651
- add cot_prompt in vllm by @HERIUN in #654
- Unify modelargs and use Pydantic for model configs by @NathanHB in #609
- Improve test by @qubvel in #674
- adds wandb logging of metrics by @NathanHB in #676
- Adds wandb logging by @NathanHB in #685
- Added custom model inference. by @JoelNiklaus in #437
- Update split iteration for DynamicBatchingDataset by @qubvel in #684
Documentation 📚
- Add --use-chat-template to the broken litellm example by @eldarkurtic in #614
- Lighteval math by @HERIUN in #630
- Update quicktour command by @qubvel in #679
- fix wrong 'custom_task_directory' in python api doc by @xgwang in #671
- docs: improve consistency in punctuation of metric list by @mariagrandury in #605
New Tasks 📈
- add arc agi 2 by @NathanHB in #642
- Add G-Pass@k Metric by @jnanliu in #589
- adds simpleqa by @NathanHB in #680
Task and Metrics changes 🛠️
- Pass At K Math by @clefourrier in #647
- Use `n=16` samples to estimate `pass@1` for AIME benchmarks by @lewtun in #661
- adding uzbek literals by @shopulatov in #664
- Align AIME pass@1 with literature by @lewtun in #666
- Update LCB prompt & fix newlines by @rawsh in #645
- fix gsm8k metric by @NathanHB in #688
- Add pass@1 for GPQA-D and MATH-500 by @lewtun in #698
Bug Fixes 🐛
- Use `bfloat16` as default for vllm models by @NathanHB in #638
- Fix passing of generation config to main_accelerate by @LoserCheems in #659
- Parse seed for vLLM by @eldarkurtic in #602
- Parse string values for add_special_tokens in vLLM by @eldarkurtic in #598
- hardcode configs to not make lighteval crash if lcb repo unavailable by @NathanHB in #677
- tokenizer 'padding' param is not correct. by @xgwang in #669
- Fix TransformersModel.from_model() method by @Vectorrent in #691
- Inference providers by @clefourrier in #701
New Contributors
- @DerekLiu35 made their first contribution in #620
- @AnikiFan made their first contribution in #610
- @alvin319 made their first contribution in #588
- @alozowski made their first contribution in #643
- @Laz4rz made their first contribution in #613
- @shopulatov made their first contribution in #664
- @HERIUN made their first contribution in #654
- @rawsh made their first contribution in #645
- @qubvel made their first contribution in #674
- @xgwang made their first contribution in #669
- @jnanliu made their first contribution in #589
- @Vectorrent made their first contribution in #683
- @omahs made their first contribution in #702
Full Changelog: v0.8.0...v0.9.0
v0.8.0
What's new
Tasks
- LiveCodeBench by @plaguss in #548, #587, #518
- GPQA diamond by @lewtun in #534
- Humanity's last exam by @clefourrier in #520
- Olympiad Bench by @NathanHB in #521
- aime24, 25 and math500 by @NathanHB in #586
- french models Evals by @mdiazmel in #505
Metrics
- Pass@k by @clefourrier in #519
- Extractive Match metric by @hynky1999 in #495, #503, #522, #535
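Pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch, assuming that formulation (not necessarily lighteval's exact code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn without replacement from n generations
    is correct, given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # necessarily contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 correct generations out of 4, `pass_at_k(4, 2, 1)` gives 0.5, matching the intuitive per-sample success rate.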
Features
Better logging
- log model config by @NathanHB in #627
- Support custom results/details push to hub by @albertvillanova in #457
- Push details without converting fields to str by @NathanHB in #572
Inference providers
Load details to be evaluated
- Implemented the possibility to load predictions from details files and continue evaluating from there by @JoelNiklaus in #488
sglang support
Bug Fixes and refacto
- Tiny improvements to `endpoint_model.py`, `base_model.py`, ... by @sadra-barikbin in #219
- Update README.md by @NathanHB in #486
- Fix issue with encodings for together models. by @JoelNiklaus in #483
- Made litellm judge backend more robust. by @JoelNiklaus in #485
- Fix `T_co` import bug by @gucci-j in #484
- fix README link by @vxw3t8fhjsdkghvbdifuk in #500
- Fixed issue with o1 in litellm. by @JoelNiklaus in #493
- Hotfix for litellm judge by @JoelNiklaus in #490
- Made judge response processing more robust. by @JoelNiklaus in #491
- VLLM: Allows for max tokens to be set in model config file by @NathanHB in #547
- Bump up the latex2sympy2_extended version + more tests by @hynky1999 in #510
- Fixed bug of import url_to_fs from fsspec by @LoserCheems in #507
- Fix Ukrainian indices and confirmation word by @ayukh in #516
- Fix VLLM data-parallel by @hynky1999 in #541
- relax spacy import to relax dep by @clefourrier in #622
- vllm fix sampling params by @NathanHB in #625
- relax deps for tgi by @NathanHB in #626
- Bug fix extractive match by @hynky1999 in #540
- Fix loading of vllm model from files by @NathanHB in #533
- fix: broken URLs by @deep-diver in #550
- typo(vllm): `gpu_memory_utilisation` typo by @tpoisonooo in #553
- allows better flexibility for litellm endpoints by @NathanHB in #549
- Translate task template to Catalan and Galician and fix typos by @mariagrandury in #506
- Relax upper bound on torch by @lewtun in #508
- Fix vLLM generation with sampling params by @lewtun in #578
- Make BLEURT lazy by @hynky1999 in #536
- Fixing backend error in main_sglang. by @TankNee in #597
- VLLM + Math-Verify fixes by @hynky1999 in #603
- raise exception when generation size is more than model length by @NathanHB in #571
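Several of the fixes above (extractive match, Math-Verify) revolve around pulling a final answer out of generated text. As a toy, hypothetical sketch of the idea (the real extraction logic in lighteval and Math-Verify is far more robust):

```python
import re

def extract_final_answer(text):
    """Toy regex-based extraction of a final numeric answer from model
    output. Illustrative only; real extractive-match logic handles LaTeX,
    units, fractions, and many more phrasings."""
    match = re.search(
        r"answer is\s*:?\s*(-?\d+(?:\.\d+)?)", text, re.IGNORECASE
    )
    return match.group(1) if match else None
```

The hard part, and the source of several bug fixes in this release, is covering the long tail of formats models actually produce.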
Thanks
Huge thanks to Hyneck, Lewis, Ben, Agustín, Elie and everyone helping and giving feedback 💙
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Extractive Match metric (#495)
- Fix math extraction (#503)
- Bump up the latex2sympy2_extended version + more tests (#510)
- Math extraction - allow only trying the first match, more customizable latex extraction + bump deps (#522)
- add missing inits (#524)
- Sync Math-verify (#535)
- Make BLEURT lazy (#536)
- Bug fix extractive match (#540)
- Fix VLLM data-parallel (#541)
- VLLM + Math-Verify fixes (#603)
- @plaguss
- @Jayon02
- Let lighteval support sglang (#552)
- @NathanHB
- adds olympiad bench (#521)
- Fix loading of vllm model from files (#533)
- [VLLM] Allows for max tokens to be set in model config file (#547)
- allows better flexibility for litellm endpoints (#549)
- raise exception when generation size is more than model length (#571)
- Push details without converting fields to str (#572)
- adds aime24, 25 and math500 (#586)
- adds inference providers support (#616)
- vllm fix sampling params (#625)
- relax deps for tgi (#626)
- log model config (#627)
v0.7.0
What's New
New Tasks
- added musr by @clefourrier in #375
- Adds Global MMLU by @hynky1999 in #426
- Add new Arabic benchmarks (5) and enhance existing tasks by @alielfilali01 in #372
New Features
- Evaluate a model already loaded in memory for training / evaluation loop by @clefourrier in #390
- Allowing a single prompt to use several formats for one eval by @clefourrier in #398
- Autoscaling inference endpoints hardware by @clefourrier in #412
- CLI new look and features (using typer) by @NathanHB in #407
- Better Looking and more functional logging by @NathanHB in #415
- Add litellm backend by @JoelNiklaus in #385
More Translation Literals by the Community
- add bashkir variants by @AigizK in #374
- add Shan (shn) translation literals by @NoerNova in #376
- Add Udmurt (udm) translation literals by @codemurt in #381
- This PR adds translation literals for Belarusian language. by @Kryuski in #382
- added tatar literals by @gaydmi in #383
New Doc
- Add doc-builder doc-pr-upload GH Action by @albertvillanova in #411
- Set up docs by @albertvillanova in #403
- Add docstring docs by @albertvillanova in #413
- Add missing models to docs by @albertvillanova in #419
- Update docs about inference endpoints by @albertvillanova in #432
- Upgrade deprecated GH Action cache@v2 by @albertvillanova in #456
- Add EvaluationTracker to docs and fix its docstring by @albertvillanova in #464
- Checkout PR merge commit for CI tests by @albertvillanova in #468
Bug Fixes and Refacto
- Allow AdapterModels to have custom tokens by @mapmeld in #306
- Homogeneize generation params by @clefourrier in #428
- fix: cache directory variable by @NazimHAli in #378
- Add trufflehog secrets detection by @albertvillanova in #429
- greedy_until() fix by @vsabolcec in #344
- Fixes a TypeError for generative metrics. by @JoelNiklaus in #386
- Speed up Bootstrapping Computation by @JoelNiklaus in #409
- Fix imports from model_config by @albertvillanova in #443
- Fix wrong instructions and code for custom tasks by @albertvillanova in #450
- Fix minor typos by @albertvillanova in #449
- fix model parallel by @NathanHB in #481
- add configs with their models by @clefourrier in #421
- Fixes a TypeError in Sacrebleu. by @JoelNiklaus in #387
- fix ukr/rus by @hynky1999 in #394
- fix repeated cleanup by @anton-l in #399
- Update instance type/size in endpoint model_config example by @albertvillanova in #401
- Considering the case empty request list is given to base model by @sadra-barikbin in #250
- Fix a tiny bug in `PromptManager::FewShotSampler::_init_fewshot_sampling_random` by @sadra-barikbin in #423
- Fix splitting for generative tasks by @NathanHB in #400
- Fixes an error with getting the golds from the formatted_docs. by @JoelNiklaus in #388
- Fix ignored reuse_existing in config file by @albertvillanova in #431
- Deprecate Obsolete Config Properties by @ParagEkbote in #433
- fix: LightevalTaskConfig.stop_sequence attribute by @ryan-minato in #463
- fix: scorer attribute initialization in ROUGE by @ryan-minato in #471
- Delete endpoint on InferenceEndpointTimeoutError by @albertvillanova in #475
- Remove unnecessary deepcopy in evaluation_tracker by @albertvillanova in #459
- fix: CACHE_DIR Default Value in Accelerate Pipeline by @ryan-minato in #461
- Fix warning about precedence of custom tasks over default ones in registry by @albertvillanova in #466
- Implement TGI model config from path by @albertvillanova in #448
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @clefourrier
- added musr (#375)
- Update README.md
- Use the programmatic interface using an already in memory loaded model (#390)
- Pr sadra (#393)
- Allowing a single prompt to use several formats for one eval (#398)
- Autoscaling inference endpoints (#412)
- add configs with their models (#421)
- Fix custom arabic tasks (#440)
- Adds serverless endpoints back (#445)
- Homogeneize generation params (#428)
- @JoelNiklaus
- @albertvillanova
- Update instance type/size in endpoint model_config example (#401)
- Typo in feature-request.md (#406)
- Add doc-builder doc-pr-upload GH Action (#411)
- Set up docs (#403)
- Add docstring docs (#413)
- Add missing models to docs (#419)
- Add trufflehog secrets detection (#429)
- Update docs about inference endpoints (#432)
- Fix ignored reuse_existing in config file (#431)
- Test inference endpoint model config parsing from path (#434)
- Fix imports from model_config (#443)
- Fix wrong instructions and code for custom tasks (#450)
- Fix minor typos (#449)
- Implement TGI model config from path (#448)
- Upgrade deprecated GH Action cache@v2 (#456)
- Add EvaluationTracker to docs and fix its docstring (#464)
- Remove unnecessary deepcopy in evaluation_tracker (#459)
- Fix warning about precedence of custom tasks over default ones in registry (#466)
- Checkout PR merge commit for CI tests (#468)
- Delete endpoint on InferenceEndpointTimeoutError (#475)
- @NathanHB
- @ParagEkbote
- Deprecate Obsolete Config Properties (#433)
- @alielfilali01
v0.6.0
What's New
Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.
- Add 3 NLI tasks supporting 26 unique languages. #329 by @hynky1999
- Add 3 COPA tasks supporting about 20 unique languages. #330 by @hynky1999
- Add Hellaswag tasks supporting about 36 unique languages. #332 by @hynky1999
  - mlmm_hellaswag
  - hellaswag_{tha/tur}
- Add RC tasks supporting about 130 unique languages/scripts. #333 by @hynky1999
- Add GK tasks supporting about 35 unique languages/scripts. #338 by @hynky1999
  - meta_mmlu
  - mlmm_mmlu
  - rummlu
  - mmlu_ara_mcf
  - tur_leaderboard_mmlu
  - cmmlu
  - mmlu
  - ceval
  - mlmm_arc_challenge
  - alghafa_arc_easy
  - community_arc
  - community_truthfulqa
  - exams
  - m3exams
  - thai_exams
  - xcsqa
  - alghafa_piqa
  - mera_openbookqa
  - alghafa_openbookqa
  - alghafa_sciqa
  - mathlogic_qa
  - agieval
  - mera_worldtree
- Misc Tasks #339 by @hynky1999
  - openai_mmlu_tasks
  - turkish_mmlu_tasks
  - lumi arc
  - hindi/swahili/arabic (from alghafa) arc
  - cmath
  - mgsm
  - xcodah
  - xstory
  - xwinograd + tr winograd
  - mlqa
  - mkqa
  - mintaka
  - mlqa_tasks
  - french triviaqa
  - chegeka
  - acva
  - french_boolq
  - hindi_boolq
- Serbian LLM Benchmark Task by @DeanChugall in #340
- iroko bench by @hynky1999 in #357
Other Tasks
Features
- Now Evaluate OpenAI models by @NathanHB in #359
- New Doc and README by @NathanHB in #327
- Refacto LLM as A Judge by @NathanHB in #337
- Selecting tasks using their superset by @hynky1999 in #308
- Nicer output on task search failure by @hynky1999 in #357
- Adds tasks templating by @hynky1999 in #335
- Support for multilingual generative metrics by @hynky1999 in #293
- Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
- Translation literals by @hynky1999 in #356
Bug Fixes
- Math normalization: do not crash on invalid format by @guipenedo in #331
- Skipping push to hub test by @clefourrier in #334
- Fix Metrics import path in community task template file. by @chuandudx in #309
- Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
- Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
- Fix the dataset loading for custom tasks by @clefourrier in #364
- Fix: missing property tag in inference endpoints by @clefourrier in #368
- Fix Tokenization + misc fixes by @hynky1999 in #354
- Fix BLEURT evaluation errors by @chuandudx in #316
- Adds Baseline workflow + fixes by @hynky1999 in #363
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Support for multilingual generative metrics (#293)
- Adds tasks templating (#335)
- Multilingual NLI Tasks (#329)
- Multilingual COPA tasks (#330)
- Multilingual Hellaswag tasks (#332)
- Multilingual Reading Comprehension tasks (#333)
- Multilingual General Knowledge tasks (#338)
- Selecting tasks using their superset (#308)
- Fix Tokenization + misc fixes (#354)
- Misc-multilingual tasks (#339)
- add iroko bench + nicer output on task search failure (#357)
- Translation literals (#356)
- selected tasks for multilingual evaluation (#371)
- Adds Baseline workflow + fixes (#363)
- @DeanChugall
- Serbian LLM Benchmark Task (#340)
- @NathanHB
New Contributors
- @chuandudx made their first contribution in #323
- @edbeeching made their first contribution in #343
- @DeanChugall made their first contribution in #340
- @Stopwolf made their first contribution in #225
- @martinscooper made their first contribution in #366
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's new
Features
- Tokenization-wise encoding by @hynky1999 in #287
- Task config by @hynky1999 in #289
Bug fixes
v0.4.0
What's new
Features
- Adds vllm as backend for insane speed up by @NathanHB in #274
- Add llm_as_judge in metrics (using both OpenAI or Transformers) by @NathanHB in #146
- Able to use config files for models by @clefourrier in #131
- List available tasks in the cli: `lighteval tasks --list` by @DimbyTa in #142
- Use torch compile for speed up by @clefourrier in #248
- Add maj@k metric by @clefourrier in #158
- Adds a dummy/random model for baseline init by @guipenedo in #220
- lighteval is now a cli tool: `lighteval --args` by @NathanHB in #152
- We can now log info from the metrics (for example input and response from llm_as_judge) by @NathanHB in #157
- Configurable task versioning by @PhilipMay in #181
- Programmatic interface by @clefourrier in #269
- Probability Metric + New Normalization by @hynky1999 in #276
- Add widgets to the README by @clefourrier in #145
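The maj@k metric added above scores the majority vote over k sampled answers. A minimal, hypothetical sketch of the idea (not lighteval's actual implementation):

```python
from collections import Counter

def maj_at_k(answers, gold):
    """Toy maj@k: take the most frequent of the k sampled answers and
    score that single answer against the gold answer."""
    majority, _count = Counter(answers).most_common(1)[0]
    return float(majority == gold)
```

Majority voting trades extra sampling compute for robustness: a single bad sample no longer decides the score.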
New tasks
- Add `Ger-RAG-eval` tasks by @PhilipMay in #149
- adding `aimo` custom eval by @NathanHB in #154
Fixes
- Bump nltk to 3.9.1 to fix security issue by @NathanHB in #137
- Fix max_length type when being passed in model args by @csarron in #138
- Fix nanotron models input size bug by @clefourrier in #156
- Fix MATH normalization by @lewtun in #162
- fix Prompt function names by @clefourrier in #168
- Fix prompt format german rag community task by @jphme in #171
- add 'cite as' section in readme by @NathanHB in #178
- Fix broken link to extended tasks in README by @alexrs in #182
- Mention HF_TOKEN in readme by @Wauplin in #194
- Download BERT scorer lazily by @sadra-barikbin in #190
- Updated tgi_model and added parameters for endpoint_model by @shaltielshmid in #208
- fix llm as judge warnings by @NathanHB in #173
- ADD GPT-4 as Judge by @philschmid in #206
- Fix a few typos and do a tiny refactor by @sadra-barikbin in #187
- Avoid truncating the outputs based on string lengths by @anton-l in #201
- Now only uses functions for prompt definition by @clefourrier in #213
- Data split depending on eval params by @clefourrier in #169
- should fix most inference endpoints issues of version config by @clefourrier in #226
- Fix _init_max_length in base_model.py by @gucci-j in #185
- Make evaluator invariant of input request type order by @sadra-barikbin in #215
- Fixing issues with multichoice_continuations_start_space - was not parsed properly by @clefourrier in #232
- Fix IFEval metric by @lewtun in #259
- change priority when choosing model dtype by @NathanHB in #263
- Add grammar option to generation by @sadra-barikbin in #242
- make info loggers dataclass, so that their properties have expected lifetime by @hynky1999 in #280
- Remove expensive prediction run during test collection by @hynky1999 in #279
- Example Configs and Docs by @RohitMidha23 in #255
- Refactoring the few shot management by @clefourrier in #272
- Standalone nanotron config by @hynky1999 in #285
- Logging Revamp by @hynky1999 in #284
- bump nltk version by @NathanHB in #290
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @NathanHB
- commit (#137)
- Add llm as judge in metrics (#146)
- Nathan add logging to metrics (#157)
- add 'cite as' section in readme (#178)
- Fix citation section in readme (#180)
- adding aimo custom eval (#154)
- fix llm as judge warnings (#173)
- launch lighteval using `lighteval --args` (#152)
- adds llm as judge using transformers (#223)
- Fix missing json file (#264)
- change priority when choosing model dtype (#263)
- fix the location of tasks list in the readme (#267)
- updates ifeval repo (#268)
- fix nanotron (#283)
- add vllm backend (#274)
- bump nltk version (#290)
- @clefourrier
- Add config files for models (#131)
- Add fun widgets to the README (#145)
- Fix nanotron models input size bug (#156)
- no function we actually use should be named prompt_fn (#168)
- Add maj@k metric (#158)
- Homogeneize logging system (#150)
- Use only dataclasses for task init (#212)
- Now only uses functions for prompt definition (#213)
- Data split depending on eval params (#169)
- should fix most inference endpoints issues of version config (#226)
- Add metrics as functions (#214)
- Quantization related issues (#224)
- Update issue templates (#235)
- remove latex writer since we don't use it (#231)
- Removes default bert scorer init (#234)
- fix (#233)
- updated piqa (#222)
- uses torch compile if provided (#248)
- Fix inference endpoint config (#244)
- Expose samples via the CLI (#228)
- Fixing issues with multichoice_continuations_start_space - was not parsed properly (#232)
- Programmatic interface + cleaner management of requests (#269)
- Small file reorg (only renames/moves) (#271)
- Refactoring the few shot management (#272)
- @PhilipMay
- @shaltielshmid
- @hynky1999
v0.3.0
Release Note
This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
New tasks are also introduced:
- Big Bench Hard: https://huggingface.co/papers/2210.09261
- AGIEval: https://huggingface.co/papers/2304.06364
- TinyBench
- MT Bench: https://huggingface.co/papers/2306.05685
- AlGhafa Benchmarking Suite: https://aclanthology.org/2023.arabicnlp-1.21/
MT-Bench marks the introduction of multi-turn prompting as well as llm-as-a-judge metric.
New tasks
- Add BBH by @clefourrier in #7, @bilgehanertan in #126
- Add AGIEval by @clefourrier in #121
- Adding TinyBench by @clefourrier in #104
- Adding support for Arabic benchmarks : AlGhafa benchmarking suite by @alielfilali01 in #95
- Add mt-bench by @NathanHB in #75
Features
- Extended Tasks ! by @clefourrier in #101, @lewtun in #108, @NathanHB in #122, #123
- Added support for launching inference endpoint with different model dtypes by @shaltielshmid in #124
Documentation
- Adding LICENSE by @clefourrier in #86, @NathanHB in #89
- Make it clearer in the README that the leaderboard uses the harness by @clefourrier in #94
Small patches
- Update huggingface-hub for compatibility with datasets 2.18 by @clefourrier in #84
- Tidy up dependency groups by @lewtun in #81
- bump git python by @NathanHB in #90
- Sets a max length for the MATH task by @clefourrier in #83
- Fix parallel data processing bug by @clefourrier in #92
- Change the eos condition for GSM8K by @clefourrier in #85
- Fixing rolling loglikelihood management by @clefourrier in #78
- Fixes input length management for generative evals by @clefourrier in #103
- Reorder addition of instruction in chat template by @clefourrier in #111
- Ensure chat models terminate generation with EOS token by @lewtun in #115
- Fix push details to hub by @NathanHB in #98
- Small fixes to InferenceEndpointModel by @shaltielshmid in #112
- Fix import typo autogptq by @clefourrier in #116
- Fixed the loglikelihood method in inference endpoints models by @clefourrier in #119
- Fix TextGenerationResponse import from hfh by @Wauplin in #129
- Do not use deprecated list_files_info by @Wauplin in #133
- Update test workflow name to 'Tests' by @Wauplin in #134
New Contributors
- @shaltielshmid made their first contribution in #112
- @bilgehanertan made their first contribution in #126
- @Wauplin made their first contribution in #129
Full Changelog: v0.2.0...v0.3.0