Releases: huggingface/lighteval
v0.12.0
An exciting release in which we pivot to using inspect-ai as our backend and make tasks much easier to find and add, thanks to a task finder Space.
New Features 🎉
- Registry refactorisation by @clefourrier in #937
- Multilingual extractiveness by @rolshoven in #956
- Added `backend_options` parameter to LLM judges by @rolshoven in #963
- Add automatic tests for metrics by @NathanHB in #939
- Support local GGUF in VLLM and use HF tokenizer #943 by @JIElite in #972
- [RFC] Rework the dependencies to be more versatile by @LysandreJik in #951
- Sample to sample compare for integration tests by @NathanHB in #977
- Move tasks to individual files by @NathanHB in #1016
- Adds inspectai by @NathanHB in #1022
New Tasks
- GSM-PLUS by @NathanHB in #780
- TUMLU-mini by @ceferisbarov in #811
- Filipino Benchmark by @ljvmiranda921 in #852
- MMLU Redux by @clefourrier in #883
- IFBench by @clefourrier in #944
- SLR-Bench by @Ahmad21Omar in #983
- MMLU pro by @NathanHB in #1031
Enhancement ⚙️
- adds `enable_prefix_caching` option to VLLMModelConfig by @GAD-cell in #945
- Added litellm model config options and improved `_prepare_max_new_tokens` by @rolshoven in #967
- always provide parameters in the metric name to allow using several combinations by @clefourrier in #1017
Documentation 📚
- Add org_to_bill parameter to documentation by @tfrere in #781
- Update docs and enforces google's docstring style by @NathanHB in #941
- Fix broken link by @JoelNiklaus in #1014
- Update huggingface-cli login to use newer hf auth login by @Xceron in #1034
Task and Metrics changes 🛠️
- Add Bulgarian and Macedonian literals by @dianaonutu in #769
- Add TranslationLiterals for Language.DANISH by @spyysalo in #770
- Update translation_literals.py with icelandic by @joenaess in #775
- Complete TranslationLiterals for Language.ESTONIAN by @spyysalo in #779
- Update translation_literals.py by @dianaonutu in #923
- Fixing naming for sample evals + adding reqs in aime24 by @clefourrier in #989
- add translation literals for various Indic languages (Bengali, Gujarati, Punjabi, Tamil) by @rpm000 in #1015
Bug Fixes 🐛
- [#794] Fix: Assign SummaCZS instance to `self.summac` in Faithfulness metric by @sahilds1 in #795
- Catch ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 in #812
- Fix GPQA and index extractive metric by @clefourrier in #829
- Update extractive_match_utils.py for words where `:` is preceded by a space by @clefourrier in #831
- fixes from_model function and adds tests by @NathanHB in #921
- fix tasks list by @alielfilali01 in #906
- set upper bound on vllm version by @NathanHB in #964
- Fixed bug that prevented the metrics from being mixed (batched/not batched) by @rolshoven in #958
- Fix inference providers calls by @clefourrier in #1012
- Fixing mixeval by @clefourrier in #1006
- Fix typo in attribute name: CONCURENT_CALLS -> CONCURRENT_CALLS by @muupan in #884
- Added ability to configure concurrent_requests in litellm_model.py by @dameikle in #911
- Added fallback for incomplete configs for vlm models launched as llms by @clefourrier in #828
New Contributors
- @pratyushmaini made their first contribution in #697
- @DeVikingMark made their first contribution in #782
- @sahilds1 made their first contribution in #795
- @dianaonutu made their first contribution in #769
- @tfrere made their first contribution in #781
- @mcleish7 made their first contribution in #812
- @leopardracer made their first contribution in #810
- @spyysalo made their first contribution in #770
- @ceferisbarov made their first contribution in #811
- @joenaess made their first contribution in #775
- @ryantzr1 made their first contribution in #784
- @dtung8068 made their first contribution in #862
- @muupan made their first contribution in #884
- @NouamaneTazi made their first contribution in #841
- @uralik made their first contribution in #887
- @dameikle made their first contribution in #911
- @ljvmiranda921 made their first contribution in #852
- @cpcdoy made their first contribution in #502
- @rolshoven made their first contribution in #958
- @JIElite made their first contribution in #972
- @LysandreJik made their first contribution in #951
- @GAD-cell made their first contribution in #945
- @amstu2 made their first contribution in #986
- @Ahmad21Omar made their first contribution in #983
- @cmpatino made their first contribution in #998
- @rpm000 made their first contribution in #1015
- @Xceron made their first contribution in #1034
Full Changelog: v0.10.0...v0.12.0
v0.11.0
Lighteval v0.11.0
This release introduces major improvements and changes, across usability, stability, performance and documentation.
Highlights include a large refactor to simplify the architecture, automated metric tests, a dependency rework, improved documentation, and new tasks/benchmarks.
Highlights
- Automated tests for metrics and stronger dependency checks
- Continuous batching, caching, and faster CLI with reduced redundancy
- Upgrade to datasets 4.0 and Trackio integration
- Automatic chat template inference and reasoning trace support
- New tasks: GSM-PLUS, TUMLU-mini, IFBench, Filipino benchmarks, MMLU Redux
- Added Bulgarian, Macedonian, Danish, Icelandic, and Estonian literals
- Documentation improvements (Google docstring style, README updates)
What's Changed
New Features
- Automatic inference of chat template usage (no kwargs needed) by @clefourrier (#885)
- More versatile dependency rework by @LysandreJik (#951)
- Automatic tests for metrics by @NathanHB (#939)
- Sample-to-sample comparisons for integration tests by @NathanHB (#977)
- Continuous batching support by @NathanHB (#850) (arthur)
- Refactored code and removed unused parts by @NathanHB (#709)
- Post-processing for reasoning tokens in pipeline by @clefourrier (#882)
- logging of system prompt by @clefourrier (#907)
- Adds Caching of samples by @clefourrier (#909)
- Upgrade to `datasets` 4.0 by @NathanHB (#924)
- Trackio integration when available by @NathanHB (#930)
- Parameterization of sampling evals from CLI by @clefourrier (#926)
- Local GGUF support in VLLM with HF tokenizer by @JIElite (#972)
Enhancement
- `bootstrap_iters` as an argument by @pratyushmaini (#697)
- Load tasks before models by @clefourrier (#931)
- Save `reasoning_content` from litellm as details by @muupan (#929)
- Fix for TGI endpoint inference and JSON grammar generation by @cpcdoy (#502)
- Reduced redundancy in CLI arguments by @NathanHB (#932)
- Registry refactor by @clefourrier (#937)
- Multilingual extractiveness support by @rolshoven (#956)
- Added `backend_options` parameter to LLM judges by @rolshoven (#963)
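The `bootstrap_iters` argument exposed above controls how many bootstrap resamples are used when estimating a metric's standard error. As a minimal, hypothetical sketch of the idea (not lighteval's actual implementation):

```python
import random
import statistics

def bootstrap_stderr(scores, bootstrap_iters=1000, seed=0):
    """Estimate the standard error of the mean score by bootstrap resampling.

    Hypothetical sketch: resample the per-sample scores with replacement
    `bootstrap_iters` times and take the spread of the resampled means.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = [
        statistics.fmean(rng.choices(scores, k=n))  # one bootstrap resample
        for _ in range(bootstrap_iters)
    ]
    return statistics.stdev(means)
```

Raising `bootstrap_iters` tightens the estimate at the cost of compute, which is why exposing it as an argument is useful.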
Documentation
- Added `org_to_bill` parameter by @tfrere (#781)
- Updated docs with Google docstring style by @NathanHB (#941)
- Updated README by @NathanHB (#961)
New Tasks
- Added GSM-PLUS by @NathanHB (#780)
- Added TUMLU-mini benchmark, fixed #577 by @ceferisbarov (#811)
- Added Filipino benchmark community tasks by @ljvmiranda921 (#852)
- MMLU Redux and caching fix by @clefourrier (#883)
- Added IFBench by @clefourrier (#944)
Task and Metrics Changes
- Added Bulgarian and Macedonian literals by @dianaonutu (#769)
- Added Danish translation literals by @spyysalo (#770)
- Added Icelandic translation literals by @joenaess (#775)
- Completed Estonian translation literals by @spyysalo (#779)
- Updated `translation_literals.py` by @dianaonutu (#923)
Bug Fixes
- Fixed [#794]: assigned `SummaCZS` instance in Faithfulness metric by @sahilds1 (#795)
- Caught ROCM/HIP/AMD OOM in `should_reduce_batch_size` by @mcleish7 (#812)
- Fixed GPQA and index extractive metric by @clefourrier (#829)
- Updated `extractive_match_utils.py` for cases with `:` by @clefourrier (#831)
- Fixed `from_model` function and added tests by @NathanHB (#921)
- Fixed tasks list by @alielfilali01 (#906)
- Set upper bound on VLLM version by @NathanHB (#964)
- Fixed batching bug in metrics by @rolshoven (#958)
Other Changes
- Fixed typo in attribute name (`CONCURENT_CALLS` → `CONCURRENT_CALLS`) by @muupan (#884)
- Added ability to configure `concurrent_requests` in `litellm_model.py` by @dameikle (#911)
New Contributors
We’re excited to welcome new contributors in this release:
@pratyushmaini, @DeVikingMark, @sahilds1, @dianaonutu, @tfrere, @mcleish7, @leopardracer, @spyysalo, @ceferisbarov, @joenaess, @ryantzr1, @dtung8068, @muupan, @NouamaneTazi, @uralik, @dameikle, @ljvmiranda921, @cpcdoy, @rolshoven, @JIElite, @LysandreJik
Full Changelog: v0.10.0...v0.11.0
v0.10.0
We now support VLMs when using the transformers backend 🥳
What's Changed
New Features 🎉
- Added support for quantization in vLLM backend by @SulRash in #690
- Adds multimodal support and MMMU pro by @NathanHB in #675
- Allow for model kwargs when loading transformers from pretrained by @NathanHB in #754
- Adds template for custom path saving results by @NathanHB in #755
- Nanotron, Multilingual tasks update + misc by @hynky1999 in #756
- Async vllm by @clefourrier in #693
New Tasks
- Adds More Generative tasks by @hynky1999 in #694
- Added Flores by @clefourrier in #717
Task and Metrics changes 🛠️
- Nanotron, Multilingual tasks update + misc by @hynky1999 in #756
- add livecodebench v6 by @Cppowboy in #712
- Add MCQ support to Yourbench evaluation by @alozowski in #734
Other Changes
- Bump ruff version by @NathanHB in #774
- Fix revision arg for vLLM tokenizer by @lewtun in #721
- Update README.md by @clefourrier in #733
- Fix litellm by @NathanHB in #736
New Contributors
- @Cppowboy made their first contribution in #712
- @SulRash made their first contribution in #690
- @Abelgurung made their first contribution in #743
Full Changelog: v0.9.2...v0.10.0
v0.9.2
What's Changed
New Features 🎉
- enable together models and reasoning models as judges. by @JoelNiklaus in #537
- Propagate vLLM batch size controls by @alvin319 in #588
- Integrate huggingface_hub inference support for LLM as Judge by @alozowski in #651
- add cot_prompt in vllm by @HERIUN in #654
- Unify modelargs and use Pydantic for model configs by @NathanHB in #609
- Improve test by @qubvel in #674
- adds wandb logging of metrics by @NathanHB in #676
- Adds wandb logging by @NathanHB in #685
- Added custom model inference. by @JoelNiklaus in #437
- Update split iteration for DynamicBatchingDataset by @qubvel in #684
Documentation 📚
- Add --use-chat-template to the broken litellm example by @eldarkurtic in #614
- Lighteval math by @HERIUN in #630
- Update quicktour command by @qubvel in #679
- fix wrong 'custom_task_directory' in python api doc by @xgwang in #671
- docs: improve consistency in punctuation of metric list by @mariagrandury in #605
New Tasks 📈
- add arc agi 2 by @NathanHB in #642
- Add G-Pass@k Metric by @jnanliu in #589
- adds simpleqa by @NathanHB in #680
Task and Metrics changes 🛠️
- Pass At K Math by @clefourrier in #647
- Use `n=16` samples to estimate `pass@1` for AIME benchmarks by @lewtun in #661
- adding uzbek literals by @shopulatov in #664
- Align AIME pass@1 with literature by @lewtun in #666
- Update LCB prompt & fix newlines by @rawsh in #645
- fix gsm8k metric by @NathanHB in #688
- Add pass@1 for GPQA-D and MATH-500 by @lewtun in #698
Bug Fixes 🐛
- Use `bfloat16` as default for vllm models by @NathanHB in #638
- Fix passing of generation config to main_accelerate by @LoserCheems in #659
- Parse seed for vLLM by @eldarkurtic in #602
- Parse string values for add_special_tokens in vLLM by @eldarkurtic in #598
- hardcode configs to not make lighteval crash if lcb repo unavailable by @NathanHB in #677
- tokenizer 'padding' param is not correct. by @xgwang in #669
- Fix TransformersModel.from_model() method by @Vectorrent in #691
- Inference providers by @clefourrier in #701
New Contributors
- @DerekLiu35 made their first contribution in #620
- @AnikiFan made their first contribution in #610
- @alvin319 made their first contribution in #588
- @alozowski made their first contribution in #643
- @Laz4rz made their first contribution in #613
- @shopulatov made their first contribution in #664
- @HERIUN made their first contribution in #654
- @rawsh made their first contribution in #645
- @qubvel made their first contribution in #674
- @xgwang made their first contribution in #669
- @jnanliu made their first contribution in #589
- @Vectorrent made their first contribution in #683
- @omahs made their first contribution in #702
Full Changelog: v0.8.0...v0.9.0
v0.8.0
What's new
Tasks
- LiveCodeBench by @plaguss in #548, #587, #518
- GPQA diamond by @lewtun in #534
- Humanity's last exam by @clefourrier in #520
- Olympiad Bench by @NathanHB in #521
- aime24, 25 and math500 by @NathanHB in #586
- french models Evals by @mdiazmel in #505
Metrics
- Pass@k by @clefourrier in #519
- Extractive Match metric by @hynky1999 in #495, #503, #522, #535
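Pass@k is conventionally computed with the unbiased estimator from the Codex paper (Chen et al., 2021). A minimal sketch, assuming that formulation (not necessarily lighteval's exact code):

```python
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimator (Chen et al., 2021): the probability that
    at least one of k samples drawn without replacement from n generations
    is correct, given that c of the n generations are correct."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # necessarily contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 2 correct generations out of 4, `pass_at_k(4, 2, 1)` gives 0.5, matching the intuitive per-sample success rate.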
Features
Better logging
- log model config by @NathanHB in #627
- Support custom results/details push to hub by @albertvillanova in #457
- Push details without converting fields to str by @NathanHB in #572
Inference providers
Load details to be evaluated
- Implemented the possibility to load predictions from details files and continue evaluating from there by @JoelNiklaus in #488
sglang support
Bug Fixes and refacto
- Tiny improvements to `endpoint_model.py`, `base_model.py`, ... by @sadra-barikbin in #219
- Update README.md by @NathanHB in #486
- Fix issue with encodings for together models. by @JoelNiklaus in #483
- Made litellm judge backend more robust. by @JoelNiklaus in #485
- Fix `T_co` import bug by @gucci-j in #484
- fix README link by @vxw3t8fhjsdkghvbdifuk in #500
- Fixed issue with o1 in litellm. by @JoelNiklaus in #493
- Hotfix for litellm judge by @JoelNiklaus in #490
- Made judge response processing more robust. by @JoelNiklaus in #491
- VLLM: Allows for max tokens to be set in model config file by @NathanHB in #547
- Bump up the latex2sympy2_extended version + more tests by @hynky1999 in #510
- Fixed bug of import url_to_fs from fsspec by @LoserCheems in #507
- Fix Ukrainian indices and confirmation word by @ayukh in #516
- Fix VLLM data-parallel by @hynky1999 in #541
- relax spacy import to relax dep by @clefourrier in #622
- vllm fix sampling params by @NathanHB in #625
- relax deps for tgi by @NathanHB in #626
- Bug fix extractive match by @hynky1999 in #540
- Fix loading of vllm model from files by @NathanHB in #533
- fix: broken URLs by @deep-diver in #550
- typo(vllm): `gpu_memory_utilisation` typo by @tpoisonooo in #553
- allows better flexibility for litellm endpoints by @NathanHB in #549
- Translate task template to Catalan and Galician and fix typos by @mariagrandury in #506
- Relax upper bound on torch by @lewtun in #508
- Fix vLLM generation with sampling params by @lewtun in #578
- Make BLEURT lazy by @hynky1999 in #536
- Fixing backend error in main_sglang. by @TankNee in #597
- VLLM + Math-Verify fixes by @hynky1999 in #603
- raise exception when generation size is more than model length by @NathanHB in #571
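Several of the fixes above (extractive match, Math-Verify) revolve around pulling a final answer out of generated text. As a toy, hypothetical sketch of the idea (the real extraction logic in lighteval and Math-Verify is far more robust):

```python
import re

def extract_final_answer(text):
    """Toy regex-based extraction of a final numeric answer from model
    output. Illustrative only; real extractive-match logic handles LaTeX,
    units, fractions, and many more phrasings."""
    match = re.search(
        r"answer is\s*:?\s*(-?\d+(?:\.\d+)?)", text, re.IGNORECASE
    )
    return match.group(1) if match else None
```

The hard part, and the source of several bug fixes in this release, is covering the long tail of formats models actually produce.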
Thanks
Huge thanks to Hyneck, Lewis, Ben, Agustín, Elie and everyone helping and giving feedback 💙
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Extractive Match metric (#495)
- Fix math extraction (#503)
- Bump up the latex2sympy2_extended version + more tests (#510)
- Math extraction - allow only trying the first match, more customizable latex extraction + bump deps (#522)
- add missing inits (#524)
- Sync Math-verify (#535)
- Make BLEURT lazy (#536)
- Bug fix extractive match (#540)
- Fix VLLM data-parallel (#541)
- VLLM + Math-Verify fixes (#603)
- @plaguss
- @Jayon02
- Let lighteval support sglang (#552)
- @NathanHB
- adds olympiad bench (#521)
- Fix loading of vllm model from files (#533)
- [VLLM] Allows for max tokens to be set in model config file (#547)
- allows better flexibility for litellm endpoints (#549)
- raise exception when generation size is more than model length (#571)
- Push details without converting fields to str (#572)
- adds aime24, 25 and math500 (#586)
- adds inference providers support (#616)
- vllm fix sampling params (#625)
- relax deps for tgi (#626)
- log model config (#627)
v0.7.0
What's New
New Tasks
- added musr by @clefourrier in #375
- Adds Global MMLU by @hynky1999 in #426
- Add new Arabic benchmarks (5) and enhance existing tasks by @alielfilali01 in #372
New Features
- Evaluate a model already loaded in memory for training / evaluation loop by @clefourrier in #390
- Allowing a single prompt to use several formats for one eval by @clefourrier in #398
- Autoscaling inference endpoints hardware by @clefourrier in #412
- CLI new look and features (using typer) by @NathanHB in #407
- Better Looking and more functional logging by @NathanHB in #415
- Add litellm backend by @JoelNiklaus in #385
More Translation Literals by the Community
- add bashkir variants by @AigizK in #374
- add Shan (shn) translation literals by @NoerNova in #376
- Add Udmurt (udm) translation literals by @codemurt in #381
- This PR adds translation literals for Belarusian language. by @Kryuski in #382
- added tatar literals by @gaydmi in #383
New Doc
- Add doc-builder doc-pr-upload GH Action by @albertvillanova in #411
- Set up docs by @albertvillanova in #403
- Add docstring docs by @albertvillanova in #413
- Add missing models to docs by @albertvillanova in #419
- Update docs about inference endpoints by @albertvillanova in #432
- Upgrade deprecated GH Action cache@v2 by @albertvillanova in #456
- Add EvaluationTracker to docs and fix its docstring by @albertvillanova in #464
- Checkout PR merge commit for CI tests by @albertvillanova in #468
Bug Fixes and Refacto
- Allow AdapterModels to have custom tokens by @mapmeld in #306
- Homogeneize generation params by @clefourrier in #428
- fix: cache directory variable by @NazimHAli in #378
- Add trufflehog secrets detection by @albertvillanova in #429
- greedy_until() fix by @vsabolcec in #344
- Fixes a TypeError for generative metrics. by @JoelNiklaus in #386
- Speed up Bootstrapping Computation by @JoelNiklaus in #409
- Fix imports from model_config by @albertvillanova in #443
- Fix wrong instructions and code for custom tasks by @albertvillanova in #450
- Fix minor typos by @albertvillanova in #449
- fix model parallel by @NathanHB in #481
- add configs with their models by @clefourrier in #421
- Fixes a TypeError in Sacrebleu. by @JoelNiklaus in #387
- fix ukr/rus by @hynky1999 in #394
- fix repeated cleanup by @anton-l in #399
- Update instance type/size in endpoint model_config example by @albertvillanova in #401
- Considering the case empty request list is given to base model by @sadra-barikbin in #250
- Fix a tiny bug in `PromptManager::FewShotSampler::_init_fewshot_sampling_random` by @sadra-barikbin in #423
- Fix splitting for generative tasks by @NathanHB in #400
- Fixes an error with getting the golds from the formatted_docs. by @JoelNiklaus in #388
- Fix ignored reuse_existing in config file by @albertvillanova in #431
- Deprecate Obsolete Config Properties by @ParagEkbote in #433
- fix: LightevalTaskConfig.stop_sequence attribute by @ryan-minato in #463
- fix: scorer attribute initialization in ROUGE by @ryan-minato in #471
- Delete endpoint on InferenceEndpointTimeoutError by @albertvillanova in #475
- Remove unnecessary deepcopy in evaluation_tracker by @albertvillanova in #459
- fix: CACHE_DIR Default Value in Accelerate Pipeline by @ryan-minato in #461
- Fix warning about precedence of custom tasks over default ones in registry by @albertvillanova in #466
- Implement TGI model config from path by @albertvillanova in #448
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @clefourrier
- added musr (#375)
- Update README.md
- Use the programmatic interface using an already in memory loaded model (#390)
- Pr sadra (#393)
- Allowing a single prompt to use several formats for one eval (#398)
- Autoscaling inference endpoints (#412)
- add configs with their models (#421)
- Fix custom arabic tasks (#440)
- Adds serverless endpoints back (#445)
- Homogeneize generation params (#428)
- @JoelNiklaus
- @albertvillanova
- Update instance type/size in endpoint model_config example (#401)
- Typo in feature-request.md (#406)
- Add doc-builder doc-pr-upload GH Action (#411)
- Set up docs (#403)
- Add docstring docs (#413)
- Add missing models to docs (#419)
- Add trufflehog secrets detection (#429)
- Update docs about inference endpoints (#432)
- Fix ignored reuse_existing in config file (#431)
- Test inference endpoint model config parsing from path (#434)
- Fix imports from model_config (#443)
- Fix wrong instructions and code for custom tasks (#450)
- Fix minor typos (#449)
- Implement TGI model config from path (#448)
- Upgrade deprecated GH Action cache@v2 (#456)
- Add EvaluationTracker to docs and fix its docstring (#464)
- Remove unnecessary deepcopy in evaluation_tracker (#459)
- Fix warning about precedence of custom tasks over default ones in registry (#466)
- Checkout PR merge commit for CI tests (#468)
- Delete endpoint on InferenceEndpointTimeoutError (#475)
- @NathanHB
- @ParagEkbote
- Deprecate Obsolete Config Properties (#433)
- @alielfilali01
v0.6.0
What's New
Lighteval becomes massively multilingual!
We now have extensive coverage in many languages, as well as new templates to manage multilinguality more easily.
- Add 3 NLI tasks supporting 26 unique languages. #329 by @hynky1999
- Add 3 COPA tasks supporting about 20 unique languages. #330 by @hynky1999
- Add Hellaswag tasks supporting about 36 unique languages. #332 by @hynky1999
  - mlmm_hellaswag
  - hellaswag_{tha/tur}
- Add RC tasks supporting about 130 unique languages/scripts. #333 by @hynky1999
- Add GK tasks supporting about 35 unique languages/scripts. #338 by @hynky1999
  - meta_mmlu
  - mlmm_mmlu
  - rummlu
  - mmlu_ara_mcf
  - tur_leaderboard_mmlu
  - cmmlu
  - mmlu
  - ceval
  - mlmm_arc_challenge
  - alghafa_arc_easy
  - community_arc
  - community_truthfulqa
  - exams
  - m3exams
  - thai_exams
  - xcsqa
  - alghafa_piqa
  - mera_openbookqa
  - alghafa_openbookqa
  - alghafa_sciqa
  - mathlogic_qa
  - agieval
  - mera_worldtree
- Misc Tasks #339 by @hynky1999
  - openai_mmlu_tasks
  - turkish_mmlu_tasks
  - lumi arc
  - hindi/swahili/arabic (from alghafa) arc
  - cmath
  - mgsm
  - xcodah
  - xstory
  - xwinograd + tr winograd
  - mlqa
  - mkqa
  - mintaka
  - mlqa_tasks
  - french triviaqa
  - chegeka
  - acva
  - french_boolq
  - hindi_boolq
- Serbian LLM Benchmark Task by @DeanChugall in #340
- iroko bench by @hynky1999 in #357
Other Tasks
Features
- Now Evaluate OpenAI models by @NathanHB in #359
- New Doc and README by @NathanHB in #327
- Refacto LLM as A Judge by @NathanHB in #337
- Selecting tasks using their superset by @hynky1999 in #308
- Nicer output on task search failure by @hynky1999 in #357
- Adds tasks templating by @hynky1999 in #335
- Support for multilingual generative metrics by @hynky1999 in #293
- Class implementations of faithfulness and extractiveness metrics by @chuandudx in #323
- Translation literals by @hynky1999 in #356
Bug Fixes
- Math normalization: do not crash on invalid format by @guipenedo in #331
- Skipping push to hub test by @clefourrier in #334
- Fix Metrics import path in community task template file. by @chuandudx in #309
- Allow kwargs for BERTScore compute function and remove unused var by @chuandudx in #311
- Fixes sampling for vllm when num_samples==1 by @edbeeching in #343
- Fix the dataset loading for custom tasks by @clefourrier in #364
- Fix: missing property tag in inference endpoints by @clefourrier in #368
- Fix Tokenization + misc fixes by @hynky1999 in #354
- Fix BLEURT evaluation errors by @chuandudx in #316
- Adds Baseline workflow + fixes by @hynky1999 in #363
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @hynky1999
- Support for multilingual generative metrics (#293)
- Adds tasks templating (#335)
- Multilingual NLI Tasks (#329)
- Multilingual COPA tasks (#330)
- Multilingual Hellaswag tasks (#332)
- Multilingual Reading Comprehension tasks (#333)
- Multilingual General Knowledge tasks (#338)
- Selecting tasks using their superset (#308)
- Fix Tokenization + misc fixes (#354)
- Misc-multilingual tasks (#339)
- add iroko bench + nicer output on task search failure (#357)
- Translation literals (#356)
- selected tasks for multilingual evaluation (#371)
- Adds Baseline workflow + fixes (#363)
- @DeanChugall
- Serbian LLM Benchmark Task (#340)
- @NathanHB
New Contributors
- @chuandudx made their first contribution in #323
- @edbeeching made their first contribution in #343
- @DeanChugall made their first contribution in #340
- @Stopwolf made their first contribution in #225
- @martinscooper made their first contribution in #366
Full Changelog: v0.5.0...v0.6.0
v0.5.0
What's new
Features
- Tokenization-wise encoding by @hynky1999 in #287
- Task config by @hynky1999 in #289
Bug fixes
v0.4.0
What's new
Features
- Adds vllm as backend for insane speed up by @NathanHB in #274
- Add llm_as_judge in metrics (using both OpenAI or Transformers) by @NathanHB in #146
- Able to use config files for models by @clefourrier in #131
- List available tasks in the cli: `lighteval tasks --list` by @DimbyTa in #142
- Use torch compile for speed up by @clefourrier in #248
- Add maj@k metric by @clefourrier in #158
- Adds a dummy/random model for baseline init by @guipenedo in #220
- lighteval is now a cli tool: `lighteval --args` by @NathanHB in #152
- We can now log info from the metrics (for example input and response from llm_as_judge) by @NathanHB in #157
- Configurable task versioning by @PhilipMay in #181
- Programmatic interface by @clefourrier in #269
- Probability Metric + New Normalization by @hynky1999 in #276
- Add widgets to the README by @clefourrier in #145
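The maj@k metric added above scores the majority vote over k sampled answers. A minimal, hypothetical sketch of the idea (not lighteval's actual implementation):

```python
from collections import Counter

def maj_at_k(answers, gold):
    """Toy maj@k: take the most frequent of the k sampled answers and
    score that single answer against the gold answer."""
    majority, _count = Counter(answers).most_common(1)[0]
    return float(majority == gold)
```

Majority voting trades extra sampling compute for robustness: a single bad sample no longer decides the score.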
New tasks
- Add `Ger-RAG-eval` tasks by @PhilipMay in #149
- adding `aimo` custom eval by @NathanHB in #154
Fixes
- Bump nltk to 3.9.1 to fix security issue by @NathanHB in #137
- Fix max_length type when being passed in model args by @csarron in #138
- Fix nanotron models input size bug by @clefourrier in #156
- Fix MATH normalization by @lewtun in #162
- fix Prompt function names by @clefourrier in #168
- Fix prompt format german rag community task by @jphme in #171
- add 'cite as' section in readme by @NathanHB in #178
- Fix broken link to extended tasks in README by @alexrs in #182
- Mention HF_TOKEN in readme by @Wauplin in #194
- Download BERT scorer lazily by @sadra-barikbin in #190
- Updated tgi_model and added parameters for endpoint_model by @shaltielshmid in #208
- fix llm as judge warnings by @NathanHB in #173
- ADD GPT-4 as Judge by @philschmid in #206
- Fix a few typos and do a tiny refactor by @sadra-barikbin in #187
- Avoid truncating the outputs based on string lengths by @anton-l in #201
- Now only uses functions for prompt definition by @clefourrier in #213
- Data split depending on eval params by @clefourrier in #169
- should fix most inference endpoints issues of version config by @clefourrier in #226
- Fix _init_max_length in base_model.py by @gucci-j in #185
- Make evaluator invariant of input request type order by @sadra-barikbin in #215
- Fixing issues with multichoice_continuations_start_space - was not parsed properly by @clefourrier in #232
- Fix IFEval metric by @lewtun in #259
- change priority when choosing model dtype by @NathanHB in #263
- Add grammar option to generation by @sadra-barikbin in #242
- make info loggers dataclass, so that their properties have expected lifetime by @hynky1999 in #280
- Remove expensive prediction run during test collection by @hynky1999 in #279
- Example Configs and Docs by @RohitMidha23 in #255
- Refactoring the few shot management by @clefourrier in #272
- Standalone nanotron config by @hynky1999 in #285
- Logging Revamp by @hynky1999 in #284
- bump nltk version by @NathanHB in #290
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @NathanHB
- commit (#137)
- Add llm as judge in metrics (#146)
- Nathan add logging to metrics (#157)
- add 'cite as' section in readme (#178)
- Fix citation section in readme (#180)
- adding aimo custom eval (#154)
- fix llm as judge warnings (#173)
- launch lighteval using `lighteval --args` (#152)
- adds llm as judge using transformers (#223)
- Fix missing json file (#264)
- change priority when choosing model dtype (#263)
- fix the location of tasks list in the readme (#267)
- updates ifeval repo (#268)
- fix nanotron (#283)
- add vllm backend (#274)
- bump nltk version (#290)
- @clefourrier
- Add config files for models (#131)
- Add fun widgets to the README (#145)
- Fix nanotron models input size bug (#156)
- no function we actually use should be named prompt_fn (#168)
- Add maj@k metric (#158)
- Homogeneize logging system (#150)
- Use only dataclasses for task init (#212)
- Now only uses functions for prompt definition (#213)
- Data split depending on eval params (#169)
- should fix most inference endpoints issues of version config (#226)
- Add metrics as functions (#214)
- Quantization related issues (#224)
- Update issue templates (#235)
- remove latex writer since we don't use it (#231)
- Removes default bert scorer init (#234)
- fix (#233)
- updated piqa (#222)
- uses torch compile if provided (#248)
- Fix inference endpoint config (#244)
- Expose samples via the CLI (#228)
- Fixing issues with multichoice_continuations_start_space - was not parsed properly (#232)
- Programmatic interface + cleaner management of requests (#269)
- Small file reorg (only renames/moves) (#271)
- Refactoring the few shot management (#272)
- @PhilipMay
- @shaltielshmid
- @hynky1999
v0.3.0
Release Note
This release introduces the new extended tasks feature, documentation, and many other patches for improved stability.
New tasks are also introduced:
- Big Bench Hard: https://huggingface.co/papers/2210.09261
- AGIEval: https://huggingface.co/papers/2304.06364
- TinyBench
- MT Bench: https://huggingface.co/papers/2306.05685
- AlGhafa Benchmarking Suite: https://aclanthology.org/2023.arabicnlp-1.21/
MT-Bench marks the introduction of multi-turn prompting as well as llm-as-a-judge metric.
New tasks
- Add BBH by @clefourrier in #7, @bilgehanertan in #126
- Add AGIEval by @clefourrier in #121
- Adding TinyBench by @clefourrier in #104
- Adding support for Arabic benchmarks : AlGhafa benchmarking suite by @alielfilali01 in #95
- Add mt-bench by @NathanHB in #75
Features
- Extended Tasks ! by @clefourrier in #101, @lewtun in #108, @NathanHB in #122, #123
- Added support for launching inference endpoint with different model dtypes by @shaltielshmid in #124
Documentation
- Adding LICENSE by @clefourrier in #86, @NathanHB in #89
- Make it clearer in the README that the leaderboard uses the harness by @clefourrier in #94
Small patches
- Update huggingface-hub for compatibility with datasets 2.18 by @clefourrier in #84
- Tidy up dependency groups by @lewtun in #81
- bump git python by @NathanHB in #90
- Sets a max length for the MATH task by @clefourrier in #83
- Fix parallel data processing bug by @clefourrier in #92
- Change the eos condition for GSM8K by @clefourrier in #85
- Fixing rolling loglikelihood management by @clefourrier in #78
- Fixes input length management for generative evals by @clefourrier in #103
- Reorder addition of instruction in chat template by @clefourrier in #111
- Ensure chat models terminate generation with EOS token by @lewtun in #115
- Fix push details to hub by @NathanHB in #98
- Small fixes to InferenceEndpointModel by @shaltielshmid in #112
- Fix import typo autogptq by @clefourrier in #116
- Fixed the loglikelihood method in inference endpoints models by @clefourrier in #119
- Fix TextGenerationResponse import from hfh by @Wauplin in #129
- Do not use deprecated list_files_info by @Wauplin in #133
- Update test workflow name to 'Tests' by @Wauplin in #134
New Contributors
- @shaltielshmid made their first contribution in #112
- @bilgehanertan made their first contribution in #126
- @Wauplin made their first contribution in #129
Full Changelog: v0.2.0...v0.3.0