[Do not merge] [pull] main from EleutherAI:main #54
Draft
pull[bot] wants to merge 477 commits into opendatahub-io:main from
Conversation
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
* mmlu pro generation_kwargs: `until` Q: -> Question:
* pacify pre-commit
* change stop token

Co-authored-by: Baber <baber@hey.com>
…2825)
* add afrixnli to task
* add chat completion
* remove chat completion - untested
* afrimmlu added
* afrimmlu folder update
* afrimmlu folder update
* updated prompt
* remove print
* add afrimgsm - direct
* add squad metric
* fix bash script
* remove direct util, update common yaml
* remove print
* add few-shot; metric fixes
* fix direct path, add bash script for gpt models
* added translate test
* update afrixnli tasks
* update afrixnli tasks
* update metrics for afrixnli
* prompt translations fix
* prompt translations fix
* filter and metric fix - mgsm
* remove squad metric
* remove squad metric
* add f1 score to mgsm
* add f1 score to mgsm
* update native-direct with lin
* change f1 function
* add lin to utils
* add utils
* remove test limit
* remove test configs
* add swahili to mmlu
* change eng to ewe in ewe yaml mmlu
* add squad metric to mgsm, remove whitespace filter
* added translate test
* added afrixnli_translate
* fix exact match ValueError
* fix exact match ValueError
* restructure mmlu folder
* spacing
* remove afrimmlu_translate folder
* add utility
* format task name, clean-ups
* modified mgsm
* update on afrimgsm
* update on afrimgsm
* removed utils
* other mgsm varieties
* other mgsm varieties
* adding translate direct
* Update translate_direct_yaml
* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
* edit for open models
* Update translate_direct_yaml
* add verbalizer for xnli
* change xnli from multiple choice to generate
* add manual accuracy scores
* revert xnli to multiple choice
* change afrimgsm utils
* revert xnli to multiple_choice
* cleanups and readmes
* remove openai fixes and unused regex
* pr review changes
* revert metrics.py, task.py and extraction.py to main version
* add afrisenti
* utilities
* pulled from main
* add afrixnli
* add afrimmlu
* update afrixnli prompts
* missing senti language
* fix afrisenti prompt 2
* fix afrisenti prompts
* fix afrisenti prompts
* configure task grouping
* add multiple prompts to afrixnli for irokobench
* add multiple prompts to afrimmlu for irokobench
* Update afrixnli_yaml
* fixes and moves
* fixes and moves
* afrimmlu multiple prompts configs
* remove validation set from afrimmlu
* remove eng from afrimmlu translate test
* correct dataset path
* multiple prompts for mgsm
* file restructure
* afribench grouping
* repo restructuring
* repo restructuring
* update exact match to Hugging Face exact match and add new mgsm language
* remove decontamination
* update generation kwargs
* update generation kwargs for all mgsm prompts
* remove lang
* update generation kwargs for afrimgsm translate test
* add afrimgsm cot for direct and translate
* remove eng from translate-cot
* add masakhaPOS tasks
* remove changes from task script
* add masakhanews tasks
* add uhura arc easy
* add afriqa and belebele files
* add tags for easier run; add naija rc
* add new metrics and transformation scripts
* fix afriqa swa fewshot split
* add naijarc
* add afrobench lite tasks
* update afrobench
* update afrobench
* remove unverified files to avoid bugs
* remove files not needed
* add afrobench tasks
* add afrobench tasks
* change to version 1
* change to version 1
* update afrobench
* update afrobench
* restore metric to original script
* update readme instructions
* add individual dataset readmes
* add link to collections
* correct run script
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* failed run fixes
* failed run fixes
* add afrimgsm cot
* Apply pre-commit fixes
* update mafand dataset name
* pull request fixes
* remove afrihate due to availability

Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>
* added c4 dataset (working)
* fixed bugs in c4
* fixed loading bugs in c4 dataset; using partial loading
* cleaned the code
* added version number for c4
* removed irrelevant files
…#2879)
* fix: pass device arg in model_args in vllm_causallms
* cast device arg to str in vLLM model args
This function was written years ago when the cost of running an OpenAI model was easy to compute. It is no longer viable to support this.
* adding ACPBench_hard
* adding Clingo
* changing tarski to tarski[clingo]
* denoting the main variants in each paper
* add `sglang-generate`
* nit
* nit
* nit
* pacify pre-commit
* Log tokenized request warning only once
* Fix logging for concurrent use case as well
* fix(output_path): support direct JSON file paths
* fix linting
* turn off external LM tests for now
* Update help text for `output_path`

Co-authored-by: Baber <baber@hey.com>
* use images with APIs
* pacify pre-commit
* first version of image resizing
* fixed bug
* clean up `resize_image`

Co-authored-by: Artem Safin <artemsafin67@gmail.com>
Co-authored-by: Baber <baber@hey.com>
changed multimodal check from strict equality
* fix arguments
* pacify pre-commit

Co-authored-by: Baber <baber@hey.com>
fixes #2984 (#2987)
* FIX error due to grouping queries with different continuation lengths: make Collator choose the query with the longest continuation as the candidate for generation
* use max for key selection
* added comments explaining variable continuation length (identical ctx+cont[:-1])

Co-authored-by: Baber <baber@hey.com>
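The fix above hinges on one observation: two loglikelihood requests whose concatenated `ctx + cont[:-1]` tokens are identical feed the model the same input, so they can share one forward pass, but only the member with the longest continuation produces logits covering every grouped request. A minimal sketch of that selection (function and tuple layout are hypothetical, not the harness's actual Collator API):

```python
from collections import defaultdict


def pick_generation_candidates(requests):
    """Group requests sharing identical ctx + cont[:-1] token sequences and,
    per group, pick the member with the LONGEST continuation as the one
    actually run, so its logits cover all grouped requests."""
    groups = defaultdict(list)
    for ctx, cont in requests:  # each request: (context tokens, continuation tokens)
        key = tuple(ctx) + tuple(cont[:-1])  # shared model-input key
        groups[key].append((ctx, cont))
    # Choosing max() by continuation length is the bug fix: picking an
    # arbitrary (shorter) member left other members' logits uncomputed.
    return [max(members, key=lambda r: len(r[1])) for members in groups.values()]


# Both requests below produce the same model input [1, 2, 3]:
reqs = [((1, 2), (3, 4)), ((1, 2, 3), (4,))]
print(pick_generation_candidates(reqs))  # [((1, 2), (3, 4))]
```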
* add data_parallel for V1
* use Process instead of Queue
* ray used if V0 DP
* better error handling
* fix truncation warning comparison
* add arab_culture tasks
* add target_delimiter and remove debugging code
* chore: clean up and extend .gitignore rules
* pacify pre-commit

Co-authored-by: Baber <baber@hey.com>
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices
* add tests
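For context on what the commit above repairs: mutual-information accuracy scores each answer choice by its conditional loglikelihood minus its unconditional loglikelihood, and both terms must be built with the same `target_delimiter` so the token sequences line up. A hedged sketch of the metric itself (signature and pairing are illustrative, not the harness's exact code):

```python
def acc_mutual_info(ll_conditional, ll_unconditional, gold_index):
    """Pick the choice maximizing log P(choice | context) - log P(choice).

    ll_conditional / ll_unconditional are per-choice loglikelihoods and must
    be paired index-by-index -- the bug fixed above was an off slicing of
    these lists, plus unconditional choices missing the target_delimiter.
    """
    scores = [cond - uncond for cond, uncond in zip(ll_conditional, ll_unconditional)]
    return 1.0 if scores.index(max(scores)) == gold_index else 0.0


# Choice 0 gains more information from the context than choice 1:
print(acc_mutual_info([-1.0, -2.0], [-3.0, -2.5], gold_index=0))  # 1.0
```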
* feat: add mbpp_instruct
* fix: update generation_kwargs to use an empty until list
* fix: correct predictions formatting in pass_at_1 function
* fix: improve code block extraction by checking first without opening backticks
* fix mbpp `pass_at_1`
* Added BEAR task config
* Improved README
* Specified empty target_delimiter in BEAR: as suggested by baberabb, the target_delimiter needs to be empty (it is a single whitespace by default). Specifying it correctly reduces the score gap to the reference implementation (lm-pub-quiz).
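A minimal illustration of the override described above, as it would appear in a harness task YAML (surrounding keys omitted; only `target_delimiter` is taken from the commit):

```yaml
# By default the harness joins prompt and target with a single space.
# BEAR's reference implementation (lm-pub-quiz) scores targets with no
# delimiter, so override it to the empty string:
target_delimiter: ""
```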
* Fix call to `modify_gen_kwargs` in `vllm_vlms.py`

  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* pacify pre-commit

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
* fix winml incompatible parameters
* fix lint issues
…EWS tasks (#3567)
* fix: replace non-existent 'headline_text' with 'headline' in MasakhaNEWS tasks

  The MasakhaNEWS task templates referenced 'headline_text', which does not exist in the masakhane/masakhanews dataset; the actual column is 'headline'. This fix replaces all occurrences of 'headline_text' with 'headline' in:
  - all prompt template YAML files (prompt_1 through prompt_5)
  - base task configuration files
  - doc_to_text and doc_to_decontamination_query fields

  Fixes #3516
* correct target field

Co-authored-by: Baber <baber@hey.com>
* fix: use answer_number directly in mgsm_direct doc_to_target
The mgsm_direct tasks used a string-slicing approach on the full CoT
answer field for doc_to_target, which caused few-shot examples to
include chain-of-thought reasoning even though the task expects only
the final numeric answer. This changes all 11 language variants to
use {{answer_number|string}} directly, consistent with the fix
already applied in the catalan_bench, basque_bench, galician_bench,
and spanish_bench variants.
Fixes #2444
* increment version
---------
Co-authored-by: Baber <baber@hey.com>
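The change described above amounts to a one-line template swap in each language variant's YAML. A sketch of the resulting field (the template string is from the commit; surrounding keys are omitted):

```yaml
# mgsm_direct doc_to_target: render only the numeric answer field, so
# few-shot example targets no longer leak chain-of-thought text.
doc_to_target: "{{answer_number|string}}"
```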
…gs`; fix cached `gen_kwargs` (#3582)
* fix(vllm_vlms): fix gen kwargs; types
* refactor(vllm_causallms, vllm_vlms): normalize `modify_gen_kwargs`; types
- Refactors TaskManager into four modules: TaskManager (public API), TaskIndex (YAML discovery), TaskFactory (object construction), and _yaml_loader (YAML parsing).
- Replaces ConfigurableGroup with a Group dataclass that holds direct child references and handles metric aggregation via Group.aggregate().
…ices` conflict (#3588)
* fix(cli): remove `choices` from `--cache_requests` to fix argparse conflict

  `argparse` applies the `type` function before validating against `choices`. Since `type=request_caching_arg_to_dict` converts the string to a dict, the choices validation always fails because a dict never matches a string. Remove `choices` from the argument definition and move validation into `request_caching_arg_to_dict` via `argparse.ArgumentTypeError`. Bug introduced in b315ef3 (PR #3440).
* nit: default bare flag to true

Co-authored-by: Baber Abbasi <baber@hey.com>
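The argparse pitfall described above is easy to reproduce: `type` runs first, so once the converter returns a dict, no string in `choices` can ever match. A minimal sketch of the fixed pattern (the converter's returned keys are hypothetical, not the harness's actual mapping):

```python
import argparse


def request_caching_arg_to_dict(value: str) -> dict:
    # Validation now lives INSIDE the type function, raised as
    # ArgumentTypeError, because argparse applies `type` before it
    # compares the (already-converted) value against `choices`.
    valid = ("true", "refresh", "delete")
    if value not in valid:
        raise argparse.ArgumentTypeError(
            f"invalid choice: {value!r} (choose from {valid})"
        )
    # Hypothetical key names for illustration only:
    return {
        "cache_requests": value in ("true", "refresh"),
        "rewrite_requests_cache": value == "refresh",
        "delete_requests_cache": value == "delete",
    }


parser = argparse.ArgumentParser()
# Note: no `choices=` here -- adding it alongside this `type=` would make
# every invocation fail, since a dict never equals a string choice.
parser.add_argument("--cache_requests", type=request_caching_arg_to_dict)

args = parser.parse_args(["--cache_requests", "refresh"])
print(args.cache_requests["rewrite_requests_cache"])  # True
```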
* support megatron-lm backend
* update megatron-lm backend code
* support data parallel / tensor parallel / pipeline parallel
* update README.md for megatron-lm backend
* remove --use-cpu-initialization
* support expert model parallel
* fix generate_until for ep
* support batch_size > 1 for generate_until
* fix tp > 1
* assert pp > 1
* trim trailing whitespace
* fix: tokenizers and gen_kwargs
* assert batch_size == 1 for generate_until
* fix: Fix accuracy degradation with batch_size > 1

  This fixes batched inference accuracy issues caused by padding tokens being visible to attention when using left-padded inputs.
  - Force TE SelfAttention to use AttnMaskType.arbitrary so the provided 4D (causal + padding) mask is honored (instead of TE assuming causal-only masking).
  - For RoPE models, always use standard position_ids = [0..S-1] across the batch (avoid mask-derived position ids that can distort positional encoding).
  - Construct and pass a 2D attention_mask (0=pad, 1=real) in _loglikelihood_tokens, matching generate_until, so USE_PADDING_MASK consistently controls whether padding is applied.
  - Remove the batch_size == 1 restriction in generate_until and sort generation batches by token length to minimize padding.
* fix a few pre-commit hook errors
* pacify pre-commit

Co-authored-by: Baber <baber@hey.com>
* add missing progress bar
* linting
- Support negative integers (e.g. -1) which isnumeric() missed
- Add None/none detection
- Add explicit quoting to force string type (e.g. revision="123123")
- Add scientific notation support via float() fallback
- Add comprehensive tests for handle_arg_string and simple_parse_args_string

Fixes #2183, related to #2167
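The parsing rules listed above can be sketched as a small value coercer. This is a hedged re-implementation for illustration, not the harness's actual `handle_arg_string`; splitting on commas also assumes no commas inside quoted values:

```python
def parse_value(v: str):
    """Coerce a CLI value string per the rules above."""
    if v.lower() == "none":
        return None  # "None"/"none" detection
    if len(v) >= 2 and v[0] == v[-1] and v[0] in ("'", '"'):
        return v[1:-1]  # explicit quoting forces string, e.g. revision="123123"
    if v.lower() in ("true", "false"):
        return v.lower() == "true"
    try:
        return int(v)  # int() handles "-1", which str.isnumeric() rejects
    except ValueError:
        pass
    try:
        return float(v)  # float() fallback covers scientific notation, "1e-4"
    except ValueError:
        return v  # leave anything else as a plain string


def simple_parse_args_string(s: str) -> dict:
    """Parse 'a=1,b=-1,c=none' into a dict of coerced values."""
    return {
        key: parse_value(val)
        for key, val in (item.split("=", 1) for item in s.split(",") if item)
    }


print(simple_parse_args_string("a=-1,b=none,c='123',d=1e-4,e=true"))
```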
* fix: harden Megatron GPT layer spec setup for eval

  Disable torch compile by default to avoid runtime issues, add MoE/heterogeneous layer spec selection, and robustly override arbitrary attention masks across both single-layer and decoder-block specs.
* avoid failure when attn_mask override is unavailable
* implement distributed gather primitives for megatron eval
* fix: satisfy ruff SIM201 in Megatron layer assertions
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: joshuaswanson <joshuaswanson@users.noreply.github.com>
…3626)
* fix: propagate custom aggregation to dict-valued metric result keys

  When a custom metric function returns a dict (e.g. {'pass@1': 1.0, 'pass@3': 1.0}), process_results() stores each dict key in result_dict instead of the original metric name. _compute_task_aggregations() then looks up the aggregation function by those expanded keys, gets a KeyError, and silently falls back to mean(). Fix: when expanding a dict-valued metric result, copy the custom aggregation (and higher_is_better flag) registered under the originating function name to each new key, so the correct aggregation is used.
* add test

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
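The propagation step described above can be sketched in a few lines. Function and registry names here are hypothetical stand-ins for the harness internals; the point is copying the registered aggregation from the metric's name to each expanded dict key:

```python
def expand_metric_result(result, metric_name, aggregation_registry):
    """Flatten a metric result into per-key entries.

    If the metric function returned a dict (e.g. {'pass@1': 1.0, ...}),
    copy the aggregation registered under `metric_name` to every expanded
    key -- otherwise later aggregation lookups on 'pass@1' would KeyError
    and silently fall back to mean().
    """
    expanded = {}
    if isinstance(result, dict):
        agg_fn = aggregation_registry[metric_name]
        for key, value in result.items():
            expanded[key] = value
            # setdefault: don't clobber an aggregation someone registered
            # explicitly for the expanded key.
            aggregation_registry.setdefault(key, agg_fn)
    else:
        expanded[metric_name] = result
    return expanded


registry = {"pass_at_k": max}  # custom aggregation registered by name
out = expand_metric_result({"pass@1": 1.0, "pass@3": 0.5}, "pass_at_k", registry)
print(registry["pass@1"] is max)  # True: expanded key inherits the aggregation
```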
* fix: Updated model mapping
* fix: Fixed argument passing to WatsonxLLM and updated type hints
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)