[Do not merge] [pull] main from EleutherAI:main #54
Draft
pull[bot] wants to merge 477 commits into opendatahub-io:main from
Conversation
Signed-off-by: kiersten-stokes <kierstenstokes@gmail.com>
* mmlu pro generation_kwargs: `until` Q: -> Question:
* pacify pre-commit
* change stop token

Co-authored-by: Baber <baber@hey.com>
…2825)
* add afrixnli to task
* add chat completion
* remove chat completion - untested
* afrimmlu added
* afrimmlu folder update
* afrimmlu folder update
* updated prompt
* remove print
* add afrimgsm - direct
* add squad metric
* fix bash script
* remove direct util, update common yaml
* remove print
* add few-shot; metric fixes
* fix direct path, add bash script for gpt models
* added translate test
* update afrixnli tasks
* update afrixnli tasks
* update metrics for afrixnli
* prompt translations fix
* prompt translations fix
* filter and metric fix - mgsm
* remove squad metric
* remove squad metric
* add f1 score to mgsm
* add f1 score to mgsm
* update native-direct with lin
* change f1 function
* add lin to utils
* add utils
* remove test limit
* remove test configs
* add swahili to mmlu
* change eng to ewe in ewe yaml mmlu
* add squad metric to mgsm, remove whitespace filter
* added translate test
* added afrixnli_translate
* fix exact match ValueError
* fix exact match ValueError
* restructure mmlu folder
* spacing
* remove afrimmlu_translate folder
* add utility
* format task name, clean-ups
* modified mgsm
* update on afrimgsm
* update on afrimgsm
* removed utils
* other mgsm varieties
* other mgsm varieties
* adding translate direct
* Update translate_direct_yaml
* add manual xnli prompt, add multichoice for openai models, and adapt multichoice metric for openai model
* edit for open models
* Update translate_direct_yaml
* add verbalizer for xnli
* change xnli from multiple choice to generate
* add manual accuracy scores
* revert xnli to multiple choice
* change afrimgsm utils
* revert xnli to multiple_choice
* cleanups and readmes
* remove openai fixes and unused regex
* pr review changes
* revert metrics.py, task.py and extraction.py to main version
* add afrisenti
* utilities
* pulled from main
* add afrixnli
* add afrimmlu
* update afrixnli prompts
* missing senti language
* fix afrisenti prompt 2
* fix afrisenti prompts
* fix afrisenti prompts
* configure task grouping
* add multiple prompts to afrixnli for irokobench
* add multiple prompts to afrimmlu for irokobench
* Update afrixnli_yaml
* fixes and moves
* fixes and moves
* afrimmlu multiple prompts configs
* remove validation set from afrimmlu
* remove eng from afrimmlu translate test
* correct dataset path
* multiple prompts for mgsm
* file restructure
* afribench grouping
* repo restructuring
* repo restructuring
* update exact match to Hugging Face exact match and add new mgsm language
* remove decontamination
* update generation kwargs
* update generation kwargs for all mgsm prompts
* remove lang
* update generation kwargs for afrimgsm translate test
* add afrimgsm cot for direct and translate
* remove eng from translate-cot
* add masakhaPOS tasks
* remove changes from task script
* add masakhanews tasks
* add uhura arc easy
* add afriqa and belebele files
* add tags for easier run; add naija rc
* add new metrics and transformation scripts
* fix afriqa swa fewshot split
* add naijarc
* add afrobench lite tasks
* update afrobench
* update afrobench
* remove unverified files to avoid bugs
* remove files not needed
* add afrobench tasks
* add afrobench tasks
* change to version 1
* change to version 1
* update afrobench
* update afrobench
* restore metric to original script
* update readme instructions
* add individual dataset readmes
* add link to collections
* correct run script
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* align with main
* failed run fixes
* failed run fixes
* add afrimgsm cot
* Apply pre-commit fixes
* update mafand dataset name
* pull request fixes
* remove afrihate due to availability

Co-authored-by: Israel Abebe Azime <azime@cg.uni-saarland.de>
Co-authored-by: Israel Abebe Azime <se.israel.abebe@gmail.com>
Co-authored-by: David Adelani <davlanade@gmail.com>
Co-authored-by: theyorubayesian <akin.o.oladipo@gmail.com>
* added c4 dataset (working)
* fixed bugs in c4
* fixed loading bugs in c4 dataset; using partial loading
* cleaned the code
* added version number for c4
* removed irrelevant files
…#2879)
* fix: pass device arg in model_args in vllm_causallms
* cast device arg to str in vLLM model args
This function was written years ago when the cost of running an OpenAI model was easy to compute. It is no longer viable to support this.
* adding ACPBench_hard
* adding Clingo
* changing tarski to tarski[clingo]
* denoting the main variants in each paper
* add `sglang-generate`
* nit
* nit
* nit
* pacify pre-commit
* Log tokenized request warning only once
* Fix logging for concurrent use case as well
* fix(output_path): support direct JSON file paths
* fix linting
* turn off external LM tests for now
* Update help text for `output_path`

Co-authored-by: Baber <baber@hey.com>
* use images with APIs
* pacify pre-commit
* first version of image resizing
* fixed bug
* clean up `resize_image`

Co-authored-by: Artem Safin <artemsafin67@gmail.com>
Co-authored-by: Baber <baber@hey.com>
changed multimodal check from strict equality
* fix arguments
* pacify pre-commit

Co-authored-by: Baber <baber@hey.com>
fixes #2984 (#2987)
* FIX error due to grouping queries with different continuation lengths: make Collator choose the query with the longest continuation as the candidate for generation
* use max for key selection
* added comments explaining variable continuation length (identical ctx+cont[:-1])

Co-authored-by: Baber <baber@hey.com>
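The fix above hinges on one observation: two loglikelihood requests whose concatenated `ctx + cont[:-1]` tokens are identical feed the model the same input, so they can share one forward pass, but only the member with the longest continuation produces logits covering every grouped request. A minimal sketch of that selection (function and tuple layout are hypothetical, not the harness's actual Collator API):

```python
from collections import defaultdict


def pick_generation_candidates(requests):
    """Group requests sharing identical ctx + cont[:-1] token sequences and,
    per group, pick the member with the LONGEST continuation as the one
    actually run, so its logits cover all grouped requests."""
    groups = defaultdict(list)
    for ctx, cont in requests:  # each request: (context tokens, continuation tokens)
        key = tuple(ctx) + tuple(cont[:-1])  # shared model-input key
        groups[key].append((ctx, cont))
    # Choosing max() by continuation length is the bug fix: picking an
    # arbitrary (shorter) member left other members' logits uncomputed.
    return [max(members, key=lambda r: len(r[1])) for members in groups.values()]


# Both requests below produce the same model input [1, 2, 3]:
reqs = [((1, 2), (3, 4)), ((1, 2, 3), (4,))]
print(pick_generation_candidates(reqs))  # [((1, 2), (3, 4))]
```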
* add data_parallel for V1
* use Process instead of Queue
* ray used if V0 DP
* better error handling
* fix truncation warning comparison
* add arab_culture tasks
* add target_delimiter and remove debugging code
* chore: clean up and extend .gitignore rules
* pacify pre-commit

Co-authored-by: Baber <baber@hey.com>
* fix: bug in acc_mutual_info slicing; add `target_delimiter` to uncond choices
* add tests
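For context on what the commit above repairs: mutual-information accuracy scores each answer choice by its conditional loglikelihood minus its unconditional loglikelihood, and both terms must be built with the same `target_delimiter` so the token sequences line up. A hedged sketch of the metric itself (signature and pairing are illustrative, not the harness's exact code):

```python
def acc_mutual_info(ll_conditional, ll_unconditional, gold_index):
    """Pick the choice maximizing log P(choice | context) - log P(choice).

    ll_conditional / ll_unconditional are per-choice loglikelihoods and must
    be paired index-by-index -- the bug fixed above was an off slicing of
    these lists, plus unconditional choices missing the target_delimiter.
    """
    scores = [cond - uncond for cond, uncond in zip(ll_conditional, ll_unconditional)]
    return 1.0 if scores.index(max(scores)) == gold_index else 0.0


# Choice 0 gains more information from the context than choice 1:
print(acc_mutual_info([-1.0, -2.0], [-3.0, -2.5], gold_index=0))  # 1.0
```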
* feat: add mbpp_instruct
* fix: update generation_kwargs to use an empty until list
* fix: correct predictions formatting in pass_at_1 function
* fix: improve code block extraction by checking first without opening backticks
* fix mbpp `pass_at_1`
* Added BEAR task config
* Improved README
* Specified empty target_delimiter in BEAR: as suggested by baberabb, the target_delimiter needs to be empty (it is a single whitespace by default). Specifying it correctly reduces the score gap to the reference implementation (lm-pub-quiz).
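A minimal illustration of the override described above, as it would appear in a harness task YAML (surrounding keys omitted; only `target_delimiter` is taken from the commit):

```yaml
# By default the harness joins prompt and target with a single space.
# BEAR's reference implementation (lm-pub-quiz) scores targets with no
# delimiter, so override it to the empty string:
target_delimiter: ""
```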
* Fix call to `modify_gen_kwargs` in `vllm_vlms.py`

  Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
* pacify pre-commit

Signed-off-by: Harry Mellor <19981378+hmellor@users.noreply.github.com>
Co-authored-by: Baber <baber@hey.com>
* fix winml incompatible parameters
* fix lint issues
…EWS tasks (#3567)
* fix: replace non-existent 'headline_text' with 'headline' in MasakhaNEWS tasks

  The MasakhaNEWS task templates referenced 'headline_text', which does not exist in the masakhane/masakhanews dataset; the actual column is 'headline'. This fix replaces all occurrences of 'headline_text' with 'headline' in:
  - all prompt template YAML files (prompt_1 through prompt_5)
  - base task configuration files
  - doc_to_text and doc_to_decontamination_query fields

  Fixes #3516
* correct target field

Co-authored-by: Baber <baber@hey.com>
* fix: use answer_number directly in mgsm_direct doc_to_target
The mgsm_direct tasks used a string-slicing approach on the full CoT
answer field for doc_to_target, which caused few-shot examples to
include chain-of-thought reasoning even though the task expects only
the final numeric answer. This changes all 11 language variants to
use {{answer_number|string}} directly, consistent with the fix
already applied in the catalan_bench, basque_bench, galician_bench,
and spanish_bench variants.
Fixes #2444
* increment version
---------
Co-authored-by: Baber <baber@hey.com>
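The change described above amounts to a one-line template swap in each language variant's YAML. A sketch of the resulting field (the template string is from the commit; surrounding keys are omitted):

```yaml
# mgsm_direct doc_to_target: render only the numeric answer field, so
# few-shot example targets no longer leak chain-of-thought text.
doc_to_target: "{{answer_number|string}}"
```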
…gs`; fix cached `gen_kwargs` (#3582)
* fix(vllm_vlms): fix gen kwargs; types
* refactor(vllm_causallms, vllm_vlms): normalize `modify_gen_kwargs`; types
- Refactors TaskManager into four modules: TaskManager (public API), TaskIndex (YAML discovery), TaskFactory (object construction), and _yaml_loader (YAML parsing).
- Replaces ConfigurableGroup with a Group dataclass that holds direct child references and handles metric aggregation via Group.aggregate().
…ices` conflict (#3588)
* fix(cli): remove `choices` from `--cache_requests` to fix argparse conflict

  `argparse` applies the `type` function before validating against `choices`. Since `type=request_caching_arg_to_dict` converts the string to a dict, the choices validation always fails because a dict never matches a string. Remove `choices` from the argument definition and move validation into `request_caching_arg_to_dict` via `argparse.ArgumentTypeError`. Bug introduced in b315ef3 (PR #3440).
* nit: default bare flag to true

Co-authored-by: Baber Abbasi <baber@hey.com>
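The argparse pitfall described above is easy to reproduce: `type` runs first, so once the converter returns a dict, no string in `choices` can ever match. A minimal sketch of the fixed pattern (the converter's returned keys are hypothetical, not the harness's actual mapping):

```python
import argparse


def request_caching_arg_to_dict(value: str) -> dict:
    # Validation now lives INSIDE the type function, raised as
    # ArgumentTypeError, because argparse applies `type` before it
    # compares the (already-converted) value against `choices`.
    valid = ("true", "refresh", "delete")
    if value not in valid:
        raise argparse.ArgumentTypeError(
            f"invalid choice: {value!r} (choose from {valid})"
        )
    # Hypothetical key names for illustration only:
    return {
        "cache_requests": value in ("true", "refresh"),
        "rewrite_requests_cache": value == "refresh",
        "delete_requests_cache": value == "delete",
    }


parser = argparse.ArgumentParser()
# Note: no `choices=` here -- adding it alongside this `type=` would make
# every invocation fail, since a dict never equals a string choice.
parser.add_argument("--cache_requests", type=request_caching_arg_to_dict)

args = parser.parse_args(["--cache_requests", "refresh"])
print(args.cache_requests["rewrite_requests_cache"])  # True
```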
* support megatron-lm backend
* update megatron-lm backend code
* support data parallel / tensor parallel / pipeline parallel
* update README.md for megatron-lm backend
* remove --use-cpu-initialization
* support expert model parallel
* fix generate_until for ep
* support batch_size > 1 for generate_until
* fix tp > 1
* assert pp > 1
* trim trailing whitespace
* fix: tokenizers and gen_kwargs
* assert batch_size == 1 for generate_until
* fix: Fix accuracy degradation with batch_size > 1

  This fixes batched inference accuracy issues caused by padding tokens being visible to attention when using left-padded inputs.
  - Force TE SelfAttention to use AttnMaskType.arbitrary so the provided 4D (causal + padding) mask is honored (instead of TE assuming causal-only masking).
  - For RoPE models, always use standard position_ids = [0..S-1] across the batch (avoid mask-derived position ids that can distort positional encoding).
  - Construct and pass a 2D attention_mask (0=pad, 1=real) in _loglikelihood_tokens, matching generate_until, so USE_PADDING_MASK consistently controls whether padding is applied.
  - Remove the batch_size == 1 restriction in generate_until and sort generation batches by token length to minimize padding.
* fix a few pre-commit hook errors
* pacify pre-commit

Co-authored-by: Baber <baber@hey.com>
* add missing progress bar
* linting
- Support negative integers (e.g. -1) which isnumeric() missed
- Add None/none detection
- Add explicit quoting to force string type (e.g. revision="123123")
- Add scientific notation support via float() fallback
- Add comprehensive tests for handle_arg_string and simple_parse_args_string

Fixes #2183, related to #2167
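The parsing rules listed above can be sketched as a small value coercer. This is a hedged re-implementation for illustration, not the harness's actual `handle_arg_string`; splitting on commas also assumes no commas inside quoted values:

```python
def parse_value(v: str):
    """Coerce a CLI value string per the rules above."""
    if v.lower() == "none":
        return None  # "None"/"none" detection
    if len(v) >= 2 and v[0] == v[-1] and v[0] in ("'", '"'):
        return v[1:-1]  # explicit quoting forces string, e.g. revision="123123"
    if v.lower() in ("true", "false"):
        return v.lower() == "true"
    try:
        return int(v)  # int() handles "-1", which str.isnumeric() rejects
    except ValueError:
        pass
    try:
        return float(v)  # float() fallback covers scientific notation, "1e-4"
    except ValueError:
        return v  # leave anything else as a plain string


def simple_parse_args_string(s: str) -> dict:
    """Parse 'a=1,b=-1,c=none' into a dict of coerced values."""
    return {
        key: parse_value(val)
        for key, val in (item.split("=", 1) for item in s.split(",") if item)
    }


print(simple_parse_args_string("a=-1,b=none,c='123',d=1e-4,e=true"))
```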
* fix: harden Megatron GPT layer spec setup for eval

  Disable torch compile by default to avoid runtime issues, add MoE/heterogeneous layer spec selection, and robustly override arbitrary attention masks across both single-layer and decoder-block specs.
* avoid failure when attn_mask override is unavailable
* implement distributed gather primitives for megatron eval
* fix: satisfy ruff SIM201 in Megatron layer assertions
Signed-off-by: DarkLight1337 <tlleungac@connect.ust.hk>
Co-authored-by: joshuaswanson <joshuaswanson@users.noreply.github.com>
…3626)
* fix: propagate custom aggregation to dict-valued metric result keys

  When a custom metric function returns a dict (e.g. {'pass@1': 1.0, 'pass@3': 1.0}), process_results() stores each dict key in result_dict instead of the original metric name. _compute_task_aggregations() then looks up the aggregation function by those expanded keys, gets a KeyError, and silently falls back to mean(). Fix: when expanding a dict-valued metric result, copy the custom aggregation (and higher_is_better flag) registered under the originating function name to each new key, so the correct aggregation is used.
* add test

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
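The propagation step described above can be sketched in a few lines. Function and registry names here are hypothetical stand-ins for the harness internals; the point is copying the registered aggregation from the metric's name to each expanded dict key:

```python
def expand_metric_result(result, metric_name, aggregation_registry):
    """Flatten a metric result into per-key entries.

    If the metric function returned a dict (e.g. {'pass@1': 1.0, ...}),
    copy the aggregation registered under `metric_name` to every expanded
    key -- otherwise later aggregation lookups on 'pass@1' would KeyError
    and silently fall back to mean().
    """
    expanded = {}
    if isinstance(result, dict):
        agg_fn = aggregation_registry[metric_name]
        for key, value in result.items():
            expanded[key] = value
            # setdefault: don't clobber an aggregation someone registered
            # explicitly for the expanded key.
            aggregation_registry.setdefault(key, agg_fn)
    else:
        expanded[metric_name] = result
    return expanded


registry = {"pass_at_k": max}  # custom aggregation registered by name
out = expand_metric_result({"pass@1": 1.0, "pass@3": 0.5}, "pass_at_k", registry)
print(registry["pass@1"] is max)  # True: expanded key inherits the aggregation
```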
* fix: Updated model mapping
* fix: Fixed argument passing to WatsonxLLM and updated type hints
See Commits and Changes for more details.
Created by
pull[bot] (v2.0.0-alpha.4)