ahmedselhady/wicked-benchmarks

Wicked: A Simple Method to Make Multiple Choice Benchmarks More Challenging


WiCkeD was originally implemented using the Eval-Harness tool. Six mainstream benchmarks are currently supported:

  1. MMLU (WiCkeD Task, Paper)
  2. MMLU-Pro (WiCkeD Task, Paper)
  3. MMLU-Redux (WiCkeD Task, Paper)
  4. AllenAI's ARC Challenge (WiCkeD Task, Paper)
  5. CommonsenseQA (WiCkeD Task, Paper)
  6. TruthfulQA, MC1 task (WiCkeD Task, Paper)

Models are evaluated with multiple-choice prompting and with zero-shot chain-of-thought prompting.

How it works

[Figure: WiCkeD examples]

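Given a benchmark of M examples, each with N choices (1 correct answer and N − 1 distractors), we uniformly sample one option to omit and append the wildcard option "None of the above" to the remaining ones. If the omitted option is the correct answer, "None of the above" becomes the correct choice; otherwise the original answer stays correct.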
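The transformation described above can be sketched in a few lines of Python. This is a minimal illustration, not the code used in the Eval-Harness tasks: the function name `wicked_transform`, its signature, and the representation of an example as a list of choices plus an answer index are assumptions made for this sketch.

```python
import random
from typing import List, Optional, Tuple

NOTA = "None of the above"


def wicked_transform(
    choices: List[str],
    answer_idx: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Hypothetical helper: apply the WiCkeD idea to one example.

    Returns the new list of choices and the index of the correct one.
    """
    if not 0 <= answer_idx < len(choices):
        raise ValueError("answer_idx must index into choices")
    rng = rng or random.Random()

    # Uniformly sample one of the N options to omit.
    omit = rng.randrange(len(choices))
    kept = [c for i, c in enumerate(choices) if i != omit]

    # Append the wildcard option after the remaining N - 1 options.
    new_choices = kept + [NOTA]

    if omit == answer_idx:
        # The correct answer was removed, so the wildcard is now correct.
        new_answer_idx = len(new_choices) - 1
    else:
        # The correct answer survives; shift its index if an earlier
        # option was removed.
        new_answer_idx = answer_idx - 1 if omit < answer_idx else answer_idx
    return new_choices, new_answer_idx


if __name__ == "__main__":
    options = ["Paris", "Rome", "Madrid", "Berlin"]
    new_options, new_idx = wicked_transform(options, 0, random.Random(13))
    print(new_options, "-> correct:", new_options[new_idx])
```

The number of choices stays at N, so the prompt format and the random-guess baseline are unchanged; only the content of one option differs.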

⚠️ WiCkeD can break the coherence of some questions! Therefore, we use an automatic classifier to identify such questions and do not apply WiCkeD to them.

For more details, please refer to our paper, Wickedly Clever.

Results

WiCkeD with Multiple Choice Prompting

[Figure: MCQ results]

WiCkeD with Chain of Thoughts

[Figure: CoT results]

Installation

Requirements

Python >= 3.8

Create and activate the virtual environment

python -m venv $WORK/environments/eval-harness-env
source $WORK/environments/eval-harness-env/bin/activate

Install Eval-Harness

git clone https://github.com/ahmedselhady/wicked.git
cd lm-evaluation-harness
pip install -e . 

Evaluation Run Scripts

Example run scripts are available here

N.B. For some models, you may need to add your 🤗 (Hugging Face) access token.

Citation

@misc{elhady2025wickedsimplemethodmake,
      title={WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging}, 
      author={Ahmed Elhady and Eneko Agirre and Mikel Artetxe},
      year={2025},
      eprint={2502.18316},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18316}, 
}
