WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

🎉 WiCkeD has been accepted to the ACL 2025 Main Conference.
📖 arXiv preprint: WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging

WiCkeD was originally implemented using the Eval-Harness tool. Six mainstream benchmarks are currently supported:

MMLU WiCkeD Task Paper
MMLU-Pro WiCkeD Task Paper
MMLU-Redux WiCkeD Task Paper
AllenAI's ARC Challenge WiCkeD Task Paper
CommonsenseQA WiCkeD Task Paper
TruthfulQA (MC1 task) WiCkeD Task Paper

Models are evaluated with multiple-choice prompting and zero-shot chain-of-thought prompting.

How it works

Given a benchmark of M examples, each with N choices (1 correct answer and N − 1 distractors), we uniformly sample one option to omit and append the wildcard option "None of the above" to the remaining ones. If the omitted option happens to be the correct answer, "None of the above" becomes the correct choice.
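The transformation of a single example can be sketched roughly as follows. This is a minimal illustration, not the code used in the repository: the function name `wicked_transform`, its signature, and the list-plus-index representation of an example are assumptions made for this sketch.

```python
import random
from typing import List, Optional, Tuple

NONE_OF_THE_ABOVE = "None of the above"


def wicked_transform(
    choices: List[str],
    answer_idx: int,
    rng: Optional[random.Random] = None,
) -> Tuple[List[str], int]:
    """Hypothetical helper: drop one uniformly sampled option and append the wildcard.

    Returns the new list of choices and the index of the new correct answer.
    """
    if not 0 <= answer_idx < len(choices):
        raise ValueError("answer_idx is out of range")
    rng = rng or random.Random()

    # Sample the option to omit uniformly over all N options,
    # so the correct answer is omitted with probability 1/N.
    omitted = rng.randrange(len(choices))
    new_choices = [c for i, c in enumerate(choices) if i != omitted]
    new_choices.append(NONE_OF_THE_ABOVE)

    if omitted == answer_idx:
        # The correct answer was removed: the wildcard is now correct.
        new_answer_idx = len(new_choices) - 1
    elif omitted < answer_idx:
        # An earlier distractor was removed: the answer shifts left by one.
        new_answer_idx = answer_idx - 1
    else:
        new_answer_idx = answer_idx
    return new_choices, new_answer_idx


if __name__ == "__main__":
    options = ["Paris", "Rome", "Madrid", "Berlin"]
    new_options, new_answer = wicked_transform(options, 0, random.Random(0))
    print(new_options, "-> correct:", new_options[new_answer])
```

Under this sketch the number of options stays at N, and "None of the above" is the correct answer in roughly 1/N of the transformed examples.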

⚠️ WiCkeD can break the coherence of some questions! We therefore use an automatic classifier to identify such questions and do not apply WiCkeD to them.

For more details, please refer to our Wickedly Clever paper

Results

WiCkeD with Multiple Choice Prompting

WiCkeD with Chain of Thoughts

Installation

Requirements

Python >= 3.8

Create the virtual environment

python -m venv $WORK/environments/eval-harness-env
source $WORK/environments/eval-harness-env/bin/activate

Install Eval-Harness

git clone https://github.com/ahmedselhady/wicked.git
cd wicked/lm-evaluation-harness
pip install -e .

Evaluation Run Scripts

Example run scripts are available here

N.B. For some models, you may need to add your 🤗 access token.

Citation

@misc{elhady2025wickedsimplemethodmake,
      title={WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging}, 
      author={Ahmed Elhady and Eneko Agirre and Mikel Artetxe},
      year={2025},
      eprint={2502.18316},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2502.18316}, 
}
