- 🎉 WiCkeD has been accepted to ACL 2025 (Main Conference).
- 📖 arXiv preprint: WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging
WiCkeD was originally implemented on top of the Eval-Harness tool. Currently, six mainstream benchmarks are supported:
- MMLU WiCkeD Task Paper
- MMLU-Pro WiCkeD Task Paper
- MMLU-Redux WiCkeD Task Paper
- AllenAI's ARC Challenge WiCkeD Task Paper
- CommonsenseQA WiCkeD Task Paper
- TruthfulQA (MC1 task) WiCkeD Task Paper
Models are evaluated with multiple-choice prompting and 0-shot chain-of-thought.
Given a benchmark of M examples, each with N choices (1 correct answer and N − 1 distractors), we uniformly sample one option to omit and append the wildcard option "None of the above" to the remaining ones.
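The transformation above can be sketched in a few lines of Python. This is an illustrative sketch, not the repository's actual code: the function name and signature are hypothetical, and a seeded RNG is used only to make the example reproducible.

```python
import random

def wicked_transform(question, choices, answer_idx, rng=random):
    """Apply the WiCkeD transformation to one multiple-choice example.

    Uniformly samples one of the N options to omit, then appends the
    wildcard "None of the above" to the remaining N - 1 choices. If the
    omitted option was the correct answer, the wildcard becomes the new
    gold label. (Hypothetical helper; not the paper's actual code.)
    """
    omit = rng.randrange(len(choices))
    new_choices = [c for i, c in enumerate(choices) if i != omit]
    new_choices.append("None of the above")
    if omit == answer_idx:
        new_answer = len(new_choices) - 1  # wildcard is now the gold label
    else:
        # indices after the omitted option shift down by one
        new_answer = answer_idx - (1 if omit < answer_idx else 0)
    return question, new_choices, new_answer

# Example: a 4-choice question becomes 3 original choices + the wildcard
q, ch, ans = wicked_transform(
    "What is 2 + 2?", ["3", "4", "5", "6"], answer_idx=1,
    rng=random.Random(0),
)
```

Note that even when the correct answer survives the sampling, the example gets harder: the model must also rule out the wildcard option.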
For more details, please refer to our Wickedly Clever paper
Requires Python >= 3.8.

```shell
python -m venv $WORK/environments/eval-harness-env
source $WORK/environments/eval-harness-env/bin/activate
git clone https://github.com/ahmedselhady/wicked.git
cd wicked
pip install -e .
```
Example run scripts are available here.
N.B. For some models, you may need to add your 🤗 Hugging Face access token.
```bibtex
@misc{elhady2025wickedsimplemethodmake,
  title={WiCkeD: A Simple Method to Make Multiple Choice Benchmarks More Challenging},
  author={Ahmed Elhady and Eneko Agirre and Mikel Artetxe},
  year={2025},
  eprint={2502.18316},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2502.18316},
}
```


