Installation from source:

```bash
git clone https://github.com/sapienzanlp/MetaQAEval.git
cd MetaQAEval
conda create -n meta-qa-eval python==3.12
conda activate meta-qa-eval
pip install -e .
```

Under the `scripts` folder, we provide the scripts to:
- Generate the outputs for a specific LLM over a given dataset.
- Run the evaluation of the outputs generated with an LLM over a given dataset using:
  - RegEx and xFinder (for free-text generation).
  - Logprobs and Perplexity (for first-token probabilities).
- Obtain the statistics for the MMLU categories and subcategories (MMLU domains).
- Run the adversarial experiments for LLM-based evaluation strategies (xFinder) using:
  - Our newly-introduced resource: MMLU-Adversarial.
  - Prompts testing the ability of xFinder to solve the MCQA task.
In the following scripts, you can specify three parameters: `DATASET`, `DATASET_NAME`, and `MODEL`.
`DATASET` can be one of `mmlu`, `arc`, or `obqa`. `DATASET_NAME` is the `dataset_name` field of the dataset in the original Hugging Face repository; for `mmlu`, the name will be `all` to evaluate all the categories. `MODEL` is the Hugging Face ID of the chosen model.
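Taken together, a full run for one model invokes the generation script once per dataset. The dry-run sketch below only *prints* the generation commands it would launch, using the `(DATASET, DATASET_NAME)` pairs that appear in the examples in this README; adapt the list to your own datasets:

```shell
# Dry run: print the generation commands for one model over two
# (DATASET, DATASET_NAME) pairs taken from the examples in this README.
MODEL="meta-llama/Llama-3.1-8B-Instruct"
for pair in "mmlu all" "arc ARC-Challenge"; do
  set -- $pair  # word-split the pair into $1 (DATASET) and $2 (DATASET_NAME)
  echo bash scripts/generate_output.sh "$1" "$2" "$MODEL"
done
```

Dropping the `echo` turns the dry run into an actual batch launch.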
```bash
bash scripts/generate_output.sh <DATASET> <DATASET_NAME> <MODEL>
```

For example:

```bash
bash scripts/generate_output.sh mmlu all meta-llama/Llama-3.1-8B-Instruct
```

```bash
bash scripts/eval.sh <DATASET> <MODEL> # will execute regex and xfinder evaluation
bash scripts/logprobs.sh <DATASET> <MODEL>
bash scripts/perplexity.sh <DATASET> <MODEL>
```

For example:

```bash
bash scripts/eval.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/logprobs.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/perplexity.sh mmlu meta-llama/Llama-3.1-8B-Instruct
```

```bash
bash scripts/mmlu_domains_analysis.sh <MODEL>
```

For example:

```bash
bash scripts/mmlu_domains_analysis.sh meta-llama/Llama-3.1-8B-Instruct
```

```bash
bash scripts/xfinder_mmlu-adversarial.sh
bash scripts/xfinder_adversarial.sh <DATASET> <DATASET_NAME> <PROMPT_ID>
```

`PROMPT_ID` is the id of the prompt in `prompts/xfinder_adversarial_prompts.json`.

For example:

```bash
bash scripts/xfinder_mmlu-adversarial.sh
bash scripts/xfinder_adversarial.sh arc ARC-Challenge 1
```

If you use any part of this work, please consider citing the paper as follows:
```bibtex
@inproceedings{molfese2025rightanswerwrongscore,
    title={Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering},
    author={Francesco Maria Molfese and Luca Moroni and Luca Gioffrè and Alessandro Scirè and Simone Conia and Roberto Navigli},
    booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
    pages={},
    year={2025}
}
```

The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.
We gratefully acknowledge the support of Future AI Research (PNRR MUR project PE0000013-FAIR).