Installation from source:

```bash
git clone https://github.com/sapienzanlp/MetaQAEval.git
cd MetaQAEval
conda create -n meta-qa-eval python==3.12
conda activate meta-qa-eval
pip install -e .
```

Under the `scripts` folder, we provide the scripts to:
- Generate the outputs for a specific LLM over a given dataset.
- Run the evaluation of the outputs generated with an LLM over a given dataset using:
  - RegEx and xFinder (for free-text generation).
  - Logprobs and Perplexity (for first-token probabilities).
- Obtain the statistics for the MMLU categories and subcategories (MMLU domains).
- Run the adversarial experiments for LLM-based evaluation strategies (xFinder) using:
  - Our newly-introduced resource: MMLU-Adversarial.
  - Prompts testing the ability of xFinder to solve the MCQA task.
In the following scripts, you can specify three parameters: `DATASET`, `DATASET_NAME`, and `MODEL`.
`DATASET` can be one of `mmlu`, `arc`, or `obqa`. `DATASET_NAME` is the `dataset_name` field of the dataset in the original Hugging Face repository; for `mmlu`, the name will be `all` to evaluate all the categories. `MODEL` is the Hugging Face ID of the chosen model.
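Taken together, a full run for one model invokes the generation script once per dataset. The dry-run sketch below only *prints* the generation commands it would launch, using the `(DATASET, DATASET_NAME)` pairs that appear in the examples in this README; adapt the list to your own datasets:

```shell
# Dry run: print the generation commands for one model over two
# (DATASET, DATASET_NAME) pairs taken from the examples in this README.
MODEL="meta-llama/Llama-3.1-8B-Instruct"
for pair in "mmlu all" "arc ARC-Challenge"; do
  set -- $pair  # word-split the pair into $1 (DATASET) and $2 (DATASET_NAME)
  echo bash scripts/generate_output.sh "$1" "$2" "$MODEL"
done
```

Dropping the `echo` turns the dry run into an actual batch launch.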
```bash
bash scripts/generate_output.sh <DATASET> <DATASET_NAME> <MODEL>
```

For example:

```bash
bash scripts/generate_output.sh mmlu all meta-llama/Llama-3.1-8B-Instruct
```

```bash
bash scripts/eval.sh <DATASET> <MODEL> # will execute regex and xfinder evaluation
bash scripts/logprobs.sh <DATASET> <MODEL>
bash scripts/perplexity.sh <DATASET> <MODEL>
```

For example:

```bash
bash scripts/eval.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/logprobs.sh mmlu meta-llama/Llama-3.1-8B-Instruct
bash scripts/perplexity.sh mmlu meta-llama/Llama-3.1-8B-Instruct
```

```bash
bash scripts/mmlu_domains_analysis.sh <MODEL>
```

For example:

```bash
bash scripts/mmlu_domains_analysis.sh meta-llama/Llama-3.1-8B-Instruct
```

```bash
bash scripts/xfinder_mmlu-adversarial.sh
bash scripts/xfinder_adversarial.sh <DATASET> <DATASET_NAME> <PROMPT_ID>
```

`PROMPT_ID` is the id of the prompt in `prompts/xfinder_adversarial_prompts.json`.

For example:

```bash
bash scripts/xfinder_mmlu-adversarial.sh
bash scripts/xfinder_adversarial.sh arc ARC-Challenge 1
```

If you use any part of this work, please consider citing the paper as follows:
```bibtex
@inproceedings{molfese2025rightanswerwrongscore,
    title={Right Answer, Wrong Score: Uncovering the Inconsistencies of LLM Evaluation in Multiple-Choice Question Answering},
    author={Francesco Maria Molfese and Luca Moroni and Luca Gioffrè and Alessandro Scirè and Simone Conia and Roberto Navigli},
    booktitle={Findings of the Association for Computational Linguistics: ACL 2025},
    pages={},
    year={2025}
}
```

The data and software are licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0.
We gratefully acknowledge the support of Future AI Research (PNRR MUR project PE0000013-FAIR).