This repository contains the evaluation code and data for the PalmX 2025 Shared Task on Benchmarking LLMs for Arabic and Islamic Culture.
A lightweight CLI to evaluate Hugging Face causal LMs on the PalmX 2025 subtask datasets (`culture` and `islamic`) using next-token log-likelihoods over the letter choices (A/B/C/D).
Datasets on the HF Hub:

- `UBC-NLP/palmx_2025_subtask1_culture`
- `UBC-NLP/palmx_2025_subtask2_islamic`
```bash
# (optional) create & activate a virtualenv first
pip install -r requirements.txt
```
If your model is gated (e.g., some Llama variants or a private model), login first:
```bash
huggingface-cli login
```
```bash
python run_evaluation.py \
  --model_name UBC-NLP/NileChat-3B \
  --subtask culture \
  --phase dev \
  --batch_size 8 \
  --predictions_file predictions.csv \
  --log_outputs
```
- `--model_name`: HF model ID or a local path to a directory containing weights & tokenizer.
- `--subtask`: `culture` or `islamic`.
- `--phase`: dataset split, `dev` or `test`.
- `--batch_size`: batch size used for scoring the choices (default: `8`).
- `--predictions_file`: path to save the predictions CSV (default: `predictions.csv`).
- `--log_outputs`: if provided, writes a detailed per-item CSV to `outputs_log.csv` (customizable via `--log_file`).
- `--scores_file`: path where the final accuracy is written as `accuracy=<float>` (default: `scores.txt`).
- `predictions.csv` — two columns: `id`, `prediction` (predicted letter label, e.g., `A`, `B`, ...).
- `scores.txt` — one line with the final accuracy in `key=value` format, e.g., `accuracy=0.873500`.
- `outputs_log.csv` (when `--log_outputs` is set) — per-item details including the question and choices, per-choice scores and probabilities, the ground-truth label, and correctness.
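To illustrate how the `accuracy=<float>` line in `scores.txt` relates to `predictions.csv`, here is a minimal sketch (not the repo's actual code) that compares a predictions CSV against gold labels; the `gold` dict and its ID/label values are made up for the example:

```python
import csv
import io

def accuracy_from_predictions(pred_csv_text, gold):
    """Compare a predictions CSV (columns: id, prediction) against gold labels."""
    rows = list(csv.DictReader(io.StringIO(pred_csv_text)))
    correct = sum(1 for r in rows if gold.get(r["id"]) == r["prediction"])
    return correct / len(rows)

# Toy data: 2 of 3 predictions match the gold labels.
preds = "id,prediction\n1,A\n2,C\n3,B\n"
gold = {"1": "A", "2": "B", "3": "B"}
acc = accuracy_from_predictions(preds, gold)
print(f"accuracy={acc:.6f}")  # accuracy=0.666667
```

The final line mirrors the `key=value` format written to `scores.txt`.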
For each MCQ item, the script formats a prompt like:
```
{question}
A. ...
B. ...
C. ...
D. ...
الجواب:
```

The trailing `الجواب:` ("The answer:") cues the model to emit a letter.
Then it scores the log-likelihood of the next token being `A`, `B`, `C`, or `D`. It applies a numerically stable softmax over these log-likelihoods to produce per-choice probabilities and picks the argmax as the predicted label.
- GPU is auto-detected. If you run out of memory, reduce `--batch_size` or try a smaller model.
- If your tokenizer has no pad token, we set it to EOS to allow batching with padding.
- Some models may need `--trust_remote_code` or a different precision; customize `palmx_eval/processor.py` if needed.
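The pad-token fallback in the second note amounts to the following pattern. The `Tok` class here is only a minimal stand-in so the snippet is self-contained; with a real Hugging Face tokenizer the same two-line check applies:

```python
class Tok:
    """Minimal stand-in for a Hugging Face tokenizer (illustration only)."""
    def __init__(self):
        self.eos_token = "</s>"
        self.pad_token = None  # many causal-LM tokenizers ship without one

def ensure_pad_token(tokenizer):
    # Fall back to EOS as the pad token so batched inputs can be padded.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

tok = ensure_pad_token(Tok())
print(tok.pad_token)  # </s>
```

Padding with EOS is harmless here because the padded positions are masked out when scoring the answer letters.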
If you use this dataset or code in your research, please cite:
@misc{alwajih2025palmx2025sharedtask,
title={PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture},
author={Fakhraddin Alwajih and Abdellah El Mekki and Hamdy Mubarak and Majd Hawasly and Abubakr Mohamed and Muhammad Abdul-Mageed},
year={2025},
eprint={2509.02550},
archivePrefix={arXiv},
primaryClass={cs.CL},
url={https://arxiv.org/abs/2509.02550},
}
and the original Palm dataset paper:
@inproceedings{alwajih-etal-2025-palm,
title = "Palm: A Culturally Inclusive and Linguistically Diverse Dataset for {A}rabic {LLM}s",
author = "Alwajih, Fakhraddin and
El Mekki, Abdellah and
Magdy, Samar Mohamed and
Elmadany, AbdelRahim A. and
Nacar, Omer and
Nagoudi, El Moatez Billah and
Abdel-Salam, Reem and
Atwany, Hanin and
Nafea, Youssef and
others",
editor = "Che, Wanxiang and
Nabende, Joyce and
Shutova, Ekaterina and
Pilehvar, Mohammad Taher",
booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
month = jul,
year = "2025",
address = "Vienna, Austria",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.acl-long.1579/",
doi = "10.18653/v1/2025.acl-long.1579",
pages = "32871--32894",
ISBN = "979-8-89176-251-0"
}
This project is licensed under the CC-BY-NC-ND-4.0 License.
For questions or feedback, please open an issue on this repository.