This repository contains the evaluation code and data for the PalmX 2025 Shared Task on Benchmarking LLMs for Arabic and Islamic Culture.
A lightweight CLI to evaluate Hugging Face causal LMs on the PalmX 2025 subtask datasets (culture and islamic) using next-token log-likelihood over letter choices (A/B/C/D).
Datasets on HF Hub:
- `UBC-NLP/palmx_2025_subtask1_culture`
- `UBC-NLP/palmx_2025_subtask2_islamic`
```shell
# (optional) create & activate a virtualenv first
pip install -r requirements.txt
```

If your model is gated (e.g., some Llama variants or a private model), log in first:

```shell
huggingface-cli login
```
```shell
python run_evaluation.py \
  --model_name UBC-NLP/NileChat-3B \
  --subtask culture \
  --phase dev \
  --batch_size 8 \
  --predictions_file predictions.csv \
  --log_outputs
```
- `--model_name`: HF model id or a local path to a directory containing weights & tokenizer.
- `--subtask`: `culture` or `islamic`.
- `--phase`: dataset split, `dev` or `test`.
- `--batch_size`: batch size used for scoring the choices (default: `8`).
- `--predictions_file`: path to save the predictions CSV (default: `predictions.csv`).
- `--log_outputs`: if provided, writes a detailed per-item CSV to `outputs_log.csv` (customizable via `--log_file`).
- `--scores_file`: path where the final accuracy is written as `accuracy=<float>` (default: `scores.txt`).
- `predictions.csv` — two columns: `id`, `prediction` (the predicted letter label, e.g., `A`, `B`, ...).
- `scores.txt` — one line with the final accuracy in `key=value` format, e.g., `accuracy=0.873500`.
- `outputs_log.csv` (when `--log_outputs` is set) — per-item details including the question/choices, per-choice scores and probabilities, the ground truth, and correctness.
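As a quick sanity check, the predictions file can be read back with the standard library. The helper below is illustrative and not part of the repo's code; it only assumes the two-column layout described above:

```python
# Illustrative helper (not part of the repo): read a predictions CSV back into
# a dict mapping item id -> predicted letter.
import csv
import io

def load_predictions(f):
    """Map each item id to its predicted letter from a predictions CSV."""
    return {row["id"]: row["prediction"] for row in csv.DictReader(f)}

# io.StringIO stands in for an open predictions.csv file
sample = io.StringIO("id,prediction\n1,A\n2,C\n")
preds = load_predictions(sample)  # {'1': 'A', '2': 'C'}
```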
For each MCQ item, the script formats a prompt like:

```
{question}
A. ...
B. ...
C. ...
D. ...
الجواب:
```
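A minimal sketch of this prompt construction (the helper name `build_prompt` is illustrative, not the repo's actual function):

```python
def build_prompt(question, choices):
    """Format an MCQ item into the prompt layout shown above.

    Illustrative sketch; `build_prompt` is not the repo's actual function name.
    """
    letters = ["A", "B", "C", "D"]
    lines = [question] + [f"{letter}. {text}" for letter, text in zip(letters, choices)]
    lines.append("الجواب:")  # Arabic for "The answer:"
    return "\n".join(lines)

prompt = build_prompt("سؤال؟", ["خيار 1", "خيار 2", "خيار 3", "خيار 4"])
```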
It then scores the log-likelihood of the next token being A, B, C, or D, applies a numerically stable softmax over these log-likelihoods to obtain per-choice probabilities, and picks the argmax as the predicted label.
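The probability-and-argmax step can be sketched in plain Python (illustrative, not the repo's implementation):

```python
import math

def pick_choice(log_likelihoods, labels=("A", "B", "C", "D")):
    """Numerically stable softmax over per-choice log-likelihoods, then argmax."""
    m = max(log_likelihoods)                       # subtract the max to avoid overflow
    exps = [math.exp(x - m) for x in log_likelihoods]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(labels)), key=lambda i: probs[i])
    return labels[best], probs

# the least-negative log-likelihood wins
label, probs = pick_choice([-2.1, -0.3, -4.0, -3.2])  # label == "B"
```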
- GPU is auto-detected. If you run out of memory, reduce `--batch_size` or try a smaller model.
- If your tokenizer has no pad token, we set it to EOS to allow batching with padding.
- Some models may need `--trust_remote_code` or a different precision; customize `palmx_eval/processor.py` if needed.
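The pad-token fallback noted above follows a common pattern for Hugging Face tokenizers, which expose `pad_token` / `eos_token` attributes. The stand-in object below just makes the logic runnable without downloading a real tokenizer:

```python
def ensure_pad_token(tokenizer):
    """If the tokenizer lacks a pad token, reuse EOS so padded batching works."""
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer

class _TokStub:
    # minimal stand-in for a tokenizer loaded without a pad token
    pad_token = None
    eos_token = "</s>"

tok = ensure_pad_token(_TokStub())  # tok.pad_token is now "</s>"
```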
If you use this dataset or code in your research, please cite:
```bibtex
@misc{alwajih2025palmx2025sharedtask,
  title={PalmX 2025: The First Shared Task on Benchmarking LLMs on Arabic and Islamic Culture},
  author={Fakhraddin Alwajih and Abdellah El Mekki and Hamdy Mubarak and Majd Hawasly and Abubakr Mohamed and Muhammad Abdul-Mageed},
  year={2025},
  eprint={2509.02550},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2509.02550},
}
```

And the original Palm dataset paper:
```bibtex
@inproceedings{alwajih-etal-2025-palm,
  title = "Palm: A Culturally Inclusive and Linguistically Diverse Dataset for {A}rabic {LLM}s",
  author = "Alwajih, Fakhraddin and
    El Mekki, Abdellah and
    Magdy, Samar Mohamed and
    Elmadany, AbdelRahim A. and
    Nacar, Omer and
    Nagoudi, El Moatez Billah and
    Abdel-Salam, Reem and
    Atwany, Hanin and
    Nafea, Youssef and
    others",
  editor = "Che, Wanxiang and
    Nabende, Joyce and
    Shutova, Ekaterina and
    Pilehvar, Mohammad Taher",
  booktitle = "Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)",
  month = jul,
  year = "2025",
  address = "Vienna, Austria",
  publisher = "Association for Computational Linguistics",
  url = "https://aclanthology.org/2025.acl-long.1579/",
  doi = "10.18653/v1/2025.acl-long.1579",
  pages = "32871--32894",
  ISBN = "979-8-89176-251-0"
}
```
This project is licensed under the CC-BY-NC-ND-4.0 License.
For questions or feedback, please open an issue on this repository.