Conversation

@HallerPatrick
Contributor

Hello!

This PR adds our benchmark, PisaBench, to the eval harness. It is a multilingual, multimodal benchmark based on the PISA study. More information is provided in the task README.

Each instance contains a question, an accompanying image, and a set of multiple-choice answers. Evaluation is done either through substring matching or by using gpt-4o-mini as a judge for each generated answer.
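
Roughly, the substring check amounts to something like this (an illustrative sketch, not the exact scoring code in the PR):

```python
# Illustrative sketch of the substring-matching metric: score 1 if the
# gold multiple-choice option appears anywhere in the model's generation.
def substring_match(generation: str, gold: str) -> bool:
    return gold.strip().lower() in generation.strip().lower()

assert substring_match("The answer is (B) photosynthesis.", "photosynthesis")
```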

This task should be runnable with the model types hf-multimodal and vllm-vlm.
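
For example, via the harness's Python API (a sketch; the task name "pisabench" and the checkpoint are placeholders, see the links below for the registered task names and evaluated models):

```python
# Run the task with the hf-multimodal backend through lm_eval's
# Python API; task name and checkpoint here are placeholders.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf-multimodal",
    model_args="pretrained=Qwen/Qwen2-VL-2B-Instruct",
    tasks=["pisabench"],  # placeholder; check the task README for real names
)
print(results["results"])
```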

  • Link to the HF orga containing the dataset and a leaderboard for current models
  • Link to the paper

@CLAassistant

CLAassistant commented Nov 17, 2025

CLA assistant check
All committers have signed the CLA.

@HallerPatrick
Contributor Author

Sorry for the delay. I didn't see that the CLA check didn't go through. I can't interpret the error from the Tasks Modified check. Am I missing something?

@baberabb
Contributor

baberabb commented Dec 2, 2025

> Sorry for the delay. I didn't see that the CLA check didn't go through. I can't interpret the error from the Tasks Modified check. Am I missing something?

Could I ask you to remove the extension from lm_eval/tasks/pisa/_template.yaml? The check currently expects all .yaml files to be proper task configs, but this one is only used as an include in the other configs.
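
Something like the following, assuming the file is renamed to _template_yaml in line with the harness convention of dropping the extension on shared templates; the task-level keys are illustrative:

```yaml
# lm_eval/tasks/pisa/pisa_en.yaml -- hypothetical per-language config
# pulling in the renamed template (note: no .yaml extension on it)
include: _template_yaml
task: pisa_en
dataset_name: en
```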
