SinhalaMMLU is a benchmark dataset for evaluating multitask language understanding in Sinhala.
It aims to measure the performance of multilingual and low-resource LLMs on diverse academic and cultural domains.
| Feature | Description |
|---|---|
| Language | Sinhala |
| Format | Multiple-choice questions (MCQs) |
| Entries | 7,044 |
| Subjects | 30 (Humanities, Social Science, STEM, Language, Culture, etc.) |
| Difficulty Levels | Easy / Medium / Hard |
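Each entry is a multiple-choice question tagged with a subject and a difficulty level. As a rough sketch of what one entry might look like (the field names below and the four-option format are illustrative assumptions, not the dataset's actual schema):

```python
# A minimal sketch of one SinhalaMMLU-style MCQ entry.
# Field names ("question", "choices", "answer", "subject", "difficulty")
# and the four-option format are illustrative assumptions,
# not the dataset's documented schema.
sample_entry = {
    "question": "...",                # question text in Sinhala
    "choices": ["...", "...", "...", "..."],  # candidate answers
    "answer": 0,                      # index of the correct choice
    "subject": "History",             # one of the 30 subjects
    "difficulty": "Easy",             # Easy / Medium / Hard
}

def is_valid(entry: dict) -> bool:
    """Basic structural sanity check for an MCQ entry."""
    return (
        isinstance(entry["question"], str)
        and len(entry["choices"]) >= 2
        and 0 <= entry["answer"] < len(entry["choices"])
        and entry["difficulty"] in {"Easy", "Medium", "Hard"}
    )

print(is_valid(sample_entry))  # → True
```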
The SinhalaMMLU dataset includes subjects categorized under six main domains, as shown below.
| Domain | Subjects |
|---|---|
| Humanities | History, Drama and Theatre, Dancing, Eastern Music, Arts, Buddhism, Catholicism, Christianity, Islam, Buddhist Civilization, Oriental Music, History of Sri Lanka, Dancing Indigenous |
| Social Science | Citizenship Education, Health and Physical Science, Geography, Political Science |
| STEM | Physics, Chemistry, Biology, Science |
| Language | Sinhala Language and Literature |
| Business Studies | Business and Accounting Studies, Entrepreneurship Studies, Economics |
| Other | Home Economics, Biosystems Technology, Communication and Media Studies, Design and Construction Technology, Agriculture and Food Technology |
Table 1: Subjects categorized by domain in the SinhalaMMLU dataset.
The following table shows the total number of questions and the average question and answer lengths (in characters) for each difficulty level and domain.
| Difficulty | # Questions | Avg. Question Length | Avg. Answer Length |
|---|---|---|---|
| Easy | 1,893 | 59.08 | 16.77 |
| Medium | 2,585 | 100.66 | 24.79 |
| Hard | 2,566 | 116.40 | 27.53 |

| Domain | # Questions | Avg. Question Length | Avg. Answer Length |
|---|---|---|---|
| STEM | 629 | 157.82 | 27.42 |
| Social Science | 1,084 | 141.80 | 22.34 |
| Humanities | 3,419 | 93.91 | 22.24 |
| Language | 397 | 74.19 | 25.65 |
| Business Studies | 477 | 173.39 | 32.99 |
| Other | 1,038 | 108.58 | 28.24 |

Table 2: Total number of questions and average question and answer lengths (in characters) for each difficulty level and domain.
The overall question count is 7,044.
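The per-group counts in the tables above are internally consistent: both the difficulty-level counts and the domain counts sum to the overall total. A quick check:

```python
# Verify that the per-group question counts reported in the tables
# above sum to the stated overall total of 7,044.
difficulty_counts = {"Easy": 1893, "Medium": 2585, "Hard": 2566}
domain_counts = {
    "STEM": 629,
    "Social Science": 1084,
    "Humanities": 3419,
    "Language": 397,
    "Business Studies": 477,
    "Other": 1038,
}

total_by_difficulty = sum(difficulty_counts.values())
total_by_domain = sum(domain_counts.values())

assert total_by_difficulty == total_by_domain == 7044
print(total_by_difficulty)  # → 7044
```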
The code used for evaluating each model is located in the src/ directory, and the scripts to run these evaluations are provided in the scripts/ directory.
The SinhalaMMLU dataset is released under the Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International (CC BY-NC-ND 4.0) license, which permits sharing the dataset with attribution but prohibits commercial use and the distribution of modified versions of its instances.
If you use SinhalaMMLU in your work, please cite:

```bibtex
@inproceedings{pramodya-etal-2025-sinhalammlu,
    title = "{S}inhala{MMLU}: A Comprehensive Benchmark for Evaluating Multitask Language Understanding in {S}inhala",
    author = "Pramodya, Ashmari and
      Nelki, Nirasha and
      Shalinda, Heshan and
      Liyanage, Chamila and
      Sakai, Yusuke and
      Pushpananda, Randil and
      Weerasinghe, Ruvan and
      Kamigaito, Hidetaka and
      Watanabe, Taro",
    editor = "Christodoulopoulos, Christos and
      Chakraborty, Tanmoy and
      Rose, Carolyn and
      Peng, Violet",
    booktitle = "Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing",
    month = nov,
    year = "2025",
    address = "Suzhou, China",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.emnlp-main.1673/",
    pages = "32931--32949",
    ISBN = "979-8-89176-332-6"
}
```