Add support for Titulm Bangla MMLU dataset (#3317)

Ismail-Hossain-1 · baberabb · web-flow · commit d5ddccd92051 · 2025-10-14T18:14:29.000+05:00
* Added YAML task for [task/bangla]

* brief description of the task

* Update README.md

fix

---------

Co-authored-by: Baber Abbasi <92168766+baberabb@users.noreply.github.com>
diff --git a/lm_eval/tasks/README.md b/lm_eval/tasks/README.md
@@ -26,6 +26,7 @@ provided to the individual README.md files for each subfolder.
 | [asdiv](asdiv/README.md)                                                 | Tasks involving arithmetic and mathematical reasoning challenges.                                                                                                                                                                                                                                                                      | English                                                                                                                                                                                                                                                       |
 | [babi](babi/README.md)                                                   | Tasks designed as question and answering challenges based on simulated stories.                                                                                                                                                                                                                                                        | English                                                                                                                                                                                                                                                       |
 | [babilong](babilong/README.md)                                           | Tasks designed to test whether models can find and reason over facts in long contexts.                                                                                                                                                                                                                                                 | English                                                                                                                                                                                                                                                       |
+| [bangla_mmlu](bangla/README.md)                                         | Benchmark dataset for evaluating language models' performance on Bangla (Bengali) language tasks. Includes diverse NLP tasks to measure model understanding and generation capabilities in Bangla.                                                                    | Bengali/Bangla                                                                                                    |
 | [basque_bench](basque_bench/README.md)                                   | Collection of tasks in Basque encompassing various evaluation areas.                                                                                                                                                                                                                                                                   | Basque                                                                                                                                                                                                                                                        |
 | [basqueglue](basqueglue/README.md)                                       | Tasks designed to evaluate language understanding in Basque language.                                                                                                                                                                                                                                                                  | Basque                                                                                                                                                                                                                                                        |
 | [bbh](bbh/README.md)                                                     | Tasks focused on deep semantic understanding through hypothesization and reasoning.                                                                                                                                                                                                                                                    | English, German                                                                                                                                                                                                                                               |
diff --git a/lm_eval/tasks/bangla/README.md b/lm_eval/tasks/bangla/README.md
@@ -0,0 +1,50 @@
+# Titulm Bangla MMLU
+
+This directory contains resources related to **Titulm Bangla MMLU**, a benchmark dataset designed for evaluating Bangla language models. The dataset is used for training, development, and comparative evaluation of language models in the Bangla language.
+
+---
+
+## Overview
+
+**TituLLMs** is a family of Bangla large language models (LLMs) released with comprehensive benchmarking to advance natural language processing for the Bangla language. The benchmark dataset `Titulm Bangla MMLU` consists of multiple-choice questions covering a diverse range of topics in Bangla.
+
+This dataset is primarily used to train, validate, and evaluate Bangla language models and compare their performance with other existing models.
+
+For more details, please refer to the original research paper:  
+[https://arxiv.org/abs/2502.11187](https://arxiv.org/abs/2502.11187)
+
+
+---
+
+## Dataset
+
+The `Titulm Bangla MMLU` dataset can be found on Hugging Face:  
+[https://huggingface.co/datasets/hishab/titulm-bangla-mmlu](https://huggingface.co/datasets/hishab/titulm-bangla-mmlu)
+
+This dataset was used as a benchmark in the development and evaluation of TituLLMs and related models.
+
+---
+
+## Usage
+
+The dataset is intended for use with [lm-evaluation-harness](https://github.com/EleutherAI/lm-evaluation-harness), where it is registered as the `bangla_mmlu` task, to evaluate and compare the performance of Bangla language models.
+
+---
+
+## Notes
+
+The dataset can also be used to evaluate other models. Other datasets like boolq, openbookqa ... are soon to be added.
+## Citation
+
+If you use this dataset or model, please cite the original paper:
+
+```bibtex
+@misc{nahin2025titullmsfamilybanglallms,
+      title={TituLLMs: A Family of Bangla LLMs with Comprehensive Benchmarking},
+      author={Shahriar Kabir Nahin and Rabindra Nath Nandi and Sagor Sarker and Quazi Sarwar Muhtaseem and Md Kowsher and Apu Chandraw Shill and Md Ibrahim and Mehadi Hasan Menon and Tareq Al Muntasir and Firoj Alam},
+      year={2025},
+      eprint={2502.11187},
+      archivePrefix={arXiv},
+      primaryClass={cs.CL},
+      url={https://arxiv.org/abs/2502.11187},
+}
diff --git a/lm_eval/tasks/bangla/bangla_mmlu_test.yaml b/lm_eval/tasks/bangla/bangla_mmlu_test.yaml
@@ -0,0 +1,20 @@
+
+task: bangla_mmlu
+dataset_path: hishab/titulm-bangla-mmlu
+dataset_name: all
+description: "The following are multiple choice questions (with answers) about a range of topics in Bangla"
+test_split: test
+fewshot_split: dev
+fewshot_config:
+  sampler: first_n
+output_type: multiple_choice
+doc_to_text: "{{question.strip()}} A. {{options[0]}} B. {{options[1]}} C. {{options[2]}} D. {{options[3]}} Answer:"
+doc_to_choice: ["A", "B", "C", "D"]
+doc_to_target: answer
+metric_list:
+  - metric: acc
+    aggregation: mean
+    higher_is_better: true
+  - metric: acc_norm
+    aggregation: mean
+    higher_is_better: true
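
Reviewer note: the prompt that `doc_to_text` produces can be illustrated with a small plain-Python sketch. This is not the harness's own code: it mimics the template with an f-string rather than the Jinja2 rendering the harness performs. The sample document and the `render_prompt` and `target_index` helpers are hypothetical, and the assumption that `answer` holds a letter such as `"B"` is not confirmed by this diff.

```python
# Choice labels, matching doc_to_choice in bangla_mmlu_test.yaml.
CHOICES = ["A", "B", "C", "D"]

# Hypothetical document with the fields the YAML template references.
# Real rows come from hishab/titulm-bangla-mmlu and may differ in detail.
sample_doc = {
    "question": "  2 + 2 = ?  ",
    "options": ["3", "4", "5", "6"],
    "answer": "B",  # assumption: the gold answer is stored as a letter
}


def render_prompt(doc):
    """Mimic doc_to_text: stripped question, lettered options, 'Answer:'."""
    lettered = " ".join(
        f"{letter}. {option}" for letter, option in zip(CHOICES, doc["options"])
    )
    return f"{doc['question'].strip()} {lettered} Answer:"


def target_index(doc):
    """Map the assumed letter answer to its position in CHOICES."""
    return CHOICES.index(doc["answer"])


prompt = render_prompt(sample_doc)
gold = target_index(sample_doc)
print(prompt)
print(gold)
```

Note that the template puts the question and all four options on a single line separated by spaces, so long options will run together in the rendered prompt.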