# MMMLU

## Overview

MMMLU (Multilingual Massive Multitask Language Understanding) is a multilingual extension of the MMLU benchmark. It evaluates the multilingual knowledge and reasoning capabilities of language models across 14 languages, covering the 57 subjects of the original MMLU benchmark.

## Task Description

- **Task Type**: Multilingual multiple-choice question answering
- **Input**: Question with four answer choices (A, B, C, D) in one of 14 languages
- **Output**: A single answer letter
- **Languages**: Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, Chinese
- **Subjects**: 57 subjects from MMLU (STEM, Humanities, Social Sciences, Other)

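The languages above correspond to the subset codes used in the statistics and usage sections below. A small lookup table (illustrative only, derived from the subset names listed in this document) can make per-language reports easier to read:

```python
# Mapping from MMMLU subset codes (as used in this document's
# statistics tables) to human-readable language names.
SUBSET_LANGUAGES = {
    "AR_XY": "Arabic",
    "BN_BD": "Bengali",
    "DE_DE": "German",
    "ES_LA": "Spanish",
    "FR_FR": "French",
    "HI_IN": "Hindi",
    "ID_ID": "Indonesian",
    "IT_IT": "Italian",
    "JA_JP": "Japanese",
    "KO_KR": "Korean",
    "PT_BR": "Portuguese",
    "SW_KE": "Swahili",
    "YO_NG": "Yoruba",
    "ZH_CN": "Chinese",
}

print(len(SUBSET_LANGUAGES))  # 14 languages
```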
## Key Features

- Multilingual translation of the full MMLU benchmark
- 14 typologically diverse languages covering major language families
- Tests cross-lingual knowledge transfer and multilingual reasoning
- Same subject coverage as the original MMLU (57 subjects)
- Includes low-resource languages (e.g., Swahili, Yoruba)

## Evaluation Notes

- The default configuration uses **0-shot** evaluation on the test split only
- Use `subset_list` to evaluate specific languages (e.g., `['ZH_CN', 'JA_JP', 'FR_FR']`)
- Results are grouped by language subset
- Cross-lingual performance comparison is supported
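Because results are grouped by language subset, cross-lingual comparison is a matter of aggregating the per-subset accuracies. A minimal sketch (the accuracy numbers below are made-up placeholders, not real results):

```python
# Hypothetical per-language accuracies -- placeholder values only,
# used to illustrate a simple cross-lingual comparison.
acc_by_language = {"ZH_CN": 0.71, "JA_JP": 0.69, "FR_FR": 0.74, "SW_KE": 0.52}

# Mean accuracy across languages and the gap between best and worst,
# a rough indicator of cross-lingual transfer quality.
mean_acc = sum(acc_by_language.values()) / len(acc_by_language)
spread = max(acc_by_language.values()) - min(acc_by_language.values())
print(f"mean={mean_acc:.3f} spread={spread:.3f}")
```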

## Properties

| Property | Value |
|----------|-------|
| **Benchmark Name** | `mmmlu` |
| **Dataset ID** | [openai-mirror/MMMLU](https://modelscope.cn/datasets/openai-mirror/MMMLU/summary) |
| **Paper** | N/A |
| **Tags** | `Knowledge`, `MCQ`, `MultiLingual` |
| **Metrics** | `acc` |
| **Default Shots** | 0-shot |
| **Evaluation Split** | `test` |

## Data Statistics

| Metric | Value |
|--------|-------|
| Total Samples | 196,588 |
| Prompt Length (Mean) | 624.75 chars |
| Prompt Length (Min/Max) | 136 / 5975 chars |

**Per-Subset Statistics:**

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|--------|---------|---------------------|--------------------|--------------------|
| `AR_XY` | 14,042 | 584.94 | 231 | 4735 |
| `BN_BD` | 14,042 | 654.99 | 247 | 4914 |
| `DE_DE` | 14,042 | 791.64 | 294 | 5657 |
| `ES_LA` | 14,042 | 753.18 | 271 | 5791 |
| `FR_FR` | 14,042 | 777.82 | 278 | 5952 |
| `HI_IN` | 14,042 | 675.02 | 256 | 5379 |
| `ID_ID` | 14,042 | 726.51 | 270 | 5539 |
| `IT_IT` | 14,042 | 761.19 | 277 | 5975 |
| `JA_JP` | 14,042 | 322.79 | 149 | 2064 |
| `KO_KR` | 14,042 | 354.35 | 153 | 2345 |
| `PT_BR` | 14,042 | 706.79 | 258 | 5635 |
| `SW_KE` | 14,042 | 699.08 | 259 | 5566 |
| `YO_NG` | 14,042 | 681.01 | 248 | 5644 |
| `ZH_CN` | 14,042 | 257.15 | 136 | 1495 |

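The totals are internally consistent: each of the 14 language subsets contains 14,042 samples, which accounts for the full sample count.

```python
# Sanity check: 14 language subsets x 14,042 samples per subset
subsets = 14
samples_per_subset = 14_042
total = subsets * samples_per_subset
print(total)  # 196588, matching the Total Samples row above
```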
## Sample Example

**Subset**: `AR_XY`

```json
{
  "input": [
    {
      "id": "e43faf14",
      "content": "أجب على سؤال الاختيار من متعدد التالي. يجب أن يكون السطر الأخير من إجابتك بالتنسيق التالي: 'ANSWER: [LETTER]' (بدون علامات اقتباس) حيث [LETTER] هو أحد الحروف A,B,C,D. فكّر خطوة بخطوة قبل الإجابة.\n\nأوجد درجة امتداد الحقل المحدد Q(sqrt(2)، sqrt(3)، sqrt(18)) على Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
    }
  ],
  "choices": [
    "0",
    "4",
    "2",
    "6"
  ],
  "target": "B",
  "id": 0,
  "group_id": 0,
  "metadata": {
    "subject": "abstract_algebra",
    "language": "AR_XY"
  }
}
```

The Arabic `content` is the standard instruction template (translated into Arabic) followed by the question, roughly: "Find the degree of the field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q."

## Prompt Template

```text
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
```

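Filling the template is plain string formatting. A sketch using the sample above (the placeholder names match the template shown here; evalscope's internal rendering may differ):

```python
# The prompt template from this document, with {letters}, {question},
# and {choices} placeholders.
TEMPLATE = (
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n\n{question}\n\n{choices}"
)

letters = ["A", "B", "C", "D"]
choices = ["0", "4", "2", "6"]
prompt = TEMPLATE.format(
    letters=",".join(letters),
    question="Find the degree of the field extension Q(sqrt(2), sqrt(3), sqrt(18)) over Q.",
    choices="\n".join(f"{letter}) {choice}" for letter, choice in zip(letters, choices)),
)
print(prompt.splitlines()[-1])  # D) 6
```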
## Usage

### Using CLI

```bash
evalscope eval \
    --model YOUR_MODEL \
    --api-url OPENAI_API_COMPAT_URL \
    --api-key EMPTY_TOKEN \
    --datasets mmmlu \
    --limit 10  # Remove this line for formal evaluation
```

### Using Python

```python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmmlu'],
    dataset_args={
        'mmmlu': {
            # 'subset_list': ['AR_XY', 'BN_BD', 'DE_DE'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
```