
Commit 57e201d

add mmmlu (#1235)

1 parent eca452f commit 57e201d

File tree

9 files changed: +732, -0 lines changed

docs/en/benchmarks/mmmlu.md

Lines changed: 146 additions & 0 deletions
@@ -0,0 +1,146 @@
# MMMLU

## Overview

MMMLU (Multilingual Massive Multitask Language Understanding) is a multilingual extension of the MMLU benchmark. It evaluates the multilingual knowledge and reasoning capabilities of language models across 14 languages, covering the 57 subjects of the original MMLU benchmark.

## Task Description

- **Task Type**: Multilingual multiple-choice question answering
- **Input**: A question with four answer choices (A, B, C, D) in one of 14 languages
- **Output**: The single correct answer letter
- **Languages**: Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, Chinese
- **Subjects**: 57 subjects from MMLU (STEM, humanities, social sciences, other)

## Key Features

- Multilingual translation of the full MMLU benchmark
- 14 typologically diverse languages covering major language families
- Tests cross-lingual knowledge transfer and multilingual reasoning
- Same subject coverage as the original MMLU (57 subjects)
- Includes low-resource languages (e.g., Swahili, Yoruba)

## Evaluation Notes

- The default configuration uses **0-shot** evaluation (test split only)
- Use `subset_list` to evaluate specific languages (e.g., `['ZH_CN', 'JA_JP', 'FR_FR']`)
- Results are grouped by language subset
- Cross-lingual performance comparison is supported
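The `subset_list` filter above amounts to a plain `dataset_args` mapping. The following sketch builds and validates one; the `mmmlu_dataset_args` helper is illustrative, not part of evalscope itself:

```python
# All 14 MMMLU language subsets, as listed in this document.
VALID_SUBSETS = {
    'AR_XY', 'BN_BD', 'DE_DE', 'ES_LA', 'FR_FR', 'HI_IN', 'ID_ID',
    'IT_IT', 'JA_JP', 'KO_KR', 'PT_BR', 'SW_KE', 'YO_NG', 'ZH_CN',
}

def mmmlu_dataset_args(languages):
    """Build a dataset_args mapping restricted to the given subsets.

    Hypothetical convenience helper: rejects unknown subset codes early
    instead of letting a typo silently evaluate nothing.
    """
    unknown = set(languages) - VALID_SUBSETS
    if unknown:
        raise ValueError(f"unknown MMMLU subsets: {sorted(unknown)}")
    return {'mmmlu': {'subset_list': list(languages)}}

args = mmmlu_dataset_args(['ZH_CN', 'JA_JP', 'FR_FR'])
```

The resulting mapping can be passed as `dataset_args` in a `TaskConfig`, as shown in the Usage section of this page.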
## Properties

| Property | Value |
|----------|-------|
| **Benchmark Name** | `mmmlu` |
| **Dataset ID** | [openai-mirror/MMMLU](https://modelscope.cn/datasets/openai-mirror/MMMLU/summary) |
| **Paper** | N/A |
| **Tags** | `Knowledge`, `MCQ`, `MultiLingual` |
| **Metrics** | `acc` |
| **Default Shots** | 0-shot |
| **Evaluation Split** | `test` |

## Data Statistics

| Metric | Value |
|--------|-------|
| Total Samples | 196,588 |
| Prompt Length (Mean) | 624.75 chars |
| Prompt Length (Min/Max) | 136 / 5975 chars |

**Per-Subset Statistics:**

| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|--------|---------|-------------|------------|------------|
| `AR_XY` | 14,042 | 584.94 | 231 | 4735 |
| `BN_BD` | 14,042 | 654.99 | 247 | 4914 |
| `DE_DE` | 14,042 | 791.64 | 294 | 5657 |
| `ES_LA` | 14,042 | 753.18 | 271 | 5791 |
| `FR_FR` | 14,042 | 777.82 | 278 | 5952 |
| `HI_IN` | 14,042 | 675.02 | 256 | 5379 |
| `ID_ID` | 14,042 | 726.51 | 270 | 5539 |
| `IT_IT` | 14,042 | 761.19 | 277 | 5975 |
| `JA_JP` | 14,042 | 322.79 | 149 | 2064 |
| `KO_KR` | 14,042 | 354.35 | 153 | 2345 |
| `PT_BR` | 14,042 | 706.79 | 258 | 5635 |
| `SW_KE` | 14,042 | 699.08 | 259 | 5566 |
| `YO_NG` | 14,042 | 681.01 | 248 | 5644 |
| `ZH_CN` | 14,042 | 257.15 | 136 | 1495 |
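The headline figures are consistent with the per-subset table: 14 equally sized subsets of 14,042 samples give 196,588 in total, and because every subset has the same sample count, the unweighted mean of the per-subset prompt means reproduces the overall mean. A quick arithmetic check:

```python
# Per-subset mean prompt lengths transcribed from the table above.
SUBSET_PROMPT_MEANS = {
    'AR_XY': 584.94, 'BN_BD': 654.99, 'DE_DE': 791.64, 'ES_LA': 753.18,
    'FR_FR': 777.82, 'HI_IN': 675.02, 'ID_ID': 726.51, 'IT_IT': 761.19,
    'JA_JP': 322.79, 'KO_KR': 354.35, 'PT_BR': 706.79, 'SW_KE': 699.08,
    'YO_NG': 681.01, 'ZH_CN': 257.15,
}
SAMPLES_PER_SUBSET = 14_042

# 14 subsets x 14,042 samples each.
total_samples = SAMPLES_PER_SUBSET * len(SUBSET_PROMPT_MEANS)

# Equal subset sizes: the mean of the per-subset means is the overall mean.
overall_mean = sum(SUBSET_PROMPT_MEANS.values()) / len(SUBSET_PROMPT_MEANS)
```

Note that CJK subsets (`ZH_CN`, `JA_JP`, `KO_KR`) have much shorter character counts, reflecting script density rather than shorter questions.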
## Sample Example

**Subset**: `AR_XY`

```json
{
    "input": [
        {
            "id": "e43faf14",
            "content": "أجب على سؤال الاختيار من متعدد التالي. يجب أن يكون السطر الأخير من إجابتك بالتنسيق التالي: 'ANSWER: [LETTER]' (بدون علامات اقتباس) حيث [LETTER] هو أحد الحروف A,B,C,D. فكّر خطوة بخطوة قبل الإجابة.\n\nأوجد درجة امتداد الحقل المحدد Q(sqrt(2)، sqrt(3)، sqrt(18)) على Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
        }
    ],
    "choices": [
        "0",
        "4",
        "2",
        "6"
    ],
    "target": "B",
    "id": 0,
    "group_id": 0,
    "metadata": {
        "subject": "abstract_algebra",
        "language": "AR_XY"
    }
}
```

## Prompt Template

```text
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
```
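The template requires the model to end its response with an `ANSWER: [LETTER]` line, which makes answer extraction mechanical. A minimal sketch of such an extractor follows; evalscope's actual parsing logic may differ:

```python
import re

# Match "ANSWER: X" where X is one of the four choice letters.
ANSWER_RE = re.compile(r"ANSWER:\s*([A-D])\b")

def extract_answer(response: str):
    """Return the model's final answer letter, or None if absent.

    Takes the last match, so "ANSWER:" strings that appear inside the
    step-by-step reasoning do not shadow the required final line.
    """
    matches = ANSWER_RE.findall(response)
    return matches[-1] if matches else None

reply = "The degree of Q(sqrt(2), sqrt(3), sqrt(18)) over Q is 4.\nANSWER: B"
```

Accuracy (`acc`) is then the fraction of samples whose extracted letter matches the `target` field.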
## Usage

### Using CLI

```bash
evalscope eval \
  --model YOUR_MODEL \
  --api-url OPENAI_API_COMPAT_URL \
  --api-key EMPTY_TOKEN \
  --datasets mmmlu \
  --limit 10  # Remove this line for formal evaluation
```

### Using Python

```python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmmlu'],
    dataset_args={
        'mmmlu': {
            # 'subset_list': ['AR_XY', 'BN_BD', 'DE_DE'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
```

docs/en/get_started/supported_dataset/llm.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -74,6 +74,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name for det
 | `mmlu` | [MMLU](../../benchmarks/mmlu.md) | `Knowledge`, `MCQ` |
 | `mmlu_pro` | [MMLU-Pro](../../benchmarks/mmlu_pro.md) | `Knowledge`, `MCQ` |
 | `mmlu_redux` | [MMLU-Redux](../../benchmarks/mmlu_redux.md) | `Knowledge`, `MCQ` |
+| `mmmlu` | [MMMLU](../../benchmarks/mmmlu.md) | `Knowledge`, `MCQ`, `MultiLingual` |
 | `mri_mcqa` | [MRI-MCQA](../../benchmarks/mri_mcqa.md) | `Knowledge`, `MCQ`, `Medical` |
 | `multi_if` | [Multi-IF](../../benchmarks/multi_if.md) | `InstructionFollowing`, `MultiLingual`, `MultiTurn` |
 | `multi_nerd` | [MultiNERD](../../benchmarks/multi_nerd.md) | `Knowledge`, `NER` |
@@ -184,6 +185,7 @@
 ../../benchmarks/mmlu.md
 ../../benchmarks/mmlu_pro.md
 ../../benchmarks/mmlu_redux.md
+../../benchmarks/mmmlu.md
 ../../benchmarks/mri_mcqa.md
 ../../benchmarks/multi_if.md
 ../../benchmarks/multi_nerd.md
```

docs/zh/benchmarks/mmmlu.md

Lines changed: 142 additions & 0 deletions
@@ -0,0 +1,142 @@
# MMMLU

## Overview

MMMLU (Multilingual Massive Multitask Language Understanding) is a multilingual extension of the MMLU benchmark. It evaluates the multilingual knowledge and reasoning capabilities of language models across 14 languages, covering the 57 subjects of the original MMLU benchmark.

## Task Description

- **Task Type**: Multilingual multiple-choice question answering
- **Input**: A question with four answer choices (A, B, C, D) in one of 14 languages
- **Output**: The single correct answer letter
- **Languages**: Arabic, Bengali, German, Spanish, French, Hindi, Indonesian, Italian, Japanese, Korean, Portuguese, Swahili, Yoruba, Chinese
- **Subjects**: 57 subjects from MMLU (STEM, humanities, social sciences, other)

## Key Features

- Multilingual translation of the full MMLU benchmark
- 14 typologically diverse languages covering major language families
- Tests cross-lingual knowledge transfer and multilingual reasoning
- Same subject coverage as the original MMLU (57 subjects)
- Includes low-resource languages (e.g., Swahili, Yoruba)

## Evaluation Notes

- The default configuration uses **0-shot** evaluation (test split only)
- Use `subset_list` to evaluate specific languages (e.g., `['ZH_CN', 'JA_JP', 'FR_FR']`)
- Results are grouped by language subset
- Cross-lingual performance comparison is supported

## Properties

| Property | Value |
|----------|-------|
| **Benchmark Name** | `mmmlu` |
| **Dataset ID** | [openai-mirror/MMMLU](https://modelscope.cn/datasets/openai-mirror/MMMLU/summary) |
| **Paper** | N/A |
| **Tags** | `Knowledge`, `MCQ`, `MultiLingual` |
| **Metrics** | `acc` |
| **Default Shots** | 0-shot |
| **Evaluation Split** | `test` |

## Data Statistics

| Metric | Value |
|--------|-------|
| Total Samples | 196,588 |
| Prompt Length (Mean) | 624.75 chars |
| Prompt Length (Min/Max) | 136 / 5975 chars |

**Per-Subset Statistics:**

| Subset | Samples | Prompt Mean | Prompt Min | Prompt Max |
|--------|---------|-------------|------------|------------|
| `AR_XY` | 14,042 | 584.94 | 231 | 4735 |
| `BN_BD` | 14,042 | 654.99 | 247 | 4914 |
| `DE_DE` | 14,042 | 791.64 | 294 | 5657 |
| `ES_LA` | 14,042 | 753.18 | 271 | 5791 |
| `FR_FR` | 14,042 | 777.82 | 278 | 5952 |
| `HI_IN` | 14,042 | 675.02 | 256 | 5379 |
| `ID_ID` | 14,042 | 726.51 | 270 | 5539 |
| `IT_IT` | 14,042 | 761.19 | 277 | 5975 |
| `JA_JP` | 14,042 | 322.79 | 149 | 2064 |
| `KO_KR` | 14,042 | 354.35 | 153 | 2345 |
| `PT_BR` | 14,042 | 706.79 | 258 | 5635 |
| `SW_KE` | 14,042 | 699.08 | 259 | 5566 |
| `YO_NG` | 14,042 | 681.01 | 248 | 5644 |
| `ZH_CN` | 14,042 | 257.15 | 136 | 1495 |

## Sample Example

**Subset**: `AR_XY`

```json
{
    "input": [
        {
            "id": "e43faf14",
            "content": "أجب على سؤال الاختيار من متعدد التالي. يجب أن يكون السطر الأخير من إجابتك بالتنسيق التالي: 'ANSWER: [LETTER]' (بدون علامات اقتباس) حيث [LETTER] هو أحد الحروف A,B,C,D. فكّر خطوة بخطوة قبل الإجابة.\n\nأوجد درجة امتداد الحقل المحدد Q(sqrt(2)، sqrt(3)، sqrt(18)) على Q.\n\nA) 0\nB) 4\nC) 2\nD) 6"
        }
    ],
    "choices": [
        "0",
        "4",
        "2",
        "6"
    ],
    "target": "B",
    "id": 0,
    "group_id": 0,
    "metadata": {
        "subject": "abstract_algebra",
        "language": "AR_XY"
    }
}
```

## Prompt Template

```text
Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
```

## Usage

### Using CLI

```bash
evalscope eval \
  --model YOUR_MODEL \
  --api-url OPENAI_API_COMPAT_URL \
  --api-key EMPTY_TOKEN \
  --datasets mmmlu \
  --limit 10  # Remove this line for formal evaluation
```

### Using Python

```python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['mmmlu'],
    dataset_args={
        'mmmlu': {
            # 'subset_list': ['AR_XY', 'BN_BD', 'DE_DE'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
```

docs/zh/get_started/supported_dataset/llm.md

Lines changed: 2 additions & 0 deletions
```diff
@@ -74,6 +74,7 @@
 | `mmlu` | [MMLU](../../benchmarks/mmlu.md) | `Knowledge`, `MCQ` |
 | `mmlu_pro` | [MMLU-Pro](../../benchmarks/mmlu_pro.md) | `Knowledge`, `MCQ` |
 | `mmlu_redux` | [MMLU-Redux](../../benchmarks/mmlu_redux.md) | `Knowledge`, `MCQ` |
+| `mmmlu` | [MMMLU](../../benchmarks/mmmlu.md) | `Knowledge`, `MCQ`, `MultiLingual` |
 | `mri_mcqa` | [MRI-MCQA](../../benchmarks/mri_mcqa.md) | `Knowledge`, `MCQ`, `Medical` |
 | `multi_if` | [Multi-IF](../../benchmarks/multi_if.md) | `InstructionFollowing`, `MultiLingual`, `MultiTurn` |
 | `multi_nerd` | [MultiNERD](../../benchmarks/multi_nerd.md) | `Knowledge`, `NER` |
@@ -184,6 +185,7 @@
 ../../benchmarks/mmlu.md
 ../../benchmarks/mmlu_pro.md
 ../../benchmarks/mmlu_redux.md
+../../benchmarks/mmmlu.md
 ../../benchmarks/mri_mcqa.md
 ../../benchmarks/multi_if.md
 ../../benchmarks/multi_nerd.md
```

0 commit comments

Comments
 (0)