
Commit ade91fa

[Benchmark] Add longbench_v2 (#1237)
* update
* update
1 parent 6930cbc commit ade91fa

File tree

9 files changed: +526 −2 lines changed


docs/en/benchmarks/longbench_v2.md

Lines changed: 148 additions & 0 deletions
@@ -0,0 +1,148 @@
# LongBench-v2

## Overview

LongBench v2 is a challenging benchmark for evaluating long-context understanding of large language models. It covers a wide variety of real-world tasks that require reading and comprehending long documents (ranging from a few thousand to over 2 million tokens), spanning multiple domains such as single-document QA, multi-document QA, long in-context learning, long-structured data understanding, and code repository understanding.

## Task Description

- **Task Type**: Long-Context Multiple-Choice Question Answering
- **Input**: Long document context + multiple-choice question with four answer choices (A, B, C, D)
- **Output**: Single correct answer letter
- **Domains**: Single-Doc QA, Multi-Doc QA, Long In-Context Learning, Long Structured Data Understanding, Code Repo Understanding
- **Difficulty**: Easy / Hard
- **Length**: Short / Medium / Long

## Key Features

- 503 high-quality questions requiring genuine long-document understanding
- Context lengths ranging from a few thousand tokens to over 2 million tokens
- Questions are bilingual (English and Chinese)
- Designed to require careful reading; correct answers cannot be guessed without reading the document
- Covers diverse real-world application scenarios

## Evaluation Notes

- The default configuration uses **0-shot** evaluation (the `train` split serves as the test set)
- Primary metric: **Accuracy** (exact match on the answer letter)
- All four answer choices are always present; no shuffling of choice order is applied
- Samples are split into **3 subsets by context length**: `short`, `medium`, `long`
- Use `subset_list` to evaluate specific length subsets (e.g., `['short', 'medium']`)

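Since the report aggregates an exact-match accuracy over the three length subsets, the arithmetic can be sketched as follows (the tuple layout here is purely illustrative, not evalscope's internal record format):

```python
from collections import defaultdict

def accuracy_by_subset(records):
    """Compute overall and per-subset exact-match accuracy.

    Each record is an illustrative (subset, target_letter, predicted_letter)
    tuple; evalscope's actual record format may differ.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for subset, target, pred in records:
        totals[subset] += 1
        hits[subset] += int(pred == target)
    per_subset = {s: hits[s] / totals[s] for s in totals}
    overall = sum(hits.values()) / sum(totals.values())
    return overall, per_subset

records = [
    ("short", "A", "A"),
    ("short", "B", "C"),
    ("medium", "D", "D"),
    ("long", "C", "C"),
]
overall, per_subset = accuracy_by_subset(records)
print(overall)              # 0.75
print(per_subset["short"])  # 0.5
```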
## Properties

| Property | Value |
|----------|-------|
| **Benchmark Name** | `longbench_v2` |
| **Dataset ID** | [ZhipuAI/LongBench-v2](https://modelscope.cn/datasets/ZhipuAI/LongBench-v2/summary) |
| **Paper** | N/A |
| **Tags** | `LongContext`, `MCQ`, `ReadingComprehension` |
| **Metrics** | `acc` |
| **Default Shots** | 0-shot |
| **Evaluation Split** | `train` |

## Data Statistics

| Metric | Value |
|--------|-------|
| Total Samples | 503 |
| Prompt Length (Mean) | 872,928.83 chars |
| Prompt Length (Min/Max) | 49,433 / 16,184,015 chars |

**Per-Subset Statistics:**

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|--------|---------|---------------------|--------------------|--------------------|
| `short` | 180 | 124,200.42 | 49,433 | 841,252 |
| `medium` | 215 | 501,002.72 | 172,108 | 2,233,351 |
| `long` | 108 | 2,861,217.94 | 720,823 | 16,184,015 |

## Sample Example

**Subset**: `short`

```json
{
  "input": [
    {
      "id": "7e9a926f",
      "content": "Please read the following text and answer the questions below.\n\n<text>\nContents\nPreface.\n................................................................................................ 67\nI. China’s Court System and Reform Process.\n......... ... [TRUNCATED 163697 chars] ... accelerate the construction of intelligent courts.\nC) Improve the work ability of office staff and strengthen the reserve of work knowledge.\nD) Use advanced information systems to improve the level of information technology in case handling."
    }
  ],
  "choices": [
    "Through technology empowerment, change the way of working and improve office efficiency.",
    "Establish new types of courts, such as intellectual property courts, financial courts, and Internet courts, and accelerate the construction of intelligent courts.",
    "Improve the work ability of office staff and strengthen the reserve of work knowledge.",
    "Use advanced information systems to improve the level of information technology in case handling."
  ],
  "target": "D",
  "id": 0,
  "group_id": 0,
  "subset_key": "short",
  "metadata": {
    "domain": "Single-Document QA",
    "sub_domain": "Financial",
    "difficulty": "easy",
    "length": "short",
    "context": "Contents\nPreface.\n................................................................................................ 67\nI. China’s Court System and Reform Process.\n.................................... 68\nII. Fully Implementing the Judicial Acco ... [TRUNCATED 162872 chars] ... a better environment for socialist rule of \nlaw, advance the judicial civilization to a higher level, and strive to make the \npeople obtain fair and just outcomes in every judicial case.\n法院的司法改革(2013-2018).indd 161\n2019/03/01,星期五 17:42:05",
    "_id": "66f36490821e116aacb2cc22"
  }
}
```
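To make the record above concrete: the `choices` list maps positionally onto the letters A–D, and `target` names the letter of the correct choice. A minimal sketch of that mapping (illustrative helpers, not evalscope's code):

```python
import string

def lettered_options(choices):
    """Render a choices list as 'A) ...' lines, one letter per option."""
    return "\n".join(f"{letter}) {text}"
                     for letter, text in zip(string.ascii_uppercase, choices))

def resolve_target(choices, target_letter):
    """Map a target letter like 'D' back to the choice text it names."""
    return choices[string.ascii_uppercase.index(target_letter)]

choices = [
    "Through technology empowerment, change the way of working and improve office efficiency.",
    "Establish new types of courts, such as intellectual property courts, financial courts, and Internet courts, and accelerate the construction of intelligent courts.",
    "Improve the work ability of office staff and strengthen the reserve of work knowledge.",
    "Use advanced information systems to improve the level of information technology in case handling.",
]

options_block = lettered_options(choices)   # "A) Through ... D) Use ..."
answer_text = resolve_target(choices, "D")  # text of the correct choice
```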
## Prompt Template

```text
Please read the following text and answer the questions below.

<text>
{document}
</text>

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
```
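The template instructs the model to end its response with a line of the form `ANSWER: [LETTER]`. One plausible way to pull that letter back out of a free-form response (a sketch, not necessarily the exact parser evalscope uses):

```python
import re

def extract_answer(response):
    """Return the letter from the last 'ANSWER: X' occurrence, or None."""
    matches = re.findall(r"ANSWER:\s*([A-D])", response)
    return matches[-1] if matches else None

response = (
    "The document discusses court informatization reforms...\n"
    "Thinking step by step, option D matches the passage best.\n"
    "ANSWER: D"
)
print(extract_answer(response))  # D
```

Taking the last match tolerates models that mention `ANSWER:` mid-reasoning before committing to a final line.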
## Usage

### Using CLI

```bash
evalscope eval \
  --model YOUR_MODEL \
  --api-url OPENAI_API_COMPAT_URL \
  --api-key EMPTY_TOKEN \
  --datasets longbench_v2 \
  --limit 10  # Remove this line for formal evaluation
```
### Using Python

```python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['longbench_v2'],
    dataset_args={
        'longbench_v2': {
            # 'subset_list': ['short', 'medium', 'long'],  # optional: evaluate specific length subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
```

docs/en/get_started/supported_dataset/llm.md

Lines changed: 2 additions & 0 deletions
@@ -61,6 +61,7 @@ Below is the list of supported LLM benchmarks. Click on a benchmark name for det
 | `jnlpba_rare` | [JNLPBA-Rare](../../benchmarks/jnlpba_rare.md) | `Knowledge`, `NER` |
 | `live_code_bench` | [Live-Code-Bench](../../benchmarks/live_code_bench.md) | `Coding` |
 | `logi_qa` | [LogiQA](../../benchmarks/logi_qa.md) | `MCQ`, `Reasoning` |
+| `longbench_v2` | [LongBench-v2](../../benchmarks/longbench_v2.md) | `LongContext`, `MCQ`, `ReadingComprehension` |
 | `maritime_bench` | [MaritimeBench](../../benchmarks/maritime_bench.md) | `Chinese`, `Knowledge`, `MCQ` |
 | `math_500` | [MATH-500](../../benchmarks/math_500.md) | `Math`, `Reasoning` |
 | `math_qa` | [MathQA](../../benchmarks/math_qa.md) | `MCQ`, `Math`, `Reasoning` |
@@ -172,6 +173,7 @@
 ../../benchmarks/jnlpba_rare.md
 ../../benchmarks/live_code_bench.md
 ../../benchmarks/logi_qa.md
+../../benchmarks/longbench_v2.md
 ../../benchmarks/maritime_bench.md
 ../../benchmarks/math_500.md
 ../../benchmarks/math_qa.md

docs/zh/benchmarks/longbench_v2.md

Lines changed: 144 additions & 0 deletions
@@ -0,0 +1,144 @@
# LongBench-v2

## Overview

LongBench v2 is a challenging benchmark for evaluating the long-context understanding of large language models. It covers a variety of real-world tasks that require reading and comprehending long documents (ranging from a few thousand to over 2 million tokens), spanning domains including single-document QA, multi-document QA, long in-context learning, long structured data understanding, and code repository understanding.

## Task Description

- **Task Type**: Long-context multiple-choice question answering
- **Input**: Long document context + a multiple-choice question with four options (A, B, C, D)
- **Output**: The single correct answer letter
- **Domains**: Single-document QA, multi-document QA, long in-context learning, long structured data understanding, code repository understanding
- **Difficulty**: Easy / Hard
- **Length**: Short / Medium / Long

## Key Features

- 503 high-quality questions that require genuine understanding of long documents
- Context lengths ranging from a few thousand to over 2 million tokens
- Bilingual questions (English and Chinese)
- Designed to require careful reading; the correct answer cannot be guessed without reading the document
- Covers diverse real-world application scenarios

## Evaluation Notes

- The default configuration uses **0-shot** evaluation (the `train` split serves as the test set)
- Primary metric: **Accuracy** (exact match on the answer letter)
- All four answer choices are always present; no shuffling of choice order is applied
- Samples are split into **3 subsets by context length**: `short`, `medium`, `long`
- Use `subset_list` to evaluate specific length subsets (e.g., `['short', 'medium']`)

## Properties

| Property | Value |
|----------|-------|
| **Benchmark Name** | `longbench_v2` |
| **Dataset ID** | [ZhipuAI/LongBench-v2](https://modelscope.cn/datasets/ZhipuAI/LongBench-v2/summary) |
| **Paper** | N/A |
| **Tags** | `LongContext`, `MCQ`, `ReadingComprehension` |
| **Metrics** | `acc` |
| **Default Shots** | 0-shot |
| **Evaluation Split** | `train` |

## Data Statistics

| Metric | Value |
|--------|-------|
| Total Samples | 503 |
| Prompt Length (Mean) | 872,928.83 chars |
| Prompt Length (Min/Max) | 49,433 / 16,184,015 chars |

**Per-Subset Statistics:**

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|--------|---------|---------------------|--------------------|--------------------|
| `short` | 180 | 124,200.42 | 49,433 | 841,252 |
| `medium` | 215 | 501,002.72 | 172,108 | 2,233,351 |
| `long` | 108 | 2,861,217.94 | 720,823 | 16,184,015 |

## Sample Example

**Subset**: `short`

```json
{
  "input": [
    {
      "id": "7e9a926f",
      "content": "Please read the following text and answer the questions below.\n\n<text>\nContents\nPreface.\n................................................................................................ 67\nI. China’s Court System and Reform Process.\n......... ... [TRUNCATED 163697 chars] ... accelerate the construction of intelligent courts.\nC) Improve the work ability of office staff and strengthen the reserve of work knowledge.\nD) Use advanced information systems to improve the level of information technology in case handling."
    }
  ],
  "choices": [
    "Through technology empowerment, change the way of working and improve office efficiency.",
    "Establish new types of courts, such as intellectual property courts, financial courts, and Internet courts, and accelerate the construction of intelligent courts.",
    "Improve the work ability of office staff and strengthen the reserve of work knowledge.",
    "Use advanced information systems to improve the level of information technology in case handling."
  ],
  "target": "D",
  "id": 0,
  "group_id": 0,
  "subset_key": "short",
  "metadata": {
    "domain": "Single-Document QA",
    "sub_domain": "Financial",
    "difficulty": "easy",
    "length": "short",
    "context": "Contents\nPreface.\n................................................................................................ 67\nI. China’s Court System and Reform Process.\n.................................... 68\nII. Fully Implementing the Judicial Acco ... [TRUNCATED 162872 chars] ... a better environment for socialist rule of \nlaw, advance the judicial civilization to a higher level, and strive to make the \npeople obtain fair and just outcomes in every judicial case.\n法院的司法改革(2013-2018).indd 161\n2019/03/01,星期五 17:42:05",
    "_id": "66f36490821e116aacb2cc22"
  }
}
```
## Prompt Template

```text
Please read the following text and answer the questions below.

<text>
{document}
</text>

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
```
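The `{document}`, `{question}`, `{choices}`, and `{letters}` placeholders in the template above are filled per sample; a sketch of that assembly (the helper below is hypothetical, not evalscope's implementation):

```python
TEMPLATE = """Please read the following text and answer the questions below.

<text>
{document}
</text>

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}"""

def build_prompt(document, question, choices):
    """Fill the template: choices become 'A) ...' lines, letters 'A, B, C, D'."""
    lettered = "\n".join(f"{chr(ord('A') + i)}) {c}" for i, c in enumerate(choices))
    letters = ", ".join(chr(ord('A') + i) for i in range(len(choices)))
    return TEMPLATE.format(document=document, question=question,
                           choices=lettered, letters=letters)

prompt = build_prompt("Some very long document...",
                      "What does the document recommend?",
                      ["Option one.", "Option two.", "Option three.", "Option four."])
```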
## Usage

### Using the CLI

```bash
evalscope eval \
  --model YOUR_MODEL \
  --api-url OPENAI_API_COMPAT_URL \
  --api-key EMPTY_TOKEN \
  --datasets longbench_v2 \
  --limit 10  # Remove this line for formal evaluation
```
### Using Python

```python
from evalscope import run_task
from evalscope.config import TaskConfig

task_cfg = TaskConfig(
    model='YOUR_MODEL',
    api_url='OPENAI_API_COMPAT_URL',
    api_key='EMPTY_TOKEN',
    datasets=['longbench_v2'],
    dataset_args={
        'longbench_v2': {
            # 'subset_list': ['short', 'medium', 'long'],  # optional: evaluate specific subsets
        }
    },
    limit=10,  # Remove this line for formal evaluation
)

run_task(task_cfg=task_cfg)
```

docs/zh/get_started/supported_dataset/llm.md

Lines changed: 2 additions & 0 deletions
@@ -61,6 +61,7 @@
 | `jnlpba_rare` | [JNLPBA-Rare](../../benchmarks/jnlpba_rare.md) | `Knowledge`, `NER` |
 | `live_code_bench` | [Live-Code-Bench](../../benchmarks/live_code_bench.md) | `Coding` |
 | `logi_qa` | [LogiQA](../../benchmarks/logi_qa.md) | `MCQ`, `Reasoning` |
+| `longbench_v2` | [LongBench-v2](../../benchmarks/longbench_v2.md) | `LongContext`, `MCQ`, `ReadingComprehension` |
 | `maritime_bench` | [MaritimeBench](../../benchmarks/maritime_bench.md) | `Chinese`, `Knowledge`, `MCQ` |
 | `math_500` | [MATH-500](../../benchmarks/math_500.md) | `Math`, `Reasoning` |
 | `math_qa` | [MathQA](../../benchmarks/math_qa.md) | `MCQ`, `Math`, `Reasoning` |
@@ -172,6 +173,7 @@
 ../../benchmarks/jnlpba_rare.md
 ../../benchmarks/live_code_bench.md
 ../../benchmarks/logi_qa.md
+../../benchmarks/longbench_v2.md
 ../../benchmarks/maritime_bench.md
 ../../benchmarks/math_500.md
 ../../benchmarks/math_qa.md
