# LongBench-v2

## Overview

LongBench v2 is a challenging benchmark for evaluating long-context understanding in large language models. It covers a wide variety of real-world tasks that require reading and comprehending long documents (ranging from a few thousand to over 2 million tokens), spanning multiple domains such as single-document QA, multi-document QA, long in-context learning, long structured data understanding, and code repository understanding.

## Task Description

- **Task Type**: Long-Context Multiple-Choice Question Answering
- **Input**: Long document context + a multiple-choice question with four answer choices (A, B, C, D)
- **Output**: The single correct answer letter
- **Domains**: Single-Doc QA, Multi-Doc QA, Long In-Context Learning, Long Structured Data Understanding, Code Repo Understanding
- **Difficulty**: Easy / Hard
- **Length**: Short / Medium / Long

## Key Features

- 503 high-quality questions requiring genuine long-document understanding
- Context lengths ranging from a few thousand tokens to over 2 million tokens
- Questions are bilingual (English and Chinese)
- Designed to require careful reading; correct answers cannot be reliably guessed without the document
- Covers diverse real-world application scenarios

## Evaluation Notes

- The default configuration uses **0-shot** evaluation (the `train` split serves as the test set)
- Primary metric: **Accuracy** (exact match on the letter choice)
- All four answer choices are required; no random shuffling is applied
- Samples are split into **3 subsets by context length**: `short`, `medium`, `long`
- Use `subset_list` to evaluate specific length subsets (e.g., `['short', 'medium']`)
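The exact-match scoring described above can be sketched in a few lines. This is an illustrative helper, not evalscope's actual scoring code; it assumes the model ends its response with the `ANSWER: [LETTER]` line that the prompt template requests:

```python
import re


def extract_letter(response):
    """Pull the final 'ANSWER: X' letter from a model response, or None."""
    matches = re.findall(r"ANSWER:\s*([ABCD])", response)
    return matches[-1] if matches else None


def accuracy(responses, targets):
    """Exact match on the extracted letter choice."""
    correct = sum(extract_letter(r) == t for r, t in zip(responses, targets))
    return correct / len(targets)
```

For example, `accuracy(["...step by step...\nANSWER: D"], ["D"])` yields `1.0`; a response with no parseable letter simply counts as incorrect.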

## Properties

| Property | Value |
|----------|-------|
| **Benchmark Name** | `longbench_v2` |
| **Dataset ID** | [ZhipuAI/LongBench-v2](https://modelscope.cn/datasets/ZhipuAI/LongBench-v2/summary) |
| **Paper** | N/A |
| **Tags** | `LongContext`, `MCQ`, `ReadingComprehension` |
| **Metrics** | `acc` |
| **Default Shots** | 0-shot |
| **Evaluation Split** | `train` |

## Data Statistics

| Metric | Value |
|--------|-------|
| Total Samples | 503 |
| Prompt Length (Mean) | 872,928.83 chars |
| Prompt Length (Min/Max) | 49,433 / 16,184,015 chars |

**Per-Subset Statistics:**

| Subset | Samples | Prompt Mean (chars) | Prompt Min (chars) | Prompt Max (chars) |
|--------|---------|---------------------|--------------------|--------------------|
| `short` | 180 | 124,200.42 | 49,433 | 841,252 |
| `medium` | 215 | 501,002.72 | 172,108 | 2,233,351 |
| `long` | 108 | 2,861,217.94 | 720,823 | 16,184,015 |
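Figures like the per-subset table above can be recomputed from the formatted samples. A minimal sketch, assuming samples shaped like the example below (an `input` message list and a `subset_key` field):

```python
from collections import defaultdict


def subset_stats(samples):
    """Group samples by subset_key and summarize prompt length in chars."""
    lengths = defaultdict(list)
    for s in samples:
        prompt = "".join(m["content"] for m in s["input"])
        lengths[s["subset_key"]].append(len(prompt))
    return {
        k: {"samples": len(v), "mean": sum(v) / len(v),
            "min": min(v), "max": max(v)}
        for k, v in lengths.items()
    }
```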

## Sample Example

**Subset**: `short`

```json
{
  "input": [
    {
      "id": "7e9a926f",
      "content": "Please read the following text and answer the questions below.\n\n<text>\nContents\nPreface.\n................................................................................................ 67\nI. China’s Court System and Reform Process.\n......... ... [TRUNCATED 163697 chars] ... accelerate the construction of intelligent courts.\nC) Improve the work ability of office staff and strengthen the reserve of work knowledge.\nD) Use advanced information systems to improve the level of information technology in case handling."
    }
  ],
  "choices": [
    "Through technology empowerment, change the way of working and improve office efficiency.",
    "Establish new types of courts, such as intellectual property courts, financial courts, and Internet courts, and accelerate the construction of intelligent courts.",
    "Improve the work ability of office staff and strengthen the reserve of work knowledge.",
    "Use advanced information systems to improve the level of information technology in case handling."
  ],
  "target": "D",
  "id": 0,
  "group_id": 0,
  "subset_key": "short",
  "metadata": {
    "domain": "Single-Document QA",
    "sub_domain": "Financial",
    "difficulty": "easy",
    "length": "short",
    "context": "Contents\nPreface.\n................................................................................................ 67\nI. China’s Court System and Reform Process.\n.................................... 68\nII. Fully Implementing the Judicial Acco ... [TRUNCATED 162872 chars] ... a better environment for socialist rule of \nlaw, advance the judicial civilization to a higher level, and strive to make the \npeople obtain fair and just outcomes in every judicial case.\n法院的司法改革(2013-2018).indd 161\n2019/03/01,星期五 17:42:05",
    "_id": "66f36490821e116aacb2cc22"
  }
}
```

## Prompt Template

```text
Please read the following text and answer the questions below.

<text>
{document}
</text>

Answer the following multiple choice question. The last line of your response should be of the following format: 'ANSWER: [LETTER]' (without quotes) where [LETTER] is one of {letters}. Think step by step before answering.

{question}

{choices}
```
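Rendering the template is plain string formatting. A minimal sketch with a hypothetical `build_prompt` helper; the `A) ...` choice labelling follows the sample example above, and the exact rendering of `{letters}` is an assumption:

```python
TEMPLATE = (
    "Please read the following text and answer the questions below.\n"
    "\n"
    "<text>\n{document}\n</text>\n"
    "\n"
    "Answer the following multiple choice question. The last line of your "
    "response should be of the following format: 'ANSWER: [LETTER]' (without "
    "quotes) where [LETTER] is one of {letters}. Think step by step before "
    "answering.\n"
    "\n"
    "{question}\n"
    "\n"
    "{choices}"
)


def build_prompt(document, question, choices):
    """Render one sample's prompt; choices are labelled A), B), ..."""
    letters = [chr(ord("A") + i) for i in range(len(choices))]
    choice_block = "\n".join(f"{l}) {c}" for l, c in zip(letters, choices))
    return TEMPLATE.format(
        document=document,
        letters="".join(letters),
        question=question,
        choices=choice_block,
    )
```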
| 112 | + |
| 113 | +## Usage |
| 114 | + |
| 115 | +### Using CLI |
| 116 | + |
| 117 | +```bash |
| 118 | +evalscope eval \ |
| 119 | + --model YOUR_MODEL \ |
| 120 | + --api-url OPENAI_API_COMPAT_URL \ |
| 121 | + --api-key EMPTY_TOKEN \ |
| 122 | + --datasets longbench_v2 \ |
| 123 | + --limit 10 # Remove this line for formal evaluation |
| 124 | +``` |
| 125 | + |
| 126 | +### Using Python |
| 127 | + |
| 128 | +```python |
| 129 | +from evalscope import run_task |
| 130 | +from evalscope.config import TaskConfig |
| 131 | + |
| 132 | +task_cfg = TaskConfig( |
| 133 | + model='YOUR_MODEL', |
| 134 | + api_url='OPENAI_API_COMPAT_URL', |
| 135 | + api_key='EMPTY_TOKEN', |
| 136 | + datasets=['longbench_v2'], |
| 137 | + dataset_args={ |
| 138 | + 'longbench_v2': { |
| 139 | + # subset_list: ['short', 'medium', 'long'] # optional, evaluate specific subsets |
| 140 | + } |
| 141 | + }, |
| 142 | + limit=10, # Remove this line for formal evaluation |
| 143 | +) |
| 144 | + |
| 145 | +run_task(task_cfg=task_cfg) |
| 146 | +``` |
| 147 | + |
| 148 | + |
0 commit comments