Commit 655718d

jannalulu and baberabb authored
Longbench v2 (#3338)
* initial commit
* change to acc
* fix long-dialogue tasks
* fix versioning
* more fixes
* fix naming
* fix naming
* more renaming
* maybe a dataset fix
* fix dataset and use new dataset schema
* add README
* fix prompt and dataset naming
* lint
* remove utils.py
* lint
* more linting
* fix typo
* fix naming
* add longbenchv2

---------

Co-authored-by: Baber <[email protected]>
1 parent 8efef8f commit 655718d

29 files changed (+470, -186 lines)

lm_eval/tasks/README.md

Lines changed: 187 additions & 186 deletions
Large diffs are not rendered by default.

lm_eval/tasks/longbench2/README.md

Lines changed: 81 additions & 0 deletions
# LongBench v2


### Paper

Title: `LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks`

Abstract: `This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.`

Homepage: `https://github.com/THUDM/LongBench`


### Citation

```
@article{bai2024longbench2,
    title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks},
    author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
    journal={arXiv preprint arXiv:2412.15204},
    year={2024}
}
```

### Groups, Tags, and Tasks

#### Groups

* `longbench2_single`: Single-document QA tasks requiring comprehension of documents across various domains (government, legal, literature, finance, academic, detective stories, and order of events)
* `longbench2_multi`: Multi-document QA tasks requiring information synthesis and reasoning across multiple documents in government, academic, finance, and news
* `longbench2_incontext`: Long in-context learning tasks including user guide comprehension, translation with examples, and many-shot learning scenarios
* `longbench2_history`: Long-dialogue history understanding tasks involving agent conversations and dialogue history comprehension
* `longbench2_structured`: Long structured data understanding tasks for graph and table data processing

#### Tags

* `longbench2`: Run the full benchmark with 503 multiple-choice questions (8k-2M words) testing understanding and reasoning on long-context tasks
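
As a quick usage pointer, the tag can be passed to the harness directly. Below is a minimal sketch using the lm-evaluation-harness Python API; the model name is only a placeholder, and a model with a sufficiently long context window is needed for meaningful scores.

```python
# Minimal sketch: run the full LongBench v2 tag through the Python API.
# "my-org/my-long-context-model" is a placeholder, not a real checkpoint.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=my-org/my-long-context-model",
    tasks=["longbench2"],  # the tag defined by this suite
    batch_size=1,
)
print(results["results"])
```

The equivalent CLI invocation is `lm_eval --model hf --model_args pretrained=<model> --tasks longbench2`.
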
#### Tasks

**Single-Document QA:**
* `longbench2_govt_single`: Question answering from single government documents
* `longbench2_legal_single`: Question answering from single legal documents
* `longbench2_lit_single`: Question answering from single literature/literary documents
* `longbench2_fin_single`: Question answering from single financial documents
* `longbench2_academic_single`: Question answering from single academic papers and research documents
* `longbench2_detective`: Question answering from detective stories requiring logical reasoning
* `longbench2_event_order`: Temporal reasoning tasks about event ordering in narratives

**Multi-Document QA:**
* `longbench2_govt_multi`: Question answering across multiple government documents
* `longbench2_academic_multi`: Question answering across multiple academic papers
* `longbench2_fin_multi`: Question answering across multiple financial documents
* `longbench2_news_multi`: Question answering across multiple news articles

**Long In-context Learning:**
* `longbench2_user_guide`: Comprehension and application of user guide instructions
* `longbench2_translate`: Translation tasks in new languages with long examples
* `longbench2_many_shot`: Many-shot in-context learning with a large number of examples provided in context

**Long-dialogue History Understanding:**
* `longbench2_agent_history`: Understanding and reasoning over extended agent conversation histories
* `longbench2_dialogue_history`: Understanding and reasoning over long dialogue exchanges

**Code Repository Understanding:**
* `longbench2_code`: Question answering on code repositories requiring codebase comprehension

**Long Structured Data Understanding:**
* `longbench2_graph`: Understanding and reasoning over graph-structured data
* `longbench2_table`: Understanding and reasoning over tabular data

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
* [x] Have you referenced the original paper that introduced the task?
* [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?
Lines changed: 13 additions & 0 deletions
group: longbench2
task:
  - longbench2_history
  - longbench2_incontext
  - longbench2_multi
  - longbench2_single
  - longbench2_structured
  - longbench2_code
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
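
For intuition about the aggregation: with `weight_by_size: True`, the group accuracy is the mean of the subtask accuracies weighted by their number of examples (pooled over documents) rather than a plain average of subtask scores. A toy illustration with invented sizes and accuracies:

```python
# Toy illustration of size-weighted aggregation; the accuracies and example
# counts below are made up, only the weighting scheme is the point.
subgroups = {
    "longbench2_single": (0.42, 180),  # (accuracy, number of examples)
    "longbench2_multi": (0.38, 120),
    "longbench2_code": (0.45, 60),
}

total = sum(n for _, n in subgroups.values())
weighted_acc = sum(acc * n for acc, n in subgroups.values()) / total
plain_mean = sum(acc for acc, _ in subgroups.values()) / len(subgroups)
print(f"size-weighted: {weighted_acc:.3f}  plain mean: {plain_mean:.3f}")
```
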
Lines changed: 10 additions & 0 deletions
group: longbench2_history
group_alias: "Long-dialogue History Understanding"
task:
  - longbench2_agent_history
  - longbench2_dialogue_history
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
Lines changed: 11 additions & 0 deletions
group: longbench2_incontext
group_alias: "Long In-context Learning"
task:
  - longbench2_user_guide
  - longbench2_translate
  - longbench2_many_shot
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
Lines changed: 12 additions & 0 deletions
group: longbench2_multi
group_alias: "Multi-Document QA"
task:
  - longbench2_govt_multi
  - longbench2_academic_multi
  - longbench2_fin_multi
  - longbench2_news_multi
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
Lines changed: 15 additions & 0 deletions
group: longbench2_single
group_alias: "Single-Document QA"
task:
  - longbench2_govt_single
  - longbench2_legal_single
  - longbench2_lit_single
  - longbench2_fin_single
  - longbench2_event_order
  - longbench2_academic_single
  - longbench2_detective
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
Lines changed: 10 additions & 0 deletions
group: longbench2_structured
group_alias: "Long Structured Data Understanding"
task:
  - longbench2_graph
  - longbench2_table
aggregate_metric_list:
  - metric: acc
    weight_by_size: True
metadata:
  version: 0.0
Lines changed: 12 additions & 0 deletions
dataset_path: recursal/longbench-v2
test_split: train
output_type: multiple_choice
doc_to_text: "Please read the following text and answer the question below.\n\n<text>\n{{context}}\n</text>\n\nWhat is the correct answer to this question: {{question.strip()}}\nChoices:\n(A) {{choices[0]}}\n(B) {{choices[1]}}\n(C) {{choices[2]}}\n(D) {{choices[3]}}\n\nAnswer:"
doc_to_choice: ["A", "B", "C", "D"]
doc_to_target: answer
metric_list:
  - metric: acc
    aggregation: mean
    higher_is_better: true
metadata:
  version: 0.0
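
To preview what the model actually sees, the `doc_to_text` template above can be rendered against a toy example. The field names (`context`, `question`, `choices`, `answer`) mirror the config; the content is invented.

```python
# Render the doc_to_text Jinja template for a made-up document to inspect the
# resulting prompt; requires jinja2 (the templating engine used by the harness).
from jinja2 import Template

DOC_TO_TEXT = (
    "Please read the following text and answer the question below.\n\n"
    "<text>\n{{context}}\n</text>\n\n"
    "What is the correct answer to this question: {{question.strip()}}\n"
    "Choices:\n"
    "(A) {{choices[0]}}\n(B) {{choices[1]}}\n(C) {{choices[2]}}\n(D) {{choices[3]}}\n\n"
    "Answer:"
)

doc = {
    "context": "A very long source document would go here...",
    "question": " Which statement is supported by the text? ",
    "choices": ["First option", "Second option", "Third option", "Fourth option"],
    "answer": "B",
}

print(Template(DOC_TO_TEXT).render(**doc))
```

Because `output_type` is `multiple_choice`, the harness scores the four letter continuations from `doc_to_choice` by likelihood, and `acc` is averaged over examples.
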
Lines changed: 6 additions & 0 deletions
include: _longbench_common_yaml
tag:
  - longbench2
  - longbench2_multi
task: longbench2_academic_multi
dataset_name: academic_multi
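
For reference, the raw examples behind this task can be inspected directly from the Hub. This is a hedged sketch: it assumes `dataset_name` corresponds to a config of `recursal/longbench-v2` and that the columns match the template fields in the common YAML.

```python
# Sketch: load the subset this task points at (dataset_path + dataset_name + test_split).
# Column names are an assumption based on the prompt template above.
from datasets import load_dataset

ds = load_dataset("recursal/longbench-v2", "academic_multi", split="train")
print(ds.column_names)      # expected to include context, question, choices, answer
print(ds[0]["question"])
```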
