# LongBench v2

### Paper

Title: `LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks`

Abstract: `This paper introduces LongBench v2, a benchmark designed to assess the ability of LLMs to handle long-context problems requiring deep understanding and reasoning across real-world multitasks. LongBench v2 consists of 503 challenging multiple-choice questions, with contexts ranging from 8k to 2M words, across six major task categories: single-document QA, multi-document QA, long in-context learning, long-dialogue history understanding, code repository understanding, and long structured data understanding. To ensure the breadth and the practicality, we collect data from nearly 100 highly educated individuals with diverse professional backgrounds. We employ both automated and manual review processes to maintain high quality and difficulty, resulting in human experts achieving only 53.7% accuracy under a 15-minute time constraint. Our evaluation reveals that the best-performing model, when directly answers the questions, achieves only 50.1% accuracy. In contrast, the o1-preview model, which includes longer reasoning, achieves 57.7%, surpassing the human baseline by 4%. These results highlight the importance of enhanced reasoning ability and scaling inference-time compute to tackle the long-context challenges in LongBench v2.`

Homepage: `https://github.com/THUDM/LongBench`


### Citation

```
@article{bai2024longbench2,
  title={LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks},
  author={Yushi Bai and Shangqing Tu and Jiajie Zhang and Hao Peng and Xiaozhi Wang and Xin Lv and Shulin Cao and Jiazheng Xu and Lei Hou and Yuxiao Dong and Jie Tang and Juanzi Li},
  journal={arXiv preprint arXiv:2412.15204},
  year={2024}
}
```

### Groups, Tags, and Tasks

#### Groups

* `longbench2_single`: Single-document QA tasks requiring comprehension of documents across various domains (government, legal, literature, finance, academic, detective stories, and order of events)
* `longbench2_multi`: Multi-document QA tasks requiring information synthesis and reasoning across multiple documents in government, academic, finance, and news
* `longbench2_incontext`: Long in-context learning tasks including user guide comprehension, translation with examples, and many-shot learning scenarios
* `longbench2_history`: Long-dialogue history understanding tasks involving agent conversations and dialogue history comprehension
* `longbench2_structured`: Long structured data understanding tasks for graph and table data processing

#### Tags

* `longbench2`: Run the full benchmark with 503 multiple-choice questions (contexts of 8k-2M words) testing understanding and reasoning on long-context tasks
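
As a usage sketch (not part of the upstream benchmark definition), the tag, a group, or an individual task name can be passed to lm-evaluation-harness like any other task. The example below uses the library's `simple_evaluate` entry point; the model checkpoint is only a placeholder, so substitute whatever backend and model you actually want to evaluate.

```python
# Minimal sketch: evaluate the full LongBench v2 tag with lm-evaluation-harness.
# Swap "longbench2" for a group (e.g. "longbench2_single") or a single task
# (e.g. "longbench2_code") to run a subset. The pretrained model below is a
# placeholder, not a recommendation.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",
    tasks=["longbench2"],
    batch_size=1,
)

# Per-task metrics (e.g. accuracy) are returned under the "results" key.
print(results["results"])
```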

#### Tasks

**Single-Document QA:**
* `longbench2_govt_single`: Question answering from single government documents
* `longbench2_legal_single`: Question answering from single legal documents
* `longbench2_lit_single`: Question answering from single literary documents
* `longbench2_fin_single`: Question answering from single financial documents
* `longbench2_academic_single`: Question answering from single academic papers and research documents
* `longbench2_detective`: Question answering from detective stories requiring logical reasoning
* `longbench2_event_order`: Temporal reasoning tasks about event ordering in narratives

**Multi-Document QA:**
* `longbench2_govt_multi`: Question answering across multiple government documents
* `longbench2_academic_multi`: Question answering across multiple academic papers
* `longbench2_fin_multi`: Question answering across multiple financial documents
* `longbench2_news_multi`: Question answering across multiple news articles

**Long In-context Learning:**
* `longbench2_user_guide`: Comprehension and application of user guide instructions
* `longbench2_translate`: Translation involving new languages learned from long in-context examples
* `longbench2_many_shot`: Many-shot in-context learning with a large number of examples provided in the context

**Long-dialogue History Understanding:**
* `longbench2_agent_history`: Understanding and reasoning over extended agent conversation histories
* `longbench2_dialogue_history`: Understanding and reasoning over long dialogue exchanges

**Code Repository Understanding:**
* `longbench2_code`: Question answering on code repositories requiring codebase comprehension

**Long Structured Data Understanding:**
* `longbench2_graph`: Understanding and reasoning over graph-structured data
* `longbench2_table`: Understanding and reasoning over tabular data
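
To inspect raw examples outside the harness, the data can be loaded directly. The snippet below is a sketch that assumes the Hugging Face Hub id `THUDM/LongBench-v2`, a single `train` split, and field names (`question`, `choice_A`-`choice_D`, `answer`, `domain`, `difficulty`, `context`) taken from the upstream release; verify them against the homepage if anything differs.

```python
# Hypothetical inspection script: peek at a few LongBench v2 questions.
# Assumes the Hugging Face dataset id "THUDM/LongBench-v2" and the field
# names used below; verify both against https://github.com/THUDM/LongBench.
from datasets import load_dataset

ds = load_dataset("THUDM/LongBench-v2", split="train")

for example in ds.select(range(3)):
    print(example["domain"], example["difficulty"])
    print("Q:", example["question"])
    for letter in ("A", "B", "C", "D"):
        print(f"  {letter}.", example[f"choice_{letter}"])
    print("Gold:", example["answer"])  # one of "A"-"D"
    print("Context length (words):", len(example["context"].split()))
    print("-" * 40)
```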

### Checklist

For adding novel benchmarks/datasets to the library:
* [x] Is the task an existing benchmark in the literature?
  * [x] Have you referenced the original paper that introduced the task?
  * [ ] If yes, does the original paper provide a reference implementation? If so, have you checked against the reference implementation and documented how to run such a test?


If other tasks on this dataset are already supported:
* [ ] Is the "Main" variant of this task clearly denoted?
* [ ] Have you provided a short sentence in a README on what each new variant adds / evaluates?
* [ ] Have you noted which, if any, published evaluation setups are matched by this variant?