Hi lm-eval team,
Are there any plans to add the LongBench v2, BABILong, InfiniteBench, and Phonebook datasets to the evaluation tasks? They are useful benchmarks for long-context LLM evaluation.
LongBench v2: a second-generation long-text benchmark with 20 tasks (context lengths of 8k–2M words), focused on realistic deep-reasoning scenarios. https://github.com/THUDM/LongBench
BABILong: focuses on ultra-long-context reasoning (up to 10M tokens) across 20 tasks, testing fact inference over extended documents. https://github.com/booydar/babilong
InfiniteBench: a long-context suite (100k+ tokens) with 12 tasks (retrieval, math, code, etc.) for realistic, scenario-based evaluation. https://github.com/OpenBMB/InfiniteBench
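For context, here is a rough sketch of how I would hope to run these benchmarks once they are available; the task names below are placeholders I made up, not names that currently exist in the harness:

```python
# Hypothetical usage sketch: assumes these benchmarks have been registered
# as lm-eval tasks; the task names are placeholders, not real task IDs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # example model
    tasks=["longbench_v2", "babilong", "infinitebench"],  # placeholder task names
    batch_size=1,
)
print(results["results"])
```

Happy to help with task configs or testing if that would be useful.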