Hi lm-eval team,
Are there any plans to add the LongBench v2, BABILong, InfiniteBench, and Phonebook datasets to the evaluation tasks? They are useful benchmarks for long-context LLM evaluation.
LongBench v2: a second-generation long-text benchmark with 20 tasks (context lengths of 8k–2M words), focused on realistic deep-reasoning scenarios. https://github.com/THUDM/LongBench
BABILong: focuses on ultra-long-context reasoning (up to 10M tokens) across 20 tasks, testing fact inference over extended documents. https://github.com/booydar/babilong
InfiniteBench: a long-context suite (100k+ tokens) with 12 tasks (retrieval, math, code, etc.) for realistic, scenario-based evaluation. https://github.com/OpenBMB/InfiniteBench
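For context, here is a rough sketch of how I would hope to run these benchmarks once they are available; the task names below are placeholders I made up, not names that currently exist in the harness:

```python
# Hypothetical usage sketch: assumes these benchmarks have been registered
# as lm-eval tasks; the task names are placeholders, not real task IDs.
import lm_eval

results = lm_eval.simple_evaluate(
    model="hf",
    model_args="pretrained=meta-llama/Llama-3.1-8B-Instruct",  # example model
    tasks=["longbench_v2", "babilong", "infinitebench"],  # placeholder task names
    batch_size=1,
)
print(results["results"])
```

Happy to help with task configs or testing if that would be useful.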