
Add evaluation script for memory-augmented models (A-Mem, Mem0, etc.) on LongBench & LongBench v2 #132

@xcc-zach

Description


Hello there! Thank you very much for your great work. Below is a potential improvement to make the benchmark more universal!

Background

LongBench and LongBench v2 are now standard long-context benchmarks, but the official repo only evaluates models that read the entire context at once (GitHub, longbench2.github.io). Memory-centric methods such as A-Mem (arXiv) and Mem0 (mem0.ai) process documents incrementally with external memory, so they cannot be fairly compared using the current scripts.

Feature request

Add a built-in evaluation pipeline (e.g. memory_eval.py) that:

  1. Streams each task context in fixed-size chunks to a user-supplied MemoryWrapper.
  2. Lets the wrapper retrieve/update memories and call the model to answer the query.
  3. Emits results in the same JSON format accepted by the LongBench leaderboard.

Minimal interface example:

class MemoryWrapper:
    def reset(self) -> None: ...              # clear memory state before a new task instance
    def feed(self, chunk: str) -> None: ...   # ingest one context chunk and update external memory
    def answer(self, query: str) -> str: ...  # retrieve relevant memories and answer the query
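
For concreteness, below is a rough sketch of the driver loop such a script could implement: it streams each context in fixed-size character chunks and writes one JSON line per sample. The field names (input, context, answers, all_classes, length) and the prediction schema are assumptions based on the LongBench dataset and evaluation scripts, and should be adjusted to match the repo.

import json

def evaluate(wrapper: MemoryWrapper, samples: list[dict], out_path: str,
             chunk_size: int = 4096) -> None:
    # Stream every sample through the user-supplied wrapper and dump predictions as JSONL.
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in samples:
            wrapper.reset()                                   # fresh memory for each task instance
            context = sample["context"]
            for start in range(0, len(context), chunk_size):  # fixed-size character chunks
                wrapper.feed(context[start:start + chunk_size])
            pred = wrapper.answer(sample["input"])            # "input" holds the query in LongBench
            f.write(json.dumps({
                "pred": pred,                                 # assumed leaderboard-compatible schema
                "answers": sample.get("answers"),
                "all_classes": sample.get("all_classes"),
                "length": sample.get("length"),
            }, ensure_ascii=False) + "\n")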

This would enable direct benchmarking of A-Mem, Mem0 and similar frameworks alongside vanilla long-context LLMs, without extra glue code.
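
To illustrate how little glue code this would require, a toy wrapper using naive keyword-overlap retrieval is sketched below; ExampleMemoryAdapter, llm_fn, and the retrieval heuristic are hypothetical placeholders, not the actual A-Mem or Mem0 APIs, which would instead plug in their own memory stores and retrieval/generation calls.

class ExampleMemoryAdapter(MemoryWrapper):
    def __init__(self, llm_fn):
        self.llm_fn = llm_fn         # user-supplied callable: prompt str -> answer str
        self.chunks: list[str] = []  # stand-in for a real external memory store

    def reset(self) -> None:
        self.chunks.clear()

    def feed(self, chunk: str) -> None:
        self.chunks.append(chunk)

    def answer(self, query: str) -> str:
        q = set(query.lower().split())
        top = sorted(self.chunks,                       # crude keyword-overlap "retrieval"
                     key=lambda c: len(q & set(c.lower().split())),
                     reverse=True)[:3]
        prompt = "\n\n".join(top) + "\n\nQuestion: " + query + "\nAnswer:"
        return self.llm_fn(prompt)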

