
Add evaluation script for memory-augmented models (A-Mem, Mem0, etc.) on LongBench & LongBench v2 #132

@xcc-zach

Description


Hello there! Thank you very much for your great work. Below is a potential improvement to make the benchmark more universal!

Background

LongBench and LongBench v2 are now standard long-context benchmarks, but the official repo only evaluates models that read the entire context at once (GitHub, longbench2.github.io). Memory-centric methods such as A-Mem (arXiv) and Mem0 (mem0.ai) process documents incrementally with external memory, so they cannot be fairly compared using the current scripts.

Feature request

Add a built-in evaluation pipeline (e.g. memory_eval.py) that:

  1. Streams each task context in fixed-size chunks to a user-supplied MemoryWrapper.
  2. Lets the wrapper retrieve/update memories and call the model to answer the query.
  3. Emits results in the same JSON format accepted by the LongBench leaderboard.

Minimal interface example:

class MemoryWrapper:
    def reset(self) -> None: ...              # clear memory state before a new task instance
    def feed(self, chunk: str) -> None: ...   # ingest one context chunk and update external memory
    def answer(self, query: str) -> str: ...  # retrieve relevant memories and answer the query
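
For concreteness, below is a rough sketch of the driver loop such a script could implement: it streams each context in fixed-size character chunks and writes one JSON line per sample. The field names (input, context, answers, all_classes, length) and the prediction schema are assumptions based on the LongBench dataset and evaluation scripts, and should be adjusted to match the repo.

import json

def evaluate(wrapper: MemoryWrapper, samples: list[dict], out_path: str,
             chunk_size: int = 4096) -> None:
    # Stream every sample through the user-supplied wrapper and dump predictions as JSONL.
    with open(out_path, "w", encoding="utf-8") as f:
        for sample in samples:
            wrapper.reset()                                   # fresh memory for each task instance
            context = sample["context"]
            for start in range(0, len(context), chunk_size):  # fixed-size character chunks
                wrapper.feed(context[start:start + chunk_size])
            pred = wrapper.answer(sample["input"])            # "input" holds the query in LongBench
            f.write(json.dumps({
                "pred": pred,                                 # assumed leaderboard-compatible schema
                "answers": sample.get("answers"),
                "all_classes": sample.get("all_classes"),
                "length": sample.get("length"),
            }, ensure_ascii=False) + "\n")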

This would enable direct benchmarking of A-Mem, Mem0 and similar frameworks alongside vanilla long-context LLMs, without extra glue code.
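
To illustrate how little glue code this would require, a toy wrapper using naive keyword-overlap retrieval is sketched below; ExampleMemoryAdapter, llm_fn, and the retrieval heuristic are hypothetical placeholders, not the actual A-Mem or Mem0 APIs, which would instead plug in their own memory stores and retrieval/generation calls.

class ExampleMemoryAdapter(MemoryWrapper):
    def __init__(self, llm_fn):
        self.llm_fn = llm_fn         # user-supplied callable: prompt str -> answer str
        self.chunks: list[str] = []  # stand-in for a real external memory store

    def reset(self) -> None:
        self.chunks.clear()

    def feed(self, chunk: str) -> None:
        self.chunks.append(chunk)

    def answer(self, query: str) -> str:
        q = set(query.lower().split())
        top = sorted(self.chunks,                       # crude keyword-overlap "retrieval"
                     key=lambda c: len(q & set(c.lower().split())),
                     reverse=True)[:3]
        prompt = "\n\n".join(top) + "\n\nQuestion: " + query + "\nAnswer:"
        return self.llm_fn(prompt)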

