elephantmemory

A benchmark harness comparing how LLM memory frameworks behave on the failure modes developers actually hit in production: stale-fact handling, multi-tenant isolation, deletion compliance, write cost, and concurrency — not just the read-only QA accuracy that existing benchmarks (LoCoMo, LongMemEval) measure.

What it tests

A simulated personal assistant ("Atlas") and a second user ("Brooke") talk to a memory-backed agent across many sessions covering travel, health, work, and personal life. Each scenario probes one of eight failure modes:

Category	What it probes
recall	Sanity baseline — facts stated once, queried later
temporal_supersede	New fact replaces old fact; query must return the new one
implicit_update	Fact derived from behavior, not stated
cross_session_reasoning	Answer joins facts from sessions far apart
abstention	Was something never said? Hallucination check
forget	Erasure request; future queries must fail
isolation	User B asks about User A's facts. Must fail.
concurrency	Parallel writers to same fact. Final state predictable?

Frameworks (Tier 1)

Adapter	Status	Memory model
`pgvector_diy`	✅ functional	Vanilla embeddings + LLM-extracted facts in Postgres. The "no framework" baseline.
`claude_memory`	✅ functional	Anthropic memory tool — agent curates a filesystem via `BetaLocalFilesystemMemoryTool`
`mem0`	✅ functional	Vector-extracted facts; configured to share the pgvector backend with the DIY baseline
`zep`	✅ functional	Bitemporal knowledge graph via graphiti-core OSS against Neo4j
`letta`	✅ functional	Agent self-edits OS-style memory blocks; one agent per user against self-hosted Letta server

Tier 2 (LangMem, LlamaIndex, Cognee, Mastra, AWS AgentCore) deferred until Tier 1 results are stable.

Decisions

Assistant + judge model: Claude Sonnet 4.6 for both. Same family avoids provider home-field bias.
Embeddings: text-embedding-3-small (OpenAI). Most adapters default to OpenAI embeddings; using one model across all adapters keeps retrieval comparable.
Hosting: self-host every framework. No Mem0 Platform / Zep Cloud — managed perf is uneven and paywalled.
LLM calls: real API calls for the assistant turn (honest latency/cost). Judge calls cached by (probe_id, response_hash).
Scenarios: hand-authored to fit the Atlas application. Not seeded from LongMemEval — the gaps in existing benchmarks are exactly what we're trying to fill.
Two users: Atlas + Brooke. Required for isolation tests.

Quickstart

cp .env.example .env  # fill in ANTHROPIC_API_KEY and OPENAI_API_KEY
docker compose up -d postgres
pip install -e ".[dev]"
elephantmemory run --adapter pgvector_diy --scenarios scenarios/
elephantmemory report results/runs/<run_id>

To enable additional adapters:

pip install -e ".[mem0]"     # mem0
pip install -e ".[zep]"      # graphiti-core
pip install -e ".[letta]"    # letta-client; also: docker compose up -d letta

Layout

elephantmemory/
  types.py            # Scenario, Session, Probe, WriteResult, QueryResult
  llm.py              # Anthropic + OpenAI clients with cost tracking
  cost.py             # Token → $ pricing tables
  scoring.py          # exact / contains / must_not_contain
  judge.py            # LLM-as-judge with sqlite cache
  runner.py           # Replays scenario events, captures metrics
  scenarios.py        # YAML loader
  cli.py              # `elephantmemory run|report`
  adapters/
    base.py           # MemoryAdapter Protocol
    pgvector_diy.py
    claude_memory.py  # skeleton
    mem0_adapter.py   # skeleton
    zep_adapter.py    # skeleton
    letta_adapter.py  # skeleton
scenarios/            # YAML scenarios (one per file)
results/
  runs/               # raw per-run output (gitignored)
  snapshots/          # canonical results checked in

Adding a new adapter

Implement MemoryAdapter from elephantmemory/adapters/base.py. The runner calls setup once, then for each scenario event in timestamp order calls one of record_session, query, or forget, then stats at the end.

Keep adapter code thin — push framework-specific config into __init__ kwargs so the runner can stay framework-agnostic.

Running on Modal (no local docker)

If you don't want to run Postgres locally, run the benchmark on Modal. One-time setup:

pip install -e ".[mem0,modal]"
modal token new
modal secret create elephantmemory \
    ANTHROPIC_API_KEY=... OPENAI_API_KEY=...

Then:

modal run modal_app.py --adapters pgvector_diy,mem0,claude_memory

The function boots Postgres (with pgvector built from source at image-build time, so it's cached after first build), runs the benchmark, persists results/runs/<run_id>/ to a Modal Volume, and the local entrypoint pulls results.json back to results/runs/modal_<run_id>/.

Three functions, one per infra profile:

Function	Adapters	Image	RAM
`run_lite`	pgvector_diy, claude_memory, mem0	postgres + pgvector	4 GB
`run_zep`	zep	OpenJDK 17 + Neo4j 5.20	8 GB
`run_letta`	letta	letta server (pip) + sqlite	4 GB

The local entrypoint fans out across all three in parallel via Function.spawn(...) — a full 5-adapter run is roughly the wall-clock of the slowest profile (usually run_letta).

modal run modal_app.py --adapters pgvector_diy,mem0,claude_memory,zep,letta

Results land in results/runs/modal_<profile>_<run_id>/.

Weekly regression + trend report

Once you've deployed the app, a weekly_regression function fires every Sunday at 06:00 UTC, runs all 5 adapters across the full corpus, and lands results in the shared Volume:

modal deploy modal_app.py

To pull a markdown report comparing every run in the volume's history, flagging probes that flipped between the last two runs and any (adapter, category) cell whose pass rate dropped >10% over the last 5 runs:

modal run modal_app.py::report
# → results/trend_report.md

Running multiple adapters at once

docker compose up -d postgres neo4j letta
elephantmemory run \
  --adapter pgvector_diy \
  --adapter claude_memory \
  --adapter mem0 \
  --adapter zep \
  --adapter letta \
  --scenarios scenarios/

Estimated cost for the 8-scenario starter set across all 5 adapters at Sonnet 4.6 prices: ~$1–3 per full run, depending on how chatty Letta gets internally and how many extraction calls mem0/Zep make. Add the dependencies you actually want with pip install -e ".[mem0,zep,letta]".

Status

v0.1 — all five Tier 1 adapters functional. Awaiting first canonical results snapshot (pending postgres + neo4j + letta containers running with real API keys); will land under results/snapshots/ and be referenced from the report.

Name		Name	Last commit message	Last commit date
Latest commit History 22 Commits
elephantmemory		elephantmemory
scenarios		scenarios
.env.example		.env.example
.gitignore		.gitignore
README.md		README.md
docker-compose.yml		docker-compose.yml
modal_app.py		modal_app.py
pyproject.toml		pyproject.toml

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

elephantmemory

What it tests

Frameworks (Tier 1)

Decisions

Quickstart

Layout

Adding a new adapter

Running on Modal (no local docker)

Weekly regression + trend report

Running multiple adapters at once

Status

About

Uh oh!

Releases

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

elephantmemory

What it tests

Frameworks (Tier 1)

Decisions

Quickstart

Layout

Adding a new adapter

Running on Modal (no local docker)

Weekly regression + trend report

Running multiple adapters at once

Status

About

Resources

Uh oh!

Stars

Watchers

Forks

Releases

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages