EverMemOS Evaluation Framework

A unified, modular evaluation framework for benchmarking memory systems on standard datasets.

📖 Overview

Evaluation Scope

In addition to EverMemOS, this framework supports evaluation of several influential memory systems in the industry:

Mem0
MemOS
Zep
MemU

These systems were selected based on recent industry benchmarks and their prominence in global markets. Since many commercial systems have web-based optimizations not available in their open-source versions, we evaluate them through their online API interfaces to ensure fair comparison with production-grade capabilities.

Implementation

Our adapter implementations are based on:

Official open-source repositories: Mem0, MemOS, Zep on GitHub
Official documentation: Mem0, MemOS, MemU, Zep quick start guide and API documentation
Consistent methodology: All systems evaluated using the same pipeline, datasets, and metrics
Unified answer generation: All systems use GPT-4.1-mini as the answer LLM to ensure fair comparison across different memory backends

During our evaluation, we identified several issues in existing open-source reference implementations for benchmarking these systems that could negatively impact their performance. We addressed these implementation gaps to ensure each system is evaluated at its best potential:

Mem0 timezone handling: The latest version returns timestamps in PDT format in search results, requiring additional timezone conversion for accurate temporal reasoning.
MemU retrieval enhancement: While some memories are visible in the backend dashboard, the /memory/retrieve/related-memory-items API likely relies on simple vector-based retrieval, which may miss relevant context. Following the official documentation examples, we included category summaries as additional context to improve recall.
Zep API migration: Zep's official open-source evaluation implementation was based on the earlier v2 API. Since Zep has officially upgraded to v3 API, we migrated the evaluation code to v3 following the official documentation to benchmark the latest capabilities.
Zep timestamp semantics: Unlike most memory systems that record conversation timestamps, Zep records event occurrence timestamps. For example, a conversation on March 2nd mentioning "Anna ate a burger yesterday" would be timestamped March 1st, with the memory content preserving the original phrasing. Using standard answer prompts leads to significant errors on temporal questions. Zep's team provides optimized prompts in their open-source evaluation code to handle this. This informed one of our evaluation principles: each memory system uses its own official answer prompts rather than a unified prompt template, ensuring fair evaluation of each system's intended usage.

Evaluation Results

Results on Locomo

Locomo	single hop	multi hop	temporal	open domain	Overall	Average Tokens	Version	Answer LLM
Full-context	94.93	90.43	87.95	71.88	91.21	20281		gpt-4.1-mini
Mem0	68.97	61.70	58.26	50.00	64.20	1016	web API/v1.0.0 (2025.11)	gpt-4.1-mini
Zep	90.84	81.91	77.26	75.00	85.22	1411	web API/v3 (2025.11)	gpt-4.1-mini
MemOS	85.37	79.43	75.08	64.58	80.76	2498	web API/v1 (2025.11)	gpt-4.1-mini
MemU	74.91	72.34	43.61	54.17	66.67	3964	web API/v1 (2025.11)	gpt-4.1-mini
EverMemOS	96.08	91.13	89.72	70.83	92.32	2298	open-source EverMemOS v1.0.0 companion	gpt-4.1-mini

*Full-context: using the whole conversation as context for answering questions.

Results on Longmemeval

Longmemeval	Single-session-user	Single-session-assistant	Single-session-preference	Multi-session	Knowledge-update	Temporal-reasoning	Overall
EverMemOS	100.00	78.57	96.67	78.45	87.18	71.18	82.00

Note on Reproducibility: To ensure the reproducibility of our evaluation, we provide full evaluation intermediate data for all methods. You can access the data at EverMind-AI/EverMemOS_Eval_Results.

🌟 Key Features

Unified & Modular Framework

One codebase for all: No need to write separate code for each dataset or system
Plug-and-play systems: Support multiple memory systems (EverMemOS, Mem0, MemOS, MemU, etc.)
Multiple benchmarks: LoCoMo, LongMemEval, PersonaMem out of the box
Consistent evaluation: All systems evaluated with the same pipeline and metrics

Automatic Compatibility Detection

The framework automatically detects and adapts to:

Multi-user vs Single-user conversations: Handles both conversation types seamlessly
Q&A vs Multiple-choice questions: Adapts evaluation approach based on question format
With/without timestamps: Works with or without temporal information

Robust Checkpoint System

Cross-stage checkpoints: Resume from any pipeline stage (add → search → answer → evaluate)
Fine-grained resume: Saves progress every conversation (search) and every 400 questions (answer)

🏗️ Architecture Overview

Code Structure

evaluation/
├── src/
│   ├── core/           # Pipeline orchestration and data models
│   ├── adapters/       # System-specific implementations
│   ├── evaluators/     # Answer evaluation (LLM Judge, Exact Match)
│   ├── converters/     # Dataset format converters
│   └── utils/          # Configuration, logging, I/O
├── config/
│   ├── datasets/       # Dataset configurations (locomo.yaml, etc.)
│   ├── systems/        # System configurations (evermemos.yaml, etc.)
│   └── prompts.yaml    # Prompt templates
├── data/               # Benchmark datasets
└── results/            # Evaluation results and logs

Pipeline Flow

The evaluation consists of 4 sequential stages:

Add: Ingest conversations and build indexes
Search: Retrieve relevant memories for each question
Answer: Generate answers using retrieved context
Evaluate: Assess answer quality with LLM Judge or Exact Match

Each stage saves its output and can be resumed independently.

🚀 Getting Started

Prerequisites

Python 3.10+
EverMemOS environment configured (see main project's env.template)

Data Preparation

Place your dataset files in the evaluation/data/ directory:

LoCoMo (native format, no conversion needed): Get data from: https://github.com/snap-research/locomo/tree/main/data

evaluation/data/locomo/
└── locomo10.json

LongMemEval (auto-converts to LoCoMo format): Get data from: https://huggingface.co/datasets/xiaowu0162/longmemeval-cleaned

evaluation/data/longmemeval/
└── longmemeval_s_cleaned.json  # Original file
# → Will auto-generate: longmemeval_s_locomo_style.json

PersonaMem (auto-converts to LoCoMo format): Get data from: https://huggingface.co/datasets/bowen-upenn/PersonaMem

evaluation/data/personamem/
├── questions_32k.csv           # Original file
└── shared_contexts_32k.jsonl   # Original file
# → Will auto-generate: personamem_32k_locomo_style.json

The framework will automatically detect and convert non-LoCoMo formats on first run. You don't need to manually run any conversion scripts.

Installation

Install evaluation-specific dependencies:

# For evaluating local systems (EverMemOS)
uv sync --group evaluation

# For evaluating online API systems (Mem0, MemOS, MemU, etc.)
uv sync --group evaluation-full

Environment Configuration

The evaluation framework reuses most environment variables from the main EverMemOS .env file:

LLM_API_KEY, LLM_BASE_URL (for answer generation with GPT-4.1-mini)
VECTORIZE_API_KEY and RERANK_API_KEY (for embeddings/reranker)

⚠️ Important: For OpenRouter API (used by gpt-4.1-mini), make sure LLM_API_KEY is set to your OpenRouter API key (format: sk-or-v1-xxx). The system will look for API keys in this order:

Explicit api_key parameter in config
LLM_API_KEY environment variable

For testing EverMemOS, please first configure the whole .env file.

Additional variables for online API systems (add to .env if testing these systems):

# Mem0
MEM0_API_KEY=your_mem0_api_key

# MemOS
MEMOS_KEY=your_memos_api_key

# MemU
MEMU_API_KEY=your_memu_api_key

Quick Test (Smoke Test)

Run a quick test with limited data to verify everything works:

# Navigate to project root
cd /path/to/memsys-opensource

# Default: first conversation, first 10 messages, first 3 questions 
uv run python -m evaluation.cli --dataset locomo --system evermemos --smoke

# Custom: first conversation, 20 messages, 5 questions
uv run python -m evaluation.cli --dataset locomo --system evermemos \
    --smoke --smoke-messages 20 --smoke-questions 5

# You can also evaluate specific conversations with `--from-conv` and `--to-conv` (0-based, end exclusive):
uv run python -m evaluation.cli --dataset locomo --system evermemos_custom --from-conv 0 --to-conv 1

Full Evaluation

Run the complete benchmark:

# Evaluate EverMemOS on LoCoMo
uv run python -m evaluation.cli --dataset locomo --system evermemos

# Evaluate EverMemOS via local API (start server first)
uv run python src/run.py
# Use --clean-groups to clear existing data before Add stage
uv run python -m evaluation.cli --dataset locomo --system evermemos_local_api --clean-groups

# Evaluate other systems
uv run python -m evaluation.cli --dataset locomo --system memos
uv run python -m evaluation.cli --dataset locomo --system memu
# For mem0, it's recommended to run add first, check the memory status on the web console to make sure it's finished and then following stages.
uv run python -m evaluation.cli --dataset locomo --system mem0 --stages add
uv run python -m evaluation.cli --dataset locomo --system mem0 --stages search answer evaluate

# Evaluate on other datasets
uv run python -m evaluation.cli --dataset longmemeval --system evermemos
uv run python -m evaluation.cli --dataset personamem --system evermemos

# Use --run-name to distinguish multiple runs (useful for A/B testing)
# Results will be saved to: results/{dataset}-{system}-{run-name}/
uv run python -m evaluation.cli --dataset locomo --system evermemos --run-name baseline
uv run python -m evaluation.cli --dataset locomo --system evermemos --run-name experiment1
uv run python -m evaluation.cli --dataset locomo --system evermemos --run-name 20241107

# Resume from checkpoint if interrupted (automatic)
# Just re-run the same command - it will detect and resume from checkpoint
uv run python -m evaluation.cli --dataset locomo --system evermemos

View Results

Results are saved to evaluation/results/{dataset}-{system}[-{run-name}]/:

# View summary report
cat evaluation/results/locomo-evermemos/report.txt

# View detailed evaluation results
cat evaluation/results/locomo-evermemos/eval_results.json

# View pipeline execution log
cat evaluation/results/locomo-evermemos/pipeline.log

Result files:

report.txt - Summary metrics (accuracy, total questions)
eval_results.json - Per-question evaluation details
answer_results.json - Generated answers and retrieved context
search_results.json - Retrieved memories for each question
pipeline.log - Detailed execution logs

📊 Understanding Results

Metric

Accuracy: Percentage of correct answers (QA judged by LLM, multiple choice questions judged by exact match)

Detailed Results

Check eval_results.json for per-question breakdown:

LoCoMo example (Q&A format, evaluated by LLM Judge):

{
  "total_questions": ...,
  "correct": ...,
  "accuracy": ...,
  "detailed_results": {
      "locomo_exp_user_0": [
         {
            "question_id": "locomo_0_qa0",
            "question": "What is my favorite food?",
            "golden_answer": "Pizza",
            "generated_answer": "Your favorite food is pizza.",
            "judgments": [
               true,
               true,
               true
            ],
            "category": "1"
         }
         ...
      ]
  }
}

PersonaMem example (Multiple-choice format, evaluated by Exact Match):

{
  "overall_accuracy": ...,
  "total_questions": ...,
  "correct_count": ...,
  "detailed_results": [
    {
      "question_id": "acd74206-37dc-4756-94a8-b99a395d9a21",
      "question": "I recently attended an event where there was a unique blend of modern beats with Pacific sounds.",
      "golden_answer": "(c)",
      "generated_answer": "(c)",
      "is_correct": true,
      "category": "recall_user_shared_facts"
    }
    ...
  ]
}

🔧 Advanced Usage

Run Specific Stages

Skip completed stages to iterate faster:

# Only run search stage (if add is already done)
uv run python -m evaluation.cli --dataset locomo --system evermemos --stages search

# Run search, answer, and evaluate (skip add)
uv run python -m evaluation.cli --dataset locomo --system evermemos \
    --stages search answer evaluate

If you have already done search, and you want to do it again, please remove the "search" (and following stages from the completed_stages in the checkpoint_default.json file):

  "completed_stages": [
    "answer",
    "search",
    "evaluate",
    "add"
  ]

Custom Configuration

Modify system or dataset configurations:

# Copy and edit configuration
cp evaluation/config/systems/evermemos.yaml evaluation/config/systems/evermemos_custom.yaml
# Edit evermemos_custom.yaml with your changes

# Run with custom config
uv run python -m evaluation.cli --dataset locomo --system evermemos_custom

📄 License

Same as the parent project.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

EverMemOS Evaluation Framework

📖 Overview

Evaluation Scope

Implementation

Evaluation Results

🌟 Key Features

Unified & Modular Framework

Automatic Compatibility Detection

Robust Checkpoint System

🏗️ Architecture Overview

Code Structure

Pipeline Flow

🚀 Getting Started

Prerequisites

Data Preparation

Installation

Environment Configuration

Quick Test (Smoke Test)

Full Evaluation

View Results

📊 Understanding Results

Metric

Detailed Results

🔧 Advanced Usage

Run Specific Stages

Custom Configuration

📄 License

FilesExpand file tree

README.md

Latest commit

History

README.md

File metadata and controls

EverMemOS Evaluation Framework

📖 Overview

Evaluation Scope

Implementation

Evaluation Results

🌟 Key Features

Unified & Modular Framework

Automatic Compatibility Detection

Robust Checkpoint System

🏗️ Architecture Overview

Code Structure

Pipeline Flow

🚀 Getting Started

Prerequisites

Data Preparation

Installation

Environment Configuration

Quick Test (Smoke Test)

Full Evaluation

View Results

📊 Understanding Results

Metric

Detailed Results

🔧 Advanced Usage

Run Specific Stages

Custom Configuration

📄 License