Commit de789b2

Merge branch 'dev' into main

2 parents: 656d653 + fafc747

File tree

158 files changed (+19430, -2050 lines)

.gitignore

Lines changed: 1 addition & 0 deletions

@@ -15,6 +15,7 @@ evaluation/.env
 !evaluation/configs-example/*.json
 evaluation/configs/*
 **tree_textual_memory_locomo**
+**script.py**
 .env
 evaluation/scripts/personamem

README.md

Lines changed: 14 additions & 16 deletions

@@ -54,22 +54,20 @@
 
 ## 📈 Performance Benchmark
 
-MemOS demonstrates significant improvements over baseline memory solutions in multiple reasoning tasks.
-
-| Model | Avg. Score | Multi-Hop | Open Domain | Single-Hop | Temporal Reasoning |
-|-------------|------------|-----------|-------------|------------|---------------------|
-| **OpenAI** | 0.5275 | 0.6028 | 0.3299 | 0.6183 | 0.2825 |
-| **MemOS** | **0.7331** | **0.6430** | **0.5521** | **0.7844** | **0.7321** |
-| **Improvement** | **+38.98%** | **+6.67%** | **+67.35%** | **+26.86%** | **+159.15%** |
-
-> 💡 **Temporal reasoning accuracy improved by 159% compared to the OpenAI baseline.**
-
-### Details of End-to-End Evaluation on LOCOMO
-
-> [!NOTE]
-> Comparison of LLM Judge Scores across five major tasks in the LOCOMO benchmark. Each bar shows the mean evaluation score judged by LLMs for a given method-task pair, with standard deviation as error bars. MemOS-0630 consistently outperforms baseline methods (LangMem, Zep, OpenAI, Mem0) across all task types, especially in multi-hop and temporal reasoning scenarios.
-
-<img src="https://statics.memtensor.com.cn/memos/score_all_end2end.jpg" alt="END2END SCORE">
+MemOS demonstrates significant improvements over baseline memory solutions in multiple memory tasks,
+showcasing its capabilities in **information extraction**, **temporal and cross-session reasoning**, and **personalized preference responses**.
+
+| Model | LOCOMO | LongMemEval | PrefEval-10 | PersonaMem |
+|-----------------|-------------|-------------|-------------|-------------|
+| **GPT-4o-mini** | 52.75 | 55.4 | 2.8 | 43.46 |
+| **MemOS** | **75.80** | **77.80** | **71.90** | **61.17** |
+| **Improvement** | **+43.70%** | **+40.43%** | **+2568%** | **+40.75%** |
+
+### Detailed Evaluation Results
+- We use gpt-4o-mini as the processing and judging LLM and bge-m3 as the embedding model in the MemOS evaluation.
+- The evaluation was conducted with settings aligned as closely as possible across methods. Reproduce the results with our scripts at [`evaluation`](./evaluation).
+- Full search and response details are available on Hugging Face: https://huggingface.co/datasets/MemTensor/MemOS_eval_result
+> 💡 **MemOS outperforms all other methods (Mem0, Zep, Memobase, SuperMemory et al.) across all benchmarks!**
 
 
 ## ✨ Key Features
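The improvement row in the new benchmark table can be reproduced with a few lines of arithmetic; this is a quick sanity check on the diff's numbers, not part of the evaluation scripts:

```python
# Relative improvement of MemOS over the GPT-4o-mini baseline, per benchmark.
scores = {
    "LOCOMO":      (52.75, 75.80),
    "LongMemEval": (55.4,  77.80),
    "PersonaMem":  (43.46, 61.17),
}

for name, (baseline, memos) in scores.items():
    gain = (memos - baseline) / baseline * 100
    print(f"{name}: +{gain:.2f}%")

# PrefEval-10: the reported +2568% appears to correspond to the raw score
# ratio (71.90 / 2.8 * 100) rather than the relative change used above.
print(f"PrefEval-10 ratio: {71.90 / 2.8 * 100:.0f}%")
```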

docker/requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -157,4 +157,4 @@ volcengine-python-sdk==4.0.6
 watchfiles==1.1.0
 websockets==15.0.1
 xlrd==2.0.2
-xlsxwriter==3.2.5
+xlsxwriter==3.2.5

docs/openapi.json

Lines changed: 7 additions & 1 deletion

@@ -884,7 +884,7 @@
         "type": "string",
         "title": "Session Id",
         "description": "Session ID for the MOS. This is used to distinguish between different dialogue",
-        "default": "0ce84b9c-0615-4b9d-83dd-fba50537d5d3"
+        "default": "41bb5e18-252d-4948-918c-07d82aa47086"
       },
       "chat_model": {
         "$ref": "#/components/schemas/LLMConfigFactory",
@@ -939,6 +939,12 @@
         "description": "Enable parametric memory for the MemChat",
         "default": false
       },
+      "enable_preference_memory": {
+        "type": "boolean",
+        "title": "Enable Preference Memory",
+        "description": "Enable preference memory for the MemChat",
+        "default": false
+      },
       "enable_mem_scheduler": {
         "type": "boolean",
         "title": "Enable Mem Scheduler",
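The new `enable_preference_memory` flag sits alongside the other MemChat toggles in the schema. A minimal sketch of a config fragment that follows the fields shown in the diff; treating the payload as a plain dict (rather than a real MemOS client call) is an assumption:

```python
import json
import uuid

# Config fragment using only field names visible in the schema diff above.
config = {
    "session_id": str(uuid.uuid4()),    # "Session Id": distinguishes dialogues
    "enable_parametric_memory": False,  # schema default
    "enable_preference_memory": True,   # new flag added in this commit
    "enable_mem_scheduler": False,
}

payload = json.dumps(config)
print(json.loads(payload)["enable_preference_memory"])  # → True
```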

evaluation/.env-example

Lines changed: 14 additions & 8 deletions

@@ -3,21 +3,27 @@ MODEL="gpt-4o-mini"
 OPENAI_API_KEY="sk-***REDACTED***"
 OPENAI_BASE_URL="http://***.***.***.***:3000/v1"
 
-MEM0_API_KEY="m0-***REDACTED***"
-
-ZEP_API_KEY="z_***REDACTED***"
 
 # response model
 CHAT_MODEL="gpt-4o-mini"
 CHAT_MODEL_BASE_URL="http://***.***.***.***:3000/v1"
 CHAT_MODEL_API_KEY="sk-***REDACTED***"
 
+# memos
 MEMOS_KEY="Token mpg-xxxxx"
-MEMOS_URL="https://apigw-pre.memtensor.cn/api/openmem/v1"
-PRE_SPLIT_CHUNK=false # pre split chunk in client end
+MEMOS_URL="http://127.0.0.1:8001"
+MEMOS_ONLINE_URL="https://memos.memtensor.cn/api/openmem/v1"
+
+# other memory agents
+MEM0_API_KEY="m0-xxx"
+ZEP_API_KEY="z_xxx"
+MEMU_API_KEY="mu_xxx"
+SUPERMEMORY_API_KEY="sm_xxx"
+MEMOBASE_API_KEY="xxx"
+MEMOBASE_PROJECT_URL="http://***.***.***.***:8019"
 
-MEMOBASE_API_KEY="xxxxx"
-MEMOBASE_PROJECT_URL="http://xxx.xxx.xxx.xxx:8019"
+# eval settings
+PRE_SPLIT_CHUNK=false
 
 # Configuration Only For Scheduler
 # RabbitMQ Configuration
@@ -38,4 +44,4 @@ MEMSCHEDULER_GRAPHDBAUTH_URI=bolt://localhost:7687
 MEMSCHEDULER_GRAPHDBAUTH_USER=neo4j
 MEMSCHEDULER_GRAPHDBAUTH_PASSWORD=***
 MEMSCHEDULER_GRAPHDBAUTH_DB_NAME=neo4j
-MEMSCHEDULER_GRAPHDBAUTH_AUTO_CREATE=true
+MEMSCHEDULER_GRAPHDBAUTH_AUTO_CREATE=true
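The split between `MEMOS_URL` (local server) and `MEMOS_ONLINE_URL` (hosted service) suggests the eval scripts pick one endpoint per backend. A minimal sketch of reading these variables with `os.getenv`, using the defaults from the .env-example; the selection helper itself is hypothetical:

```python
import os

# Hypothetical helper mirroring the .env-example layout: choose the MemOS
# endpoint based on which backend name the eval scripts were given.
def memos_endpoint(backend: str) -> str:
    if backend == "memos-api":
        return os.getenv("MEMOS_URL", "http://127.0.0.1:8001")
    if backend == "memos-api-online":
        return os.getenv("MEMOS_ONLINE_URL",
                         "https://memos.memtensor.cn/api/openmem/v1")
    raise ValueError(f"unknown MemOS backend: {backend}")

print(memos_endpoint("memos-api"))  # local default, unless MEMOS_URL is set
```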

evaluation/README.md

Lines changed: 37 additions & 4 deletions

@@ -1,6 +1,6 @@
 # Evaluation Memory Framework
 
-This repository provides tools and scripts for evaluating the LoCoMo dataset using various models and APIs.
+This repository provides tools and scripts for evaluating the `LoCoMo`, `LongMemEval`, `PrefEval`, and `PersonaMem` datasets using various models and APIs.
 
 ## Installation
 
@@ -21,11 +21,33 @@ This repository provides tools and scripts for evaluating the LoCoMo dataset usi
 
 2. Copy the `configs-example/` directory to a new directory named `configs/`, and modify the configuration files inside it as needed. This directory contains model and API-specific settings.
 
+## Setup MemOS
+### local server
+```bash
+# modify {project_dir}/.env file and start the server
+uvicorn memos.api.server_api:app --host 0.0.0.0 --port 8001 --workers 8
+
+# configure {project_dir}/evaluation/.env file
+MEMOS_URL="http://127.0.0.1:8001"
+```
+### online service
+```bash
+# get your api key at https://memos-dashboard.openmem.net/cn/quickstart/
+# configure {project_dir}/evaluation/.env file
+MEMOS_KEY="Token mpg-xxxxx"
+MEMOS_ONLINE_URL="https://memos.memtensor.cn/api/openmem/v1"
+
+```
+
+## Supported frameworks
+We support `memos-api` and `memos-api-online` in our scripts,
+and provide unofficial implementations for the following memory frameworks: `zep`, `mem0`, `memobase`, `supermemory`, `memu`.
+
 
 ## Evaluation Scripts
 
 ### LoCoMo Evaluation
-⚙️ To evaluate the **LoCoMo** dataset using one of the supported memory frameworks — `memos`, `mem0`, or `zep`run the following [script](./scripts/run_locomo_eval.sh):
+⚙️ To evaluate the **LoCoMo** dataset using one of the supported memory frameworks, run the following [script](./scripts/run_locomo_eval.sh):
 
 ```bash
 # Edit the configuration in ./scripts/run_locomo_eval.sh
@@ -45,10 +67,21 @@ First prepare the dataset `longmemeval_s` from https://huggingface.co/datasets/x
 ./scripts/run_lme_eval.sh
 ```
 
-### prefEval Evaluation
+### PrefEval Evaluation
+Download `benchmark_dataset/filtered_inter_turns.json` from https://github.com/amazon-science/PrefEval/blob/main/benchmark_dataset/filtered_inter_turns.json and save it as `./data/prefeval/filtered_inter_turns.json`.
+To evaluate the **PrefEval** dataset, run the following [script](./scripts/run_prefeval_eval.sh):
+
+```bash
+# Edit the configuration in ./scripts/run_prefeval_eval.sh
+# Specify the model and memory backend you want to use (e.g., mem0, zep, etc.)
+./scripts/run_prefeval_eval.sh
+```
 
-### personaMem Evaluation
+### PersonaMem Evaluation
 get `questions_32k.csv` and `shared_contexts_32k.jsonl` from https://huggingface.co/datasets/bowen-upenn/PersonaMem and save them at `data/personamem/`
 ```bash
+# Edit the configuration in ./scripts/run_pm_eval.sh
+# Specify the model and memory backend you want to use (e.g., mem0, zep, etc.)
+# If you want to use MIRIX, edit the configuration in ./scripts/personamem/config.yaml
 ./scripts/run_pm_eval.sh
 ```
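The evaluation README above pairs each benchmark with one shell script. A hypothetical wrapper over that mapping, assuming only the four script paths listed in the README (the `run_benchmark` helper itself is illustrative, not part of the repo):

```python
import subprocess

# Benchmark -> eval script, as listed in evaluation/README.md.
EVAL_SCRIPTS = {
    "locomo":      "./scripts/run_locomo_eval.sh",
    "longmemeval": "./scripts/run_lme_eval.sh",
    "prefeval":    "./scripts/run_prefeval_eval.sh",
    "personamem":  "./scripts/run_pm_eval.sh",
}

def run_benchmark(name: str) -> None:
    """Run one benchmark's eval script; raises KeyError for unknown names."""
    script = EVAL_SCRIPTS[name]
    subprocess.run(["bash", script], check=True)
```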
