> 💡 **Temporal reasoning accuracy improved by 159% compared to the OpenAI baseline.**
### Details of End-to-End Evaluation on LOCOMO
> [!NOTE]
> Comparison of LLM Judge Scores across five major tasks in the LOCOMO benchmark. Each bar shows the mean evaluation score judged by LLMs for a given method-task pair, with standard deviation as error bars. MemOS-0630 consistently outperforms baseline methods (LangMem, Zep, OpenAI, Mem0) across all task types, especially in multi-hop and temporal reasoning scenarios.
- We use gpt-4o-mini as the processing and judging LLM and bge-m3 as the embedding model in the MemOS evaluation.
- The evaluation was conducted with settings aligned across all methods as closely as possible. Reproduce the results with our scripts at [`evaluation`](./evaluation).
- Check the full search and response details on Hugging Face: https://huggingface.co/datasets/MemTensor/MemOS_eval_result.
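
If you want to browse those raw results locally, one option (a minimal sketch, assuming the `huggingface_hub` CLI is installed) is to download the dataset snapshot:

```bash
# Fetch the published evaluation results dataset from Hugging Face
pip install -U huggingface_hub
huggingface-cli download MemTensor/MemOS_eval_result \
  --repo-type dataset \
  --local-dir ./MemOS_eval_result
```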
> 💡 **MemOS outperforms all other methods (Mem0, Zep, Memobase, SuperMemory et al.) across all benchmarks!**
`evaluation/README.md`
# Evaluation Memory Framework
This repository provides tools and scripts for evaluating the `LoCoMo`, `LongMemEval`, `PrefEval`, and `PersonaMem` datasets using various models and APIs.
## Installation
2. Copy the `configs-example/` directory to a new directory named `configs/`, and modify the configuration files inside it as needed. This directory contains model and API-specific settings.
We support `memos-api` and `memos-api-online` in our scripts, and we provide unofficial implementations for the following memory frameworks: `zep`, `mem0`, `memobase`, `supermemory`, and `memu`.
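
For example, a minimal way to set this up from the repository root (which files you then edit depends on the backend you choose):

```bash
# Create a local configs/ directory from the provided examples
cp -r configs-example/ configs/

# Then edit the copied files to set your model, API keys, and backend-specific options
```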
## Evaluation Scripts
### LoCoMo Evaluation
⚙️ To evaluate the **LoCoMo** dataset using one of the supported memory frameworks, run the following [script](./scripts/run_locomo_eval.sh):
```bash
# Edit the configuration in ./scripts/run_locomo_eval.sh
# Specify the model and memory backend you want to use (e.g., mem0, zep, etc.)
./scripts/run_locomo_eval.sh
```

### LongMemEval Evaluation

First prepare the dataset `longmemeval_s` from https://huggingface.co/datasets/x… and then run the following [script](./scripts/run_lme_eval.sh):

```bash
./scripts/run_lme_eval.sh
```
### PrefEval Evaluation
Download `benchmark_dataset/filtered_inter_turns.json` from https://github.com/amazon-science/PrefEval/blob/main/benchmark_dataset/filtered_inter_turns.json and save it as `./data/prefeval/filtered_inter_turns.json`.
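
For example, one way to fetch it (a minimal sketch, assuming the file is reachable through GitHub's usual `raw.githubusercontent.com` URL for that path):

```bash
# Create the target directory and download the PrefEval benchmark file
mkdir -p ./data/prefeval
curl -L -o ./data/prefeval/filtered_inter_turns.json \
  https://raw.githubusercontent.com/amazon-science/PrefEval/main/benchmark_dataset/filtered_inter_turns.json
```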
To evaluate the **PrefEval** dataset, run the following [script](./scripts/run_prefeval_eval.sh):
```bash
# Edit the configuration in ./scripts/run_prefeval_eval.sh
# Specify the model and memory backend you want to use (e.g., mem0, zep, etc.)
./scripts/run_prefeval_eval.sh
```
### PersonaMem Evaluation
Get `questions_32k.csv` and `shared_contexts_32k.jsonl` from https://huggingface.co/datasets/bowen-upenn/PersonaMem and save them at `data/personamem/`.
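
For example, both files can be pulled with the Hugging Face CLI (a minimal sketch, assuming `huggingface_hub` is installed and the two files sit at the root of the dataset repo):

```bash
# Download the two PersonaMem files into data/personamem/
huggingface-cli download bowen-upenn/PersonaMem \
  questions_32k.csv shared_contexts_32k.jsonl \
  --repo-type dataset \
  --local-dir data/personamem
```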
```bash
# Edit the configuration in ./scripts/run_pm_eval.sh
# Specify the model and memory backend you want to use (e.g., mem0, zep, etc.)
# If you want to use MIRIX, edit the configuration in ./scripts/personamem/config.yaml
./scripts/run_pm_eval.sh
```