Commit c1f1ca1

doc: Update readme eval result
1 parent f5a5744 commit c1f1ca1

1 file changed: +14 -16 lines changed

README.md

Lines changed: 14 additions & 16 deletions
@@ -54,22 +54,20 @@
 
 ## 📈 Performance Benchmark
 
-MemOS demonstrates significant improvements over baseline memory solutions in multiple reasoning tasks.
-
-| Model | Avg. Score | Multi-Hop | Open Domain | Single-Hop | Temporal Reasoning |
-|-------------|------------|-----------|-------------|------------|---------------------|
-| **OpenAI** | 0.5275 | 0.6028 | 0.3299 | 0.6183 | 0.2825 |
-| **MemOS** | **0.7331** | **0.6430** | **0.5521** | **0.7844** | **0.7321** |
-| **Improvement** | **+38.98%** | **+6.67%** | **+67.35%** | **+26.86%** | **+159.15%** |
-
-> 💡 **Temporal reasoning accuracy improved by 159% compared to the OpenAI baseline.**
-
-### Details of End-to-End Evaluation on LOCOMO
-
-> [!NOTE]
-> Comparison of LLM Judge Scores across five major tasks in the LOCOMO benchmark. Each bar shows the mean evaluation score judged by LLMs for a given method-task pair, with standard deviation as error bars. MemOS-0630 consistently outperforms baseline methods (LangMem, Zep, OpenAI, Mem0) across all task types, especially in multi-hop and temporal reasoning scenarios.
-
-<img src="https://statics.memtensor.com.cn/memos/score_all_end2end.jpg" alt="END2END SCORE">
+MemOS demonstrates significant improvements over baseline memory solutions in multiple memory tasks,
+showcasing its capabilities in **information extraction**, **temporal and cross-session reasoning**, and **personalized preference responses**.
+
+| Model | LOCOMO | LongMemEval | PrefEval-10 | PersonaMem |
+|-----------------|-------------|-------------|-------------|-------------|
+| **GPT-4o-mini** | 52.75 | 55.4 | 2.8 | 43.46 |
+| **MemOS** | **75.80** | **77.80** | **71.90** | **61.17** |
+| **Improvement** | **+43.70%** | **+40.43%** | **+2568%** | **+40.75%** |
+
+### Detailed Evaluation Results
+- We use gpt-4o-mini as the processing and judging LLM and bge-m3 as embedding model in MemOS evaluation.
+- The evaluation was conducted under conditions that align various settings as closely as possible. Reproduce the results with our scripts at [`evaluation`](./evaluation).
+- Check the full search and response details at huggingface https://huggingface.co/datasets/MemTensor/MemOS_eval_result.
+> 💡 **MemOS outperforms all other methods (Mem0, Zep, Memobase, SuperMemory et al.) across all benchmarks!**
 
 ## ✨ Key Features
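The **Improvement** row in the new table reads as a relative gain over the GPT-4o-mini baseline. The README does not state the formula, so the following is only a minimal sketch that assumes the row is computed as `(MemOS - baseline) / baseline * 100`. The helper name `relative_improvement` and the `scores` dictionary are hypothetical, introduced here for illustration. The scores themselves are copied from the table.

```python
# Scores copied from the new benchmark table: (GPT-4o-mini baseline, MemOS).
scores = {
    "LOCOMO": (52.75, 75.80),
    "LongMemEval": (55.4, 77.80),
    "PrefEval-10": (2.8, 71.90),
    "PersonaMem": (43.46, 61.17),
}


def relative_improvement(baseline: float, new: float) -> float:
    """Assumed formula: percentage gain of `new` over `baseline`."""
    return (new - baseline) / baseline * 100.0


# Recompute the Improvement row, rounded to two decimals like the table.
improvements = {
    name: round(relative_improvement(base, memos), 2)
    for name, (base, memos) in scores.items()
}

for name, pct in improvements.items():
    print(f"{name}: +{pct:.2f}%")
```

Under this assumption the LOCOMO, LongMemEval, and PersonaMem entries reproduce as +43.70%, +40.43%, and +40.75%. PrefEval-10 evaluates to about +2467.86%. The listed +2568% instead matches the raw ratio 71.90 / 2.8 ≈ 25.68 expressed as a percentage.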
