[Architecture RFC] Resolving Step Latency and Implementing the "Pending" Embedding Retrieval in Memory Modules #190
Disclaimer: Following Jackie and Ewout's notes on maintainer bandwidth, this is strictly a conceptual RFC. No PRs will be opened until the community has time to breathe and align.
1. Motivation: The Latency Bottleneck Across Memory Implementations
As `mesa-llm` pushes toward production stability, managing the context window without exploding API latency is a primary friction point. After diving deep into the current memory implementations, I noticed two distinct $O(N)$ LLM-call bottlenecks during the `step()` phase:
- `_st_lt_memory`: `_update_long_term_memory()` executes a blocking, sequential LLM call to summarize and consolidate data whenever the short-term deque fills up.
- `EpisodicMemory`: while implementing a brilliant Generative Agents architecture, `grade_event_importance()` currently triggers an LLM generation call for every single memory entry added, just to evaluate its importance score.

Both approaches cause compounding step latency and high token costs in long-running, multi-agent simulations.
2. Conceptual Design: Fulfilling the "Pending" Relevance Retrieval
I noticed a crucial design note left in `episodic_memory.py` regarding the Top-K retrieval formula: the relevance (embedding) term is the piece still marked as pending. Drawing from my background in building academic RAG systems, and building directly on @BhoomiAgrawal12's excellent recent discussion (#189) about decoupling memory instantiation, I propose completing this architecture. The goal is to shift from LLM-based maintenance (summarization/grading) to local vector retrieval (RAG).
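For context (my reading of that note, not a verbatim quote): the Generative Agents-style retrieval score referenced there is typically a weighted sum of three terms, and the relevance term, cosine similarity between a query embedding and each memory's embedding, is exactly the piece this RFC would supply:

$$\text{score}(m) = \alpha_{\text{rec}}\,\text{recency}(m) + \alpha_{\text{imp}}\,\text{importance}(m) + \alpha_{\text{rel}}\,\text{relevance}(m)$$

The Top-K memories by this score would then be injected into the step prompt.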
Core Mechanisms:
- By converting the current `Observation` (or step prompt) into a query embedding, the system would dynamically compute the cosine similarity against historical memory embeddings to fetch only the strictly relevant context.
- During `process_step`, the module would simply convert the stringified `MemoryEntry` into a lightweight embedding (e.g., utilizing `litellm`'s existing embedding capabilities). This would drop the memory-maintenance API latency to near zero.
- Embeddings would be stored as a `numpy.ndarray`. The cosine similarity for the "Relevance" score could then be instantly calculated via `numpy.dot()`.

A rough sketch of this retrieval path is shown below.
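To make this concrete, here is a minimal sketch of the pure-NumPy store and Top-K retrieval path. It assumes nothing about existing `mesa-llm` internals: `VectorStore`, `add`, `top_k`, and the `embed()` helper are all placeholder names, and the `litellm` call shown in the docstring is only one possible backend.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder embedding backend. A litellm-based version might look like:
        resp = litellm.embedding(model="text-embedding-3-small", input=[text])
        return np.array(resp.data[0]["embedding"])
    Swap in whatever backend mesa-llm standardizes on."""
    raise NotImplementedError


class VectorStore:
    """In-memory embedding store: one matrix row per stringified MemoryEntry."""

    def __init__(self, dim: int):
        self._matrix = np.empty((0, dim), dtype=np.float32)  # shape: (n_entries, dim)
        self._entries: list[str] = []                        # stringified MemoryEntry texts

    def add(self, entry_text: str) -> None:
        """Embed the entry once at write time; no LLM grading/summarization call."""
        vec = embed(entry_text).astype(np.float32)
        vec /= np.linalg.norm(vec) + 1e-12                   # pre-normalize for cosine similarity
        self._matrix = np.vstack([self._matrix, vec])
        self._entries.append(entry_text)

    def top_k(self, query_text: str, k: int = 5) -> list[tuple[str, float]]:
        """Return the k most relevant entries for the current Observation/step prompt."""
        if not self._entries:
            return []
        q = embed(query_text).astype(np.float32)
        q /= np.linalg.norm(q) + 1e-12
        scores = self._matrix @ q                             # cosine similarity via one matrix-vector product
        idx = np.argsort(scores)[::-1][:k]
        return [(self._entries[i], float(scores[i])) for i in idx]
```

With pre-normalized rows, retrieval per step is a single matrix-vector product plus an `argsort`, with no LLM call on the hot path.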
3. Memory Safety & Scale
For those reasonably concerned about storing embeddings in-memory during extremely long simulations (OOM risk):
- A `max_capacity` parameter in `EpisodicMemory` (or a dedicated `VectorRAGMemory`) would prune the NumPy matrix, ensuring a strict upper bound on memory usage (see the sketch below).
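Building on the hypothetical `VectorStore` above, `max_capacity` pruning could be as simple as FIFO eviction on the matrix rows (again, the class name, parameter, and default value are illustrative only, not existing `mesa-llm` API):

```python
class BoundedVectorStore(VectorStore):
    """VectorStore with a hard cap on stored embeddings (FIFO eviction)."""

    def __init__(self, dim: int, max_capacity: int = 2_000):
        super().__init__(dim)
        self.max_capacity = max_capacity

    def add(self, entry_text: str) -> None:
        super().add(entry_text)
        overflow = len(self._entries) - self.max_capacity
        if overflow > 0:
            # Drop the oldest rows so the matrix never exceeds max_capacity entries.
            self._matrix = self._matrix[overflow:]
            self._entries = self._entries[overflow:]
```

Eviction keeps both the matrix and the entry list at a fixed size, so memory use stays flat no matter how long the simulation runs.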
4. Open Questions for Maintainers
Whenever the team has the bandwidth to review, I'd love your thoughts:
- Should this live in a dedicated `VectorRAGMemory` class, or should the pure-NumPy embedding logic be integrated directly into `EpisodicMemory` to complete its intended Top-K retrieval design?
- Is `numpy` considered an acceptable lightweight dependency for handling these internal matrix operations within `mesa-llm`?

Thank you for your continuous hard work on the Mesa ecosystem!