[Architecture RFC] Resolving Step Latency and Implementing the "Pending" Embedding Retrieval in Memory Modules #190
Disclaimer: Following Jackie and Ewout's notes on maintainer bandwidth, this is strictly a conceptual RFC. No PRs will be opened until the community has time to breathe and align.
1. Motivation: The Latency Bottleneck Across Memory Implementations
As `mesa-llm` pushes toward production stability, managing the context window without exploding API latency is a primary friction point. After diving deep into the current memory implementations, I noticed two distinct $O(N)$ LLM-call bottlenecks during the `step()` phase:
- `_st_lt_memory`: `_update_long_term_memory()` executes a blocking, sequential LLM call to summarize and consolidate data whenever the short-term deque fills up.
- `EpisodicMemory`: while implementing a brilliant Generative Agents architecture, `grade_event_importance()` currently triggers an LLM generation call for every single memory entry added, just to evaluate its importance score.

Both approaches cause compounding step latency and high token costs in long-running, multi-agent simulations.
2. Conceptual Design: Fulfilling the "Pending" Relevance Retrieval
I noticed a crucial design note left in `episodic_memory.py` regarding the Top-K retrieval formula: the relevance (embedding) term is the piece still marked as pending. Drawing from my background in building academic RAG systems, and building directly on @BhoomiAgrawal12's excellent recent discussion (#189) about decoupling memory instantiation, I propose completing this architecture. The goal is to shift from LLM-based maintenance (summarization/grading) to local vector retrieval (RAG).
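For context (my reading of that note, not a verbatim quote): the Generative Agents-style retrieval score referenced there is typically a weighted sum of three terms, and the relevance term, cosine similarity between a query embedding and each memory's embedding, is exactly the piece this RFC would supply:

$$\text{score}(m) = \alpha_{\text{rec}}\,\text{recency}(m) + \alpha_{\text{imp}}\,\text{importance}(m) + \alpha_{\text{rel}}\,\text{relevance}(m)$$

The Top-K memories by this score would then be injected into the step prompt.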
Core Mechanisms:
- By converting the current `Observation` (or step prompt) into a query embedding, the system would dynamically compute the cosine similarity against historical memory embeddings to fetch only the strictly relevant context.
- During `process_step`, the module would simply convert the stringified `MemoryEntry` into a lightweight embedding (e.g., utilizing `litellm`'s existing embedding capabilities). This would drop the memory-maintenance API latency to near zero.
- Embeddings would be stored as a `numpy.ndarray`. The cosine similarity for the "Relevance" score could then be instantly calculated via `numpy.dot()`.

A rough sketch of this retrieval path is shown below.
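To make this concrete, here is a minimal sketch of the pure-NumPy store and Top-K retrieval path. It assumes nothing about existing `mesa-llm` internals: `VectorStore`, `add`, `top_k`, and the `embed()` helper are all placeholder names, and the `litellm` call shown in the docstring is only one possible backend.

```python
import numpy as np


def embed(text: str) -> np.ndarray:
    """Placeholder embedding backend. A litellm-based version might look like:
        resp = litellm.embedding(model="text-embedding-3-small", input=[text])
        return np.array(resp.data[0]["embedding"])
    Swap in whatever backend mesa-llm standardizes on."""
    raise NotImplementedError


class VectorStore:
    """In-memory embedding store: one matrix row per stringified MemoryEntry."""

    def __init__(self, dim: int):
        self._matrix = np.empty((0, dim), dtype=np.float32)  # shape: (n_entries, dim)
        self._entries: list[str] = []                        # stringified MemoryEntry texts

    def add(self, entry_text: str) -> None:
        """Embed the entry once at write time; no LLM grading/summarization call."""
        vec = embed(entry_text).astype(np.float32)
        vec /= np.linalg.norm(vec) + 1e-12                   # pre-normalize for cosine similarity
        self._matrix = np.vstack([self._matrix, vec])
        self._entries.append(entry_text)

    def top_k(self, query_text: str, k: int = 5) -> list[tuple[str, float]]:
        """Return the k most relevant entries for the current Observation/step prompt."""
        if not self._entries:
            return []
        q = embed(query_text).astype(np.float32)
        q /= np.linalg.norm(q) + 1e-12
        scores = self._matrix @ q                             # cosine similarity via one matrix-vector product
        idx = np.argsort(scores)[::-1][:k]
        return [(self._entries[i], float(scores[i])) for i in idx]
```

With pre-normalized rows, retrieval per step is a single matrix-vector product plus an `argsort`, with no LLM call on the hot path.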
3. Memory Safety & Scale
For those reasonably concerned about storing embeddings in-memory during extremely long simulations (OOM risk):
- A `max_capacity` parameter in `EpisodicMemory` (or a dedicated `VectorRAGMemory`) would prune the NumPy matrix, ensuring a strict upper bound on memory usage (see the sketch below).
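Building on the hypothetical `VectorStore` above, `max_capacity` pruning could be as simple as FIFO eviction on the matrix rows (again, the class name, parameter, and default value are illustrative only, not existing `mesa-llm` API):

```python
class BoundedVectorStore(VectorStore):
    """VectorStore with a hard cap on stored embeddings (FIFO eviction)."""

    def __init__(self, dim: int, max_capacity: int = 2_000):
        super().__init__(dim)
        self.max_capacity = max_capacity

    def add(self, entry_text: str) -> None:
        super().add(entry_text)
        overflow = len(self._entries) - self.max_capacity
        if overflow > 0:
            # Drop the oldest rows so the matrix never exceeds max_capacity entries.
            self._matrix = self._matrix[overflow:]
            self._entries = self._entries[overflow:]
```

Eviction keeps both the matrix and the entry list at a fixed size, so memory use stays flat no matter how long the simulation runs.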
4. Open Questions for Maintainers
Whenever the team has the bandwidth to review, I'd love your thoughts:
- Should this live in a dedicated `VectorRAGMemory` class, or should the pure-NumPy embedding logic be integrated directly into `EpisodicMemory` to complete its intended Top-K retrieval design?
- Is `numpy` considered an acceptable lightweight dependency for handling these internal matrix operations within `mesa-llm`?

Thank you for your continuous hard work on the Mesa ecosystem!