
fix: store graded EpisodicMemory entries as MemoryEntry objects and use correct llm instance #109

Merged
wang-boyu merged 8 commits into mesa:main from psbuilds:fix/episodic-memory
Mar 6, 2026
Conversation

@psbuilds
Contributor

@psbuilds psbuilds commented Mar 1, 2026

Summary

In EpisodicMemory, graded entries were calculated but never stored as MemoryEntry objects, causing get_prompt_ready() to return an empty memory context to the LLM. Also fixed the grading methods, which used the wrong LLM instance.

Bug / Issue

Fixes: #108

add_to_memory never creates MemoryEntry objects:

add_to_memory() grades each event via an LLM call, but only stores the result in step_content (via super().add_to_memory()), which is immediately cleared. No MemoryEntry is ever created, so memory_entries stays empty and retrieve_top_k_entries() / get_prompt_ready() return nothing to the LLM.

Wrong LLM instance for grading:
grade_event_importance() sets the system prompt on self.llm but calls self.agent.llm.generate(). This means the grading system prompt is never used and the agent's own system prompt gets silently overwritten during grading.

Implementation

episodic_memory.py:
add_to_memory() / aadd_to_memory(): now creates a MemoryEntry with the graded content and appends it to memory_entries. Uses {**content, "importance": grade} to avoid mutating the input dict.
grade_event_importance() / agrade_event_importance(): changed self.agent.llm.generate() → self.llm.generate() so the grading system prompt is actually used.
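The fixed storage flow can be sketched as follows. This is a minimal illustration, not the actual mesa-llm code: MemoryEntry is reduced to a stand-in with only the fields discussed here, and the grading LLM call is replaced by a plain grade argument.

```python
from dataclasses import dataclass

# Minimal stand-in for illustration; the real MemoryEntry class in
# mesa-llm has more fields than shown here.
@dataclass
class MemoryEntry:
    content: dict
    step: int

def add_to_memory(memory_entries, content, step, grade):
    """Sketch of the fix: build a graded copy of the event and store it.

    {**content, "importance": grade} creates a new dict, so the
    caller's `content` is never mutated.
    """
    entry = MemoryEntry(content={**content, "importance": grade}, step=step)
    memory_entries.append(entry)
    return entry

entries = []
event = {"observation": "saw a neighbor"}
add_to_memory(entries, event, step=3, grade=4)
assert event == {"observation": "saw a neighbor"}   # input dict not mutated
assert entries[0].content["importance"] == 4        # graded entry is stored
```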

Testing

The existing test files were updated to verify the fixed behaviour.

@coderabbitai
Contributor

coderabbitai bot commented Mar 1, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 15ade4f5-f8b0-4894-be5c-ed46d699de10


@codecov

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.45%. Comparing base (f888a0a) to head (e124ec8).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
+ Coverage   90.08%   90.45%   +0.36%     
==========================================
  Files          19       19              
  Lines        1503     1540      +37     
==========================================
+ Hits         1354     1393      +39     
+ Misses        149      147       -2     


@wang-boyu added the bug label Mar 1, 2026
@psbuilds
Contributor Author

psbuilds commented Mar 2, 2026

Hey @colinfrisch @sanika-n @wang-boyu, while working on fixes for this file I noticed that the grading logic in episodic memory currently looks like this:

top_list = sorted(
    self.memory_entries,
    key=lambda x: x.content["importance"] - (self.agent.model.steps - x.step),
    reverse=True,
)

which computes the number of steps since an event happened and subtracts that linear time penalty from the importance score.

The issue is that since importance is on a fixed 1-5 scale, this linear penalty quickly overwhelms the score. For example, a Critical memory (score 5) from just 10 steps ago results in a final score of 5 − 10 = −5, making it less "prominent" than a completely Irrelevant memory (score 1) from the current step (final score 1 − 0 = 1).

In practice, this means high-importance memories are effectively "forgotten" by the retrieval logic almost immediately.
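The effect can be reproduced in a few lines, using the ranking key from the snippet above and the values from the example (plain dicts stand in for MemoryEntry objects here):

```python
# Two memories, assuming the current step is 10.
current_step = 10

critical_old = {"importance": 5, "step": 0}     # Critical, 10 steps ago
irrelevant_new = {"importance": 1, "step": 10}  # Irrelevant, this step

def score(m):
    # Same formula as the sort key: importance minus linear age penalty.
    return m["importance"] - (current_step - m["step"])

assert score(critical_old) == -5   # 5 - 10
assert score(irrelevant_new) == 1  # 1 - 0
# The linear penalty ranks the fresh irrelevant memory above the critical one.
```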

Following the "Generative Agents" paper, I suggest we move toward a normalized/weighted scoring approach, using exponential decay for recency and scaling importance to a [0, 1] range. This would ensure that vital memories remain retrievable over longer periods.

Does this look like something we should address? I'm happy to fix this with a more robust scoring implementation.
(should it be in this PR or as a separate one?)

@wang-boyu
Member

We can probably do it together in this PR.

Two questions -

  • Is importance a key in memory entry content, or one level below inside content[content_type]? I'm not sure whether this works as expected:

    key=lambda x: x.content["importance"] - (self.agent.model.steps - x.step)
  • I guess we were trying to use self.agent.model.steps - x.step as some sort of recency measure? The paper uses

    • recency score that decays exponentially with a factor of 0.995
    • importance score that ranges from 1 to 10, later min-max scaled to [0, 1]
    • there's also a relevance score that changes according to an incoming query

    Then the final overall ranking (called "retrieval score") is computed as the sum of these three. They are added as a weighted sum controlled by some $\alpha$ coefficients, but they are all set to 1 so it's just a simple sum.

    I suppose the purpose of the min-max scaling on importance is to match the range of other scores, so that it does not outweigh them.

@psbuilds
Contributor Author

psbuilds commented Mar 3, 2026

We can probably do it together in this PR.

Two questions -

  • Is importance a key in memory entry content, or one level below inside content[content_type]? I'm not sure whether this works as expected:

Based on my understanding, the memory entries in this file are created in a way that suggests 'importance' is a key directly inside MemoryEntry.content, i.e. MemoryEntry.content['importance']

key=lambda x: x.content["importance"] - (self.agent.model.steps - x.step)
  • I guess we were trying to use self.agent.model.steps - x.step as some sort of recency measure? The paper uses

    • recency score that decays exponentially with a factor of 0.995

    • importance score that ranges from 1 to 10, later min-max scaled to [0, 1]

    • there's also a relevance score that changes according to an incoming query

    Then the final overall ranking (called "retrieval score") is computed as the sum of these three. They are added as a weighted sum controlled by some $\alpha$ coefficients, but they are all set to 1 so it's just a simple sum.

    I suppose the purpose of the min-max scaling on importance is to match the range of other scores, so that it does not outweigh them.

Yup that's exactly the purpose, but additionally,

I suggest we implement methods to calculate all the 3 factors and refactor the entire grading logic currently present.

Importance can be normalised with something of the form

importance = (raw_importance - 1) / 4

Recency can be calculated using

age = current_step - entry.step
recency = 0.995 ** age

I'm not sure how to implement the relevance logic; would you happen to have any ideas for this?

Cosine similarity could be something to start with maybe?

And then finally, the function could return the sum of these scores to produce the ranking.

@wang-boyu what do you think about this method?
If it seems good I'll proceed with it :)
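The proposal above could be sketched like this. This is only an illustration of the suggested formulas (per-entry importance scaling to [0, 1] and exponential recency decay, summed, with relevance left out for now); it is not the implementation that ended up in the PR.

```python
def retrieval_score(raw_importance, entry_step, current_step, recency_decay=0.995):
    """Sketch: normalized importance plus exponentially decayed recency."""
    importance = (raw_importance - 1) / 4        # 1..5 scale -> [0, 1]
    age = current_step - entry_step
    recency = recency_decay ** age               # exponential decay with age
    return importance + recency

# With this scoring, a Critical memory from 10 steps ago (~1.95)
# outranks a fresh Irrelevant one (1.0).
old_critical = retrieval_score(5, entry_step=0, current_step=10)
new_irrelevant = retrieval_score(1, entry_step=10, current_step=10)
assert old_critical > new_irrelevant
```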

@wang-boyu
Member

Importance can be normalised like something of the form
importance = (raw_importance - 1) / 4

Not quite. The paper uses min-max scaling in which, during retrieval, all scores are collected and scaled together: https://github.com/joonspk-research/generative_agents/blob/fe05a71d3e4ed7d10bf68aa4eda6dd995ec070f4/reverie/backend_server/persona/cognitive_modules/retrieve.py#L234

In fact they did the same scaling process for the recency and relevance scores too (a separate process for each of them).

Recency can be calculated using
age = current_step - entry.step
recency = 0.995 ** age

This looks very similar to what was implemented in the paper: https://github.com/joonspk-research/generative_agents/blob/fe05a71d3e4ed7d10bf68aa4eda6dd995ec070f4/reverie/backend_server/persona/cognitive_modules/retrieve.py#L145

Similarly we could also have an adjustable parameter such as their recency_decay with default value of 0.995.

I'm not sure of how to implement the relevance logic, would you happen to have any ideas for this ?
Cosine similarity could be something to start with maybe?
And then finally the function could return the sum as the result to produce the ranking result.

Yes, it seems a text embedding model was used, followed by a cosine similarity score: https://github.com/joonspk-research/generative_agents/blob/fe05a71d3e4ed7d10bf68aa4eda6dd995ec070f4/reverie/backend_server/persona/cognitive_modules/retrieve.py#L175

But unlike the other two scores, relevance needs a query string. Since this is something new to our EpisodicMemory, I suggest having a separate PR for it. Changes to the recency and importance scores are more of a fix to be done here.
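The scaling approach described above, where each score type is collected across all candidate entries and min-max scaled separately before summing, could be sketched as follows. The helper name and the example values are illustrative, not taken from the paper's code.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Scale a list of scores into [lo, hi]; all-equal inputs map to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:                 # avoid division by zero
        return [hi for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

# Raw scores for three hypothetical entries.
importances = [5, 1, 3]
recencies = [0.995 ** 10, 0.995 ** 0, 0.995 ** 5]

# Each score type is scaled separately, then the scaled scores are summed.
scaled_imp = min_max_scale(importances)
scaled_rec = min_max_scale(recencies)
totals = [i + r for i, r in zip(scaled_imp, scaled_rec)]
assert scaled_imp == [1.0, 0.0, 0.5]
```

Scaling each score type across the whole candidate set, rather than normalizing each entry in isolation, is what keeps a fixed 1-5 importance scale from being dominated by the other terms.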

@psbuilds
Contributor Author

psbuilds commented Mar 4, 2026

Hey @wang-boyu, hope you are doing well :)
Took a bit more time than I expected, but got it done.

Also added a private function _extract_importance() to handle both nested and flat dictionaries when retrieving the importance key.

I noticed this same issue recurring in other PRs; would you want me to add this to all memory files?

Also, may I open a new issue for handling the relevance logic? :)

@wang-boyu
Member

Thanks for the updates @psbuilds

Also added a private function _extract_importance() to handle both nested and flat dictionaries when retrieving the importance key.

I noticed this same issue recurring in other PRs; would you want me to add this to all memory files?

I'm a bit confused. Are we using importance scores in other memory types?

For this PR it seems that EpisodicMemory adds memory entries per event, unlike other memory types, where memory entries are consolidated and added per step. I've updated this PR to reflect that, but this means EpisodicMemory does not use self.step_content or do pre / post step processing at all. Before I merge, @psbuilds could you confirm whether this behavior is correct?

This may link to #137. If it is expected behavior, then EpisodicMemory doesn't have the issue of overwriting memory entries of the same type, since it's simply appending new entries for each individual event.

However, this does mean that the internal attributes are inconsistent now:

  • self.step_content for memory types other than episodic memory
  • self.memory_entries for episodic memory
  • self.long_term_memory for long term memory
  • self.short_term_memory for short term memory
  • self.short_term_memory and self.long_term_memory for the combined short-term / long-term memory

We might need to come up with consistent, unified APIs for memory retrieval. But again, not in this PR.

@psbuilds
Contributor Author

psbuilds commented Mar 6, 2026

Thanks for the updates @psbuilds

Also added a private function _extract_importance() to handle both nested and flat dictionaries when retrieving the importance key.
I noticed this same issue recurring in other PRs; would you want me to add this to all memory files?

I'm a bit confused. Are we using importance scores in other memory types?

For this PR it seems that EpisodicMemory adds memory entries per event, unlike other memory types, where memory entries are consolidated and added per step. I've updated this PR to reflect that, but this means EpisodicMemory does not use self.step_content or do pre / post step processing at all. Before I merge, @psbuilds could you confirm whether this behavior is correct?

Yup, I had noticed this while working on it; from my understanding, EpisodicMemory is designed around per-event granularity, with each individual event stored as its own discrete memory with an importance score.

Given this, I think process_step being a no-op makes sense.

This may link to #137. If it is expected behavior, then EpisodicMemory doesn't have the issue of overwriting memory entries of the same type, since it's simply appending new entries for each individual event.

Regarding #137, you're correct. Since EpisodicMemory appends each event as a separate MemoryEntry rather than using step_content[type] = content, it is not affected by the dict-key overwriting issue.

Based on this, #137 may be closed since the issue doesn't exist.

However, this does mean that the internal attributes are inconsistent now:

  • self.step_content for memory types other than episodic memory
  • self.memory_entries for episodic memory
  • self.long_term_memory for long term memory
  • self.short_term_memory for short term memory
  • self.short_term_memory and self.long_term_memory for the combined short-term / long-term memory

We might need to come up with consistent, unified APIs for memory retrieval. But again, not in this PR.

Yup, this is something that should definitely be added; happy to work on it :)

@wang-boyu
Member

#137 may still be valid for st / stlt memories. There's an open PR #165 on it.

We can think more about retrieval API for memories, but I'll merge this PR first.

Thanks for your work and the discussions so far.

@wang-boyu wang-boyu merged commit 8db52c1 into mesa:main Mar 6, 2026
13 checks passed
@psbuilds
Contributor Author

psbuilds commented Mar 7, 2026

Hey @wang-boyu, after a quick review of #165 I noticed an issue with the implementation used there: step_content is modified to accommodate either a list or a dict, so it will have dict values for some keys and list values for others. Any code that reads step_content has to check whether it is a list or a dict before doing anything with it.

I think this approach can lead to future breakage and is overly complicated. Although I believe cascading changes will be necessary for a potential fix for this issue, the mixed types in step_content add an unwanted layer of confusion.
Would love to get your thoughts on this.

I think I can come up with something better, but it may take a few days. If you're okay with that, happy to proceed :)

@wang-boyu
Member

Thanks for flagging this. I haven't had the chance to look into #165 yet, but I agree that having mixed types (either a list or a dict) in step_content would complicate usage and maintenance.

Since #165 is already open to address #137, I would suggest syncing with its author first to discuss alternatives and see if the implementation can be adjusted accordingly, so that we can avoid duplicated effort on the same task.

If it turns out that some broader refactor / fix is required, then maybe we can have a follow-up as a separate PR.


Labels

bug Release notes label


Development

Successfully merging this pull request may close these issues.

EpisodicMemory: Graded entries are never stored, as a result LLM never receives memory context

2 participants