
fix: store graded EpisodicMemory entries as MemoryEntry objects and use correct llm instance #109

Merged
wang-boyu merged 8 commits into mesa:main from psbuilds:fix/episodic-memory
Mar 6, 2026
Conversation

@psbuilds
Contributor

@psbuilds psbuilds commented Mar 1, 2026

Summary

In EpisodicMemory, graded entries were calculated but never stored as MemoryEntry objects, causing get_prompt_ready() to return an empty memory context to the LLM. Also fixed the grading methods, which used the wrong LLM instance.

Bug / Issue

Fixes: #108

add_to_memory never creates MemoryEntry objects:

add_to_memory() grades each event via an LLM call, but only stores the result in step_content (via super().add_to_memory()), which is immediately cleared. No MemoryEntry is ever created, so memory_entries stays empty and retrieve_top_k_entries() / get_prompt_ready() return nothing to the LLM.

Wrong LLM instance for grading:
grade_event_importance() sets the system prompt on self.llm but calls self.agent.llm.generate(). This means the grading system prompt is never used and the agent's own system prompt gets silently overwritten during grading.

Implementation

episodic_memory.py:
add_to_memory() / aadd_to_memory(): now creates a MemoryEntry with the graded content and appends it to memory_entries. Uses {**content, "importance": grade} to avoid mutating the input dict.
grade_event_importance() / agrade_event_importance(): changed self.agent.llm.generate() → self.llm.generate() so the grading system prompt is actually used.
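The fixed storage flow can be sketched as follows. This is a minimal illustration, not the actual mesa-llm code: MemoryEntry is reduced to a stand-in with only the fields discussed here, and the grading LLM call is replaced by a plain grade argument.

```python
from dataclasses import dataclass

# Minimal stand-in for illustration; the real MemoryEntry class in
# mesa-llm has more fields than shown here.
@dataclass
class MemoryEntry:
    content: dict
    step: int

def add_to_memory(memory_entries, content, step, grade):
    """Sketch of the fix: build a graded copy of the event and store it.

    {**content, "importance": grade} creates a new dict, so the
    caller's `content` is never mutated.
    """
    entry = MemoryEntry(content={**content, "importance": grade}, step=step)
    memory_entries.append(entry)
    return entry

entries = []
event = {"observation": "saw a neighbor"}
add_to_memory(entries, event, step=3, grade=4)
assert event == {"observation": "saw a neighbor"}   # input dict not mutated
assert entries[0].content["importance"] == 4        # graded entry is stored
```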

Testing

The existing test files were updated to verify the fixed behaviour.

@coderabbitai
Contributor

coderabbitai bot commented Mar 1, 2026

Important

Review skipped

Auto reviews are disabled on this repository. Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 15ade4f5-f8b0-4894-be5c-ed46d699de10


@codecov

codecov bot commented Mar 1, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 90.45%. Comparing base (f888a0a) to head (e124ec8).
⚠️ Report is 12 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #109      +/-   ##
==========================================
+ Coverage   90.08%   90.45%   +0.36%     
==========================================
  Files          19       19              
  Lines        1503     1540      +37     
==========================================
+ Hits         1354     1393      +39     
+ Misses        149      147       -2     


@wang-boyu added the bug label Mar 1, 2026
@psbuilds
Contributor Author

psbuilds commented Mar 2, 2026

Hey @colinfrisch @sanika-n @wang-boyu, while working on fixes for this file I noticed that the grading logic in episodic memory currently looks like this:

top_list = sorted(
    self.memory_entries,
    key=lambda x: x.content["importance"] - (self.agent.model.steps - x.step),
    reverse=True,
)

which computes the number of steps since an event happened and subtracts that linear time penalty from the importance score.

The issue is that since importance is on a fixed 1-5 scale, this linear penalty quickly overwhelms the score. For example, a Critical memory (score 5) from just 10 steps ago results in a final score of 5 − 10 = −5, making it less "prominent" than a completely Irrelevant memory (score 1) from the current step (final score 1 − 0 = 1).

In practice, this means high-importance memories are effectively "forgotten" by the retrieval logic almost immediately.
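The effect can be reproduced in a few lines, using the ranking key from the snippet above and the values from the example (plain dicts stand in for MemoryEntry objects here):

```python
# Two memories, assuming the current step is 10.
current_step = 10

critical_old = {"importance": 5, "step": 0}     # Critical, 10 steps ago
irrelevant_new = {"importance": 1, "step": 10}  # Irrelevant, this step

def score(m):
    # Same formula as the sort key: importance minus linear age penalty.
    return m["importance"] - (current_step - m["step"])

assert score(critical_old) == -5   # 5 - 10
assert score(irrelevant_new) == 1  # 1 - 0
# The linear penalty ranks the fresh irrelevant memory above the critical one.
```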

Following the "Generative Agents" paper, I suggest we move toward a normalized/weighted scoring approach, using exponential decay for recency and scaling importance to a [0, 1] range. This would ensure that vital memories remain retrievable over longer periods.

Does this look like something we should address? I'm happy to fix this with a more robust scoring implementation.
(should it be in this PR or as a separate one?)

@wang-boyu
Member

We can probably do it together in this PR.

Two questions -

  • Is importance a key in memory entry content, or one level below inside content[content_type]? I'm not sure whether this works as expected:

    key=lambda x: x.content["importance"] - (self.agent.model.steps - x.step)
  • I guess we were trying to use self.agent.model.steps - x.step as some sort of recency measure? The paper uses

    • recency score that decays exponentially with a factor of 0.995
    • importance score that ranges from 1 to 10, later min-max scaled to [0, 1]
    • there's also a relevance score that changes according to an incoming query

    Then the final overall ranking (called "retrieval score") is computed as the sum of these three. They are added as a weighted sum controlled by some $\alpha$ coefficients, but they are all set to 1 so it's just a simple sum.

    I suppose the purpose of the min-max scaling on importance is to match the range of other scores, so that it does not outweigh them.

@psbuilds
Contributor Author

psbuilds commented Mar 3, 2026

We can probably do it together in this PR.

Two questions -

  • Is importance a key in memory entry content, or one level below inside content[content_type]? I'm not sure whether this works as expected:

Based on my understanding, the memory entries in this file are created in a way that suggests 'importance' is a key directly inside MemoryEntry.content, i.e. MemoryEntry.content['importance']

key=lambda x: x.content["importance"] - (self.agent.model.steps - x.step)
  • I guess we were trying to use self.agent.model.steps - x.step as some sort of recency measure? The paper uses

    • recency score that decays exponentially with a factor of 0.995

    • importance score that ranges from 1 to 10, later min-max scaled to [0, 1]

    • there's also a relevance score that changes according to an incoming query

    Then the final overall ranking (called "retrieval score") is computed as the sum of these three. They are added as a weighted sum controlled by some $\alpha$ coefficients, but they are all set to 1 so it's just a simple sum.

    I suppose the purpose of the min-max scaling on importance is to match the range of other scores, so that it does not outweigh them.

Yup that's exactly the purpose, but additionally,

I suggest we implement methods to calculate all the 3 factors and refactor the entire grading logic currently present.

Importance can be normalised with something of the form

importance = (raw_importance - 1) / 4

Recency can be calculated using

age = current_step - entry.step
recency = 0.995 ** age

I'm not sure how to implement the relevance logic; would you happen to have any ideas for this?

Cosine similarity could be something to start with maybe?

And then finally, the function could return the sum of these scores to produce the ranking.

@wang-boyu what do you think about this method?
If it seems good I'll proceed with it :)
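The proposal above could be sketched like this. This is only an illustration of the suggested formulas (per-entry importance scaling to [0, 1] and exponential recency decay, summed, with relevance left out for now); it is not the implementation that ended up in the PR.

```python
def retrieval_score(raw_importance, entry_step, current_step, recency_decay=0.995):
    """Sketch: normalized importance plus exponentially decayed recency."""
    importance = (raw_importance - 1) / 4        # 1..5 scale -> [0, 1]
    age = current_step - entry_step
    recency = recency_decay ** age               # exponential decay with age
    return importance + recency

# With this scoring, a Critical memory from 10 steps ago (~1.95)
# outranks a fresh Irrelevant one (1.0).
old_critical = retrieval_score(5, entry_step=0, current_step=10)
new_irrelevant = retrieval_score(1, entry_step=10, current_step=10)
assert old_critical > new_irrelevant
```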

@wang-boyu
Member

Importance can be normalised like something of the form
importance = (raw_importance - 1) / 4

Not quite. The paper uses min-max scaling in which, during retrieval, all scores are collected and scaled together: https://github.com/joonspk-research/generative_agents/blob/fe05a71d3e4ed7d10bf68aa4eda6dd995ec070f4/reverie/backend_server/persona/cognitive_modules/retrieve.py#L234

In fact they did the same scaling process for the recency and relevance scores too (a separate process for each of them).

Recency can be calculated using
age = current_step - entry.step
recency = 0.995 ** age

This looks very similar to what was implemented in the paper: https://github.com/joonspk-research/generative_agents/blob/fe05a71d3e4ed7d10bf68aa4eda6dd995ec070f4/reverie/backend_server/persona/cognitive_modules/retrieve.py#L145

Similarly we could also have an adjustable parameter such as their recency_decay with default value of 0.995.

I'm not sure of how to implement the relevance logic, would you happen to have any ideas for this ?
Cosine similarity could be something to start with maybe?
And then finally the function could return the sum as the result to produce the ranking result.

Yes, it seems a text embedding model was used, followed by a cosine similarity score: https://github.com/joonspk-research/generative_agents/blob/fe05a71d3e4ed7d10bf68aa4eda6dd995ec070f4/reverie/backend_server/persona/cognitive_modules/retrieve.py#L175

But unlike the other two scores, relevance needs a query string. Since this is something new to our EpisodicMemory, I suggest having a separate PR for it. Changes to the recency and importance scores are more of a fix to be done here.
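The scaling approach described above, where each score type is collected across all candidate entries and min-max scaled separately before summing, could be sketched as follows. The helper name and the example values are illustrative, not taken from the paper's code.

```python
def min_max_scale(values, lo=0.0, hi=1.0):
    """Scale a list of scores into [lo, hi]; all-equal inputs map to hi."""
    v_min, v_max = min(values), max(values)
    if v_max == v_min:                 # avoid division by zero
        return [hi for _ in values]
    span = v_max - v_min
    return [lo + (v - v_min) * (hi - lo) / span for v in values]

# Raw scores for three hypothetical entries.
importances = [5, 1, 3]
recencies = [0.995 ** 10, 0.995 ** 0, 0.995 ** 5]

# Each score type is scaled separately, then the scaled scores are summed.
scaled_imp = min_max_scale(importances)
scaled_rec = min_max_scale(recencies)
totals = [i + r for i, r in zip(scaled_imp, scaled_rec)]
assert scaled_imp == [1.0, 0.0, 0.5]
```

Scaling each score type across the whole candidate set, rather than normalizing each entry in isolation, is what keeps a fixed 1-5 importance scale from being dominated by the other terms.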

@psbuilds
Contributor Author

psbuilds commented Mar 4, 2026

Hey @wang-boyu, hope you are doing well :)
Took a bit more time than I expected, but got it done.

Also added a private function _extract_importance() to handle both nested and flat dictionaries when retrieving the importance key.

I noticed this same issue recurring in other PRs; would you want me to add this to all memory files?

Also, may I open a new issue for handling the relevance logic? :)

@wang-boyu
Member

Thanks for the updates @psbuilds

Also added a private function _extract_importance() to handle both nested and flat dictionaries when retrieving the importance key.

I noticed this same issue recurring in other PRs; would you want me to add this to all memory files?

I'm a bit confused. Are we using importance scores in other memory types?

For this PR it seems that EpisodicMemory adds memory entries per event, unlike other memory types, where memory entries are consolidated and added per step. I've updated this PR to reflect that, but this means EpisodicMemory does not use self.step_content or do pre / post step processing at all. Before I merge, @psbuilds could you confirm whether this behavior is correct?

This may link to #137. If it is expected behavior, then EpisodicMemory doesn't have the issue of overwriting memory entries of the same type, since it's simply appending new entries for each individual event.

However, this does mean that the internal attributes are inconsistent now:

  • self.step_content for memory types other than episodic memory
  • self.memory_entries for episodic memory
  • self.long_term_memory for long term memory
  • self.short_term_memory for short term memory
  • self.short_term_memory and self.long_term_memory for the combined short-term / long-term memory

We might need to come up with consistent, unified APIs for memory retrieval. But again, not in this PR.

@psbuilds
Contributor Author

psbuilds commented Mar 6, 2026

Thanks for the updates @psbuilds

Also added a private function _extract_importance() to handle both nested and flat dictionaries when retrieving the importance key.
I noticed this same issue recurring in other PRs; would you want me to add this to all memory files?

I'm a bit confused. Are we using importance scores in other memory types?

For this PR it seems that EpisodicMemory adds memory entries per event, unlike other memory types, where memory entries are consolidated and added per step. I've updated this PR to reflect that, but this means EpisodicMemory does not use self.step_content or do pre / post step processing at all. Before I merge, @psbuilds could you confirm whether this behavior is correct?

Yup, I had noticed this while working on it; from my understanding, EpisodicMemory is designed around per-event granularity, with each individual event stored as its own discrete memory with an importance score.

Given this, I think process_step being a no-op makes sense.

This may link to #137. If it is expected behavior, then EpisodicMemory doesn't have the issue of overwriting memory entries of the same type, since it's simply appending new entries for each individual event.

Regarding #137, you're correct. Since EpisodicMemory appends each event as a separate MemoryEntry rather than using step_content[type] = content, it is not affected by the dict-key overwriting issue.

Based on this, #137 may be closed since the issue doesn't exist.

However, this does mean that the internal attributes are inconsistent now:

  • self.step_content for memory types other than episodic memory
  • self.memory_entries for episodic memory
  • self.long_term_memory for long term memory
  • self.short_term_memory for short term memory
  • self.short_term_memory and self.long_term_memory for the combined short-term / long-term memory

We might need to come up with consistent, unified APIs for memory retrieval. But again, not in this PR.

Yup, this is something that should definitely be added; happy to work on it :)

@wang-boyu
Member

#137 may still be valid for st / stlt memories. There's an open PR #165 on it.

We can think more about retrieval API for memories, but I'll merge this PR first.

Thanks for your work and the discussions so far.

@wang-boyu wang-boyu merged commit 8db52c1 into mesa:main Mar 6, 2026
13 checks passed
@psbuilds
Contributor Author

psbuilds commented Mar 7, 2026

Hey @wang-boyu, after a quick review of #165 I noticed an issue with the implementation used there: step_content is modified to accommodate either a list or a dict, so it will have dict values for some keys and list values for others. Any code that reads step_content has to check whether it is a list or a dict before doing anything with it.

I think this approach can lead to future breakage and is overly complicated. Although I believe cascading changes will be necessary for a potential fix for this issue, the mixed types in step_content add an unwanted layer of confusion.
Would love to get your thoughts on this.

I think I can come up with something better, but it may take a few days. If you're okay with that, happy to proceed :)

@wang-boyu
Member

Thanks for flagging this. I haven't had the chance to look into #165 yet, but I agree that having mixed types (either a list or a dict) in step_content would complicate usage and maintenance.

Since #165 is already open to address #137, I would suggest syncing with its author first to discuss alternatives and see if the implementation can be adjusted accordingly, so that we can avoid duplicated effort on the same task.

If it turns out that some broader refactor / fix is required, then maybe we can have a follow-up as a separate PR.


Labels

bug Release notes label


Development

Successfully merging this pull request may close these issues.

EpisodicMemory: Graded entries are never stored, as a result LLM never receives memory context

2 participants