Describe the bug
When the LLM returns multiple tool calls in a single plan, apply_plan() (and aapply_plan()) uses a double-loop dict comprehension to merge all tool call results into a single flat dict before passing it to add_to_memory(). Since every tool result has the same keys (name and response), later entries overwrite earlier ones. Only the last tool call's result is stored in memory; the rest are permanently lost.
This is different from #137 which is about step_content overwriting same-type entries. This bug destroys the data before it even reaches add_to_memory().
File: mesa_llm/llm_agent.py, lines 117–125 (sync) and 92–100 (async)
self.memory.add_to_memory(
type="action",
content={
k: v
for tool_call in tool_call_resp
for k, v in tool_call.items()
if k not in ["tool_call_id", "role"]
},
)

Each tool result from ToolManager._process_tool_call() returns {"tool_call_id": ..., "role": ..., "name": ..., "response": ...}. After filtering, every result has exactly the keys name and response, so the dict comprehension keeps overwriting the same two entries.
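The collapse is easy to reproduce with plain dicts, no mesa_llm required (the two fake tool results below mirror the shape described above):

```python
# Two tool results in the shape returned by ToolManager._process_tool_call().
tool_call_resp = [
    {"tool_call_id": "1", "role": "tool",
     "name": "move_one_step", "response": "agent moved to (3, 4)"},
    {"tool_call_id": "2", "role": "tool",
     "name": "arrest_citizen", "response": "Citizen 12 arrested"},
]

# The merge currently used by apply_plan(): both results share the keys
# "name" and "response", so the second assignment overwrites the first.
merged = {
    k: v
    for tool_call in tool_call_resp
    for k, v in tool_call.items()
    if k not in ["tool_call_id", "role"]
}

print(merged)
# {'name': 'arrest_citizen', 'response': 'Citizen 12 arrested'}
```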
Expected behavior
All executed tool call results should be preserved in the agent's memory. If the LLM decides to both move and arrest in one step, the agent should remember both actions, not just the arrest.
To Reproduce
from unittest.mock import MagicMock, patch
from mesa.model import Model
from mesa.space import MultiGrid
from mesa_llm.llm_agent import LLMAgent
from mesa_llm.memory.st_memory import ShortTermMemory
from mesa_llm.reasoning.react import ReActReasoning
from mesa_llm.reasoning.reasoning import Plan
import os
os.environ["GEMINI_API_KEY"] = "dummy"
model = Model(seed=42)
model.grid = MultiGrid(5, 5, torus=False)
agent = LLMAgent(model=model, reasoning=ReActReasoning, vision=-1)
agent.memory = ShortTermMemory(agent=agent, n=5, display=False)
# Simulate LLM returning 2 tool calls
fake_response = [
{"tool_call_id": "1", "role": "tool", "name": "move_one_step", "response": "agent moved to (3, 4)"},
{"tool_call_id": "2", "role": "tool", "name": "arrest_citizen", "response": "Citizen 12 arrested"},
]
agent.tool_manager.call_tools = lambda agent, llm_response: fake_response
plan = Plan(step=0, llm_plan="do something")
agent.apply_plan(plan)
print(agent.memory.step_content)
# {'action': {'name': 'arrest_citizen', 'response': 'Citizen 12 arrested'}}
# move_one_step is gone

Additional context
This affects all three reasoning strategies (CoT, ReAct, ReWOO) and all existing example models — Epstein Civil Violence (["move_one_step", "arrest_citizen"]), Negotiation (["teleport_to_location", "speak_to", "buy_product"]), Sugarscape (["move_to_best_resource", "propose_trade"]). Any time the LLM decides to call more than one tool, the agent's memory is incomplete.
The existing test test_apply_plan_adds_to_memory only uses a single-item fake response, so it never exercises this path.
A possible fix would be to store tool results as a list:
self.memory.add_to_memory(
type="action",
content={
"tool_calls": [
{k: v for k, v in tc.items() if k not in ["tool_call_id", "role"]}
for tc in tool_call_resp
]
},
)

Both apply_plan() and aapply_plan() need the same fix, and MemoryEntry.__str__() would need to handle the new structure for display.
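A quick sanity check of the list-based structure, again with plain dicts standing in for real tool results, shows both calls surviving:

```python
tool_call_resp = [
    {"tool_call_id": "1", "role": "tool",
     "name": "move_one_step", "response": "agent moved to (3, 4)"},
    {"tool_call_id": "2", "role": "tool",
     "name": "arrest_citizen", "response": "Citizen 12 arrested"},
]

# Proposed structure: one list entry per tool call, with the
# transport-level keys (tool_call_id, role) stripped out.
content = {
    "tool_calls": [
        {k: v for k, v in tc.items() if k not in ["tool_call_id", "role"]}
        for tc in tool_call_resp
    ]
}

print(content["tool_calls"])
# [{'name': 'move_one_step', 'response': 'agent moved to (3, 4)'},
#  {'name': 'arrest_citizen', 'response': 'Citizen 12 arrested'}]
```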
@wang-boyu @sanika-n would appreciate your views on this. Thanks.