
bug: apply_plan() flattens multiple tool call results into one dict, silently losing all but the last #142

@khushiiagrawal

Description


Describe the bug

When the LLM returns multiple tool calls in a single plan, apply_plan() (and aapply_plan()) uses a double-loop dict comprehension to merge all tool call results into a single flat dict before passing it to add_to_memory(). Since every tool result has the same keys (name and response), later entries overwrite earlier ones: only the last tool call's result is stored in memory, and the rest are permanently lost.

This is different from #137, which is about step_content overwriting same-type entries. This bug destroys the data before it even reaches add_to_memory().

File: mesa_llm/llm_agent.py, lines 117–125 (sync) and 92–100 (async)

self.memory.add_to_memory(
    type="action",
    content={
        k: v
        for tool_call in tool_call_resp
        for k, v in tool_call.items()
        if k not in ["tool_call_id", "role"]
    },
)

Each tool result from ToolManager._process_tool_call() returns {"tool_call_id": ..., "role": ..., "name": ..., "response": ...}. After filtering, every result has exactly name and response, so the dict comprehension just keeps overwriting.
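The collision can be demonstrated in isolation; the snippet below uses the same double-loop comprehension quoted above with the two-result example from the repro:

```python
# Two tool results, as returned by ToolManager._process_tool_call()
tool_call_resp = [
    {"tool_call_id": "1", "role": "tool", "name": "move_one_step", "response": "agent moved to (3, 4)"},
    {"tool_call_id": "2", "role": "tool", "name": "arrest_citizen", "response": "Citizen 12 arrested"},
]

# Same comprehension as in apply_plan(): after filtering, both results
# contribute the keys "name" and "response", so the second overwrites the first
merged = {
    k: v
    for tool_call in tool_call_resp
    for k, v in tool_call.items()
    if k not in ["tool_call_id", "role"]
}

print(merged)
# {'name': 'arrest_citizen', 'response': 'Citizen 12 arrested'}
```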

Expected behavior

All executed tool call results should be preserved in the agent's memory. If the LLM decides to both move and arrest in one step, the agent should remember both actions, not just the arrest.

To Reproduce

import os

from mesa.model import Model
from mesa.space import MultiGrid

from mesa_llm.llm_agent import LLMAgent
from mesa_llm.memory.st_memory import ShortTermMemory
from mesa_llm.reasoning.react import ReActReasoning
from mesa_llm.reasoning.reasoning import Plan

os.environ["GEMINI_API_KEY"] = "dummy"

model = Model(seed=42)
model.grid = MultiGrid(5, 5, torus=False)
agent = LLMAgent(model=model, reasoning=ReActReasoning, vision=-1)
agent.memory = ShortTermMemory(agent=agent, n=5, display=False)

# Simulate LLM returning 2 tool calls
fake_response = [
    {"tool_call_id": "1", "role": "tool", "name": "move_one_step",   "response": "agent moved to (3, 4)"},
    {"tool_call_id": "2", "role": "tool", "name": "arrest_citizen",  "response": "Citizen 12 arrested"},
]

agent.tool_manager.call_tools = lambda agent, llm_response: fake_response
plan = Plan(step=0, llm_plan="do something")
agent.apply_plan(plan)

print(agent.memory.step_content)
# {'action': {'name': 'arrest_citizen', 'response': 'Citizen 12 arrested'}}
# move_one_step is gone

Additional context

This affects all three reasoning strategies (CoT, ReAct, ReWOO) and all existing example models — Epstein Civil Violence (["move_one_step", "arrest_citizen"]), Negotiation (["teleport_to_location", "speak_to", "buy_product"]), Sugarscape (["move_to_best_resource", "propose_trade"]). Any time the LLM decides to call more than one tool, the agent's memory is incomplete.

The existing test test_apply_plan_adds_to_memory only uses a single-item fake response, so it never triggers this.

A possible fix would be to store tool results as a list:

self.memory.add_to_memory(
    type="action",
    content={
        "tool_calls": [
            {k: v for k, v in tc.items() if k not in ["tool_call_id", "role"]}
            for tc in tool_call_resp
        ]
    },
)
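Applied to the two-result example from the repro, the list-based structure preserves both entries (standalone sketch of just the content-building step, outside the class):

```python
# Same fake tool results as in the repro above
tool_call_resp = [
    {"tool_call_id": "1", "role": "tool", "name": "move_one_step", "response": "agent moved to (3, 4)"},
    {"tool_call_id": "2", "role": "tool", "name": "arrest_citizen", "response": "Citizen 12 arrested"},
]

# Proposed structure: one filtered dict per tool call, kept in a list
content = {
    "tool_calls": [
        {k: v for k, v in tc.items() if k not in ["tool_call_id", "role"]}
        for tc in tool_call_resp
    ]
}

print(content)
# {'tool_calls': [{'name': 'move_one_step', 'response': 'agent moved to (3, 4)'},
#                 {'name': 'arrest_citizen', 'response': 'Citizen 12 arrested'}]}
```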

Both apply_plan() and aapply_plan() need the same fix, plus MemoryEntry.__str__() would need to handle the new structure for display.
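For the display side, one option is a small helper that accepts both the old flat shape and the new list shape, so existing memory entries keep rendering. This is a standalone sketch; the function name format_action_content is hypothetical, and the real MemoryEntry.__str__() may format things differently:

```python
def format_action_content(content: dict) -> str:
    """Render an "action" memory entry as one "- name: response" line per
    tool call. Handles both the proposed {"tool_calls": [...]} list shape
    and the legacy flat {"name": ..., "response": ...} shape."""
    if "tool_calls" in content:
        return "\n".join(
            f"- {tc.get('name', '?')}: {tc.get('response', '')}"
            for tc in content["tool_calls"]
        )
    # Legacy flat dict: a single tool result
    return f"- {content.get('name', '?')}: {content.get('response', '')}"


print(format_action_content({
    "tool_calls": [
        {"name": "move_one_step", "response": "agent moved to (3, 4)"},
        {"name": "arrest_citizen", "response": "Citizen 12 arrested"},
    ]
}))
# - move_one_step: agent moved to (3, 4)
# - arrest_citizen: Citizen 12 arrested
```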

@wang-boyu @sanika-n would appreciate your views on this. Thanks!
