Fix non-terminating stash retrieve loop on explicit working-memory reads #467

Merged
rockfordlhotka merged 1 commit into main from fix/stash-retrieval-loop on Jun 10, 2026
@rockfordlhotka

Problem

The 10 a.m. communications briefing cron job on 2026-06-10 ran for 15 minutes and timed out. Investigation of the live k8s cluster showed that calendar-mcp was healthy and fast: yesterday's IMAP fix (#69, image 1.4.1) works, and email/calendar calls complete in 1–6s. The cause of the timeout was a non-terminating stash retrieve loop in the agent host.

When a tool result is too large, the host trims it (head + elision marker + tail) and stashes the full original in working memory under stash/{session}/{callId}, telling the model to fetch it via GetFromWorkingMemory. But the retrieval result was itself oversized, so the per-call cap (CapToolResultAsync) and the watermark trimmer (ToolResultTrimmer) re-stashed it under the retrieval call's new id and advertised that new key back to the model. The model fetched the new key and got back a larger reference, which was re-stashed in turn. The loop ran at ~35s per iteration until the budget killed it:

GetFromWorkingMemory(stash/.../call_iz7u7s2) -> big -> re-stashed as call_V5QImO2
GetFromWorkingMemory(stash/.../call_V5QImO2) -> big -> re-stashed as call_Qm6TZL
GetFromWorkingMemory(stash/.../call_Qm6TZL)  -> ... (until 15-min timeout)

The earlier llm-high-tier-cost-guard subagent hit the identical loop.
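
To make the mechanism concrete, here is a minimal Python model of the pre-fix behavior. It is not the host's C# code: the cap of 4,000 characters, the head/tail sizes, the marker text, and the helper names (cap_tool_result, get_from_working_memory, run_model) are all assumptions chosen for illustration. Only the stash/{session}/{callId} key shape and the ~15k-char payload size come from this PR.

```python
import itertools

CAP = 4_000        # assumed per-call character cap; the real limit is not stated in the PR
HEAD = TAIL = 500  # assumed head/tail sizes kept when trimming

working_memory = {}              # stash key -> full original text
_call_ids = itertools.count(1)   # hypothetical call-id generator


def cap_tool_result(session, call_id, text):
    """Pre-fix behavior: trim ANY oversized result and stash the original."""
    if len(text) <= CAP:
        return text
    key = f"stash/{session}/{call_id}"
    working_memory[key] = text
    marker = f"\n[... elided; full result at {key} ...]\n"
    return text[:HEAD] + marker + text[-TAIL:]


def get_from_working_memory(session, key):
    """Retrieval tool call; pre-fix, its result went through the same cap."""
    call_id = f"call_{next(_call_ids)}"
    return call_id, cap_tool_result(session, call_id, working_memory[key])


def run_model(session, first_key, max_iterations=5):
    """A model that keeps following whichever stash key it was last told about."""
    key, trace = first_key, []
    for _ in range(max_iterations):
        call_id, result = get_from_working_memory(session, key)
        trace.append(key)
        if "full result at" not in result:
            return trace, True                  # received the full payload
        key = f"stash/{session}/{call_id}"      # follow the newly advertised key
    return trace, False                         # budget exhausted, no progress


payload = "x" * 15_000                          # stands in for the email payload
cap_tool_result("s1", "call_0", payload)        # original oversized tool result
trace, done = run_model("s1", "stash/s1/call_0")
```

In this model every retrieval produces a fresh stash key and the full payload never reaches the caller, so only the iteration budget ends the run. That matches the shape of the trace above.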

Fix

ChunkingAIFunction already exempted the working-memory read tools (GetFromWorkingMemory/SearchWorkingMemory/ListWorkingMemory) from re-chunking for exactly this reason. This PR centralizes that exemption in a shared StashExemptTools set and honors it in all three paths:

  • StashExemptTools (new) — single source of truth.
  • ChunkingAIFunction — uses the shared set (removed private duplicate).
  • CapToolResultAsync — returns explicit-retrieval results unchanged.
  • ToolResultTrimmer.TrimAsync — skips exempt results when picking the largest result to trim.

An explicit retrieval is now always returned in full and never re-stashed.
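
The exemption can be sketched as follows, again as a Python model rather than the actual C# change. The three tool names come from this PR. The cap value, the function signatures, the marker text, and the FetchEmail tool name are hypothetical.

```python
# Tool names are taken from the PR; everything else is an illustrative assumption.
STASH_EXEMPT_TOOLS = frozenset({
    "GetFromWorkingMemory",
    "SearchWorkingMemory",
    "ListWorkingMemory",
})

CAP = 4_000        # assumed per-call character cap
HEAD = TAIL = 500  # assumed head/tail sizes kept when trimming


def is_stash_exempt(tool_name):
    """Single check shared by the cap and trim paths."""
    return tool_name in STASH_EXEMPT_TOOLS


def cap_tool_result(tool_name, key, text, stash):
    """Return exempt or small results unchanged; trim and stash the rest."""
    if is_stash_exempt(tool_name) or len(text) <= CAP:
        return text
    stash[key] = text
    marker = f"\n[... elided; full result at {key} ...]\n"
    return text[:HEAD] + marker + text[-TAIL:]


def pick_result_to_trim(results):
    """Given (tool_name, text) pairs, return the index of the largest
    non-exempt result, or None if every result is exempt."""
    candidates = [
        (len(text), i)
        for i, (name, text) in enumerate(results)
        if not is_stash_exempt(name)
    ]
    return max(candidates)[1] if candidates else None


stash = {}
big = "x" * 15_000
# "FetchEmail" is a hypothetical non-exempt tool name.
trimmed = cap_tool_result("FetchEmail", "stash/s1/call_0", big, stash)
full = cap_tool_result(
    "GetFromWorkingMemory", "stash/s1/call_1", stash["stash/s1/call_0"], stash
)
choice = pick_result_to_trim(
    [("GetFromWorkingMemory", big), ("FetchEmail", "y" * 6_000)]
)
```

Here the retrieval returns the whole payload and adds no new stash entry, and the trimmer model picks the smaller FetchEmail result because the larger one is exempt. Keeping the check in one shared set means a new working-memory read tool only has to be registered once.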

Tests

Added 4 regression tests (with [Timeout] guards mirroring the real loop). RockBot.Host.Tests: 1061 passed, 0 failed.

Deployment

Version bumped 0.12.29 -> 0.12.30. Image rockylhotka/rockbot-agent:0.12.30 built, pushed, and deployed to the live rockbot namespace via kubectl set image for testing; calendar-mcp confirms Client (RockBot.Agent 0.12.30.0) is live.

The per-call tool-result cap (CapToolResultAsync) and the watermark trimmer
(ToolResultTrimmer) re-stashed the result of an explicit GetFromWorkingMemory
retrieval under the retrieval call's new id, then advertised that new key back
to the model. The model fetched it, got a slightly larger reference, which was
re-stashed again -- a retrieve->re-stash->retrieve loop that made no progress
until the iteration/timeout budget killed it. Observed 2026-06-10: a
communications-briefing subagent burned its full 15-minute budget this way after
pulling a ~15k-char multi-account email payload.

ChunkingAIFunction already exempted these working-memory read tools from
re-chunking for the same reason. Centralize that exemption in a shared
StashExemptTools set and honor it in all three paths (chunk, cap, trim) so an
explicit retrieval is always returned in full and never re-stashed.

Bump version to 0.12.30.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@rockfordlhotka merged commit 2447dd4 into main on Jun 10, 2026
2 checks passed
@rockfordlhotka deleted the fix/stash-retrieval-loop branch on June 10, 2026 18:40