Fix non-terminating stash retrieve loop on explicit working-memory reads #467
Merged
The per-call tool-result cap (CapToolResultAsync) and the watermark trimmer (ToolResultTrimmer) re-stashed the result of an explicit GetFromWorkingMemory retrieval under the retrieval call's new id, then advertised that new key back to the model. The model fetched it, got a slightly larger reference, which was re-stashed again -- a retrieve->re-stash->retrieve loop that made no progress until the iteration/timeout budget killed it.

Observed 2026-06-10: a communications-briefing subagent burned its full 15-minute budget this way after pulling a ~15k-char multi-account email payload.

ChunkingAIFunction already exempted these working-memory read tools from re-chunking for the same reason. Centralize that exemption in a shared StashExemptTools set and honor it in all three paths (chunk, cap, trim) so an explicit retrieval is always returned in full and never re-stashed.

Bump version to 0.12.30.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Problem
The 2026-06-10 10am communications briefing cron job ran for 15 minutes and timed out. Investigation of the live k8s cluster showed calendar-mcp was healthy and fast (yesterday's IMAP fix #69 / image 1.4.1 works; email/calendar calls complete in 1–6s). The timeout was a non-terminating stash retrieve loop in the agent host.

When a tool result is too large, the host trims it (head + elision marker + tail) and stashes the full original in working memory under stash/{session}/{callId}, telling the model to fetch it via GetFromWorkingMemory. But the retrieval result was itself oversized, so the per-call cap (CapToolResultAsync) and watermark trimmer (ToolResultTrimmer) re-stashed it under the retrieval call's new id and advertised that key back. The model fetched it, got a larger reference, which was re-stashed again, looping at ~35s/iteration until the budget killed it.

The earlier llm-high-tier-cost-guard subagent hit the identical loop.

Fix
ChunkingAIFunction already exempted the working-memory read tools (GetFromWorkingMemory / SearchWorkingMemory / ListWorkingMemory) from re-chunking for exactly this reason. This PR centralizes that exemption in a shared StashExemptTools set and honors it in all three paths:

- StashExemptTools (new): single source of truth.
- ChunkingAIFunction: uses the shared set (removed private duplicate).
- CapToolResultAsync: returns explicit-retrieval results unchanged.
- ToolResultTrimmer.TrimAsync: skips exempt results when picking the largest result to trim.

An explicit retrieval is now always returned in full and never re-stashed.
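
For readers who want the shape of the exemption without opening the diff, here is a minimal, language-neutral sketch in Python (not the project's C# code). The three exempt tool names and the stash/{session}/{callId} key format come from the description above. Everything else is an assumption made for illustration: the Host and ToolResult classes, the cap size, the head/tail lengths, and the SearchEmail tool name.

```python
from dataclasses import dataclass, field

# Tool names as listed in the PR description; the real set is StashExemptTools.
STASH_EXEMPT_TOOLS = frozenset({
    "GetFromWorkingMemory",
    "SearchWorkingMemory",
    "ListWorkingMemory",
})

MAX_RESULT_CHARS = 1_000  # hypothetical per-call cap, not the host's real value


@dataclass
class ToolResult:
    tool_name: str
    call_id: str
    content: str


@dataclass
class Host:
    session: str
    working_memory: dict = field(default_factory=dict)

    def cap_tool_result(self, result: ToolResult) -> ToolResult:
        """Cap path: stash oversized results unless the tool is exempt."""
        if result.tool_name in STASH_EXEMPT_TOOLS:
            return result  # explicit retrieval: returned in full, never re-stashed
        if len(result.content) <= MAX_RESULT_CHARS:
            return result
        key = f"stash/{self.session}/{result.call_id}"
        self.working_memory[key] = result.content
        head, tail = result.content[:200], result.content[-200:]
        marker = f"[... elided; fetch full result with GetFromWorkingMemory('{key}') ...]"
        return ToolResult(result.tool_name, result.call_id, f"{head}\n{marker}\n{tail}")

    def pick_trim_candidate(self, results):
        """Trim path: largest non-exempt result, or None if all are exempt."""
        candidates = [r for r in results if r.tool_name not in STASH_EXEMPT_TOOLS]
        return max(candidates, key=lambda r: len(r.content), default=None)


# A large payload from a hypothetical SearchEmail tool is trimmed and stashed once.
host = Host(session="s1")
payload = "x" * 15_000
first = host.cap_tool_result(ToolResult("SearchEmail", "call-1", payload))

# The explicit retrieval passes through the cap untouched, so no second stash key
# is created and the model has nothing new to chase.
fetched = host.cap_tool_result(
    ToolResult("GetFromWorkingMemory", "call-2", host.working_memory["stash/s1/call-1"])
)
```

In this sketch the loop cannot form because the retrieval result never re-enters the stash: working memory still holds only the original key after the second call, and the trimmer has no exempt result to select.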
Tests
Added 4 regression tests (with [Timeout] guards mirroring the real loop). RockBot.Host.Tests: 1061 passed, 0 failed.

Deployment
Version bumped 0.12.29 -> 0.12.30. Image rockylhotka/rockbot-agent:0.12.30 built, pushed, and deployed to the live rockbot namespace via kubectl set image for testing; calendar-mcp confirms Client (RockBot.Agent 0.12.30.0) is live.