fix: agent loses context and halts after first session compaction #3042
Merged
compactIfNeeded estimated the token impact of newly added messages via sess.GetAllMessages(), which recurses into sub-sessions. In multi-agent runs the content produced by a transfer_task child was therefore attributed to the parent session even though it never enters the parent's prompt (GetMessages skips sub-session items).

The phantom tokens triggered a compaction of a parent conversation that was actually tiny; with everything fitting the keep budget the split resolved to the 'compact everything, keep nothing' sentinel, so the user's task and the in-flight tool exchange were wiped. The agent's next prompt was literally just 'Session Summary: ...', which models read as the user asking for a summary and answer with a confused 'I see no conversation history' reply, halting mid-task.

Add Session.OwnMessages() (direct messages only, no sub-session recursion) and use it for the trigger's before/after counts so the estimate matches what the session actually sends.

Fixes docker#2871
Assisted-By: docker-agent
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
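The difference between the two accessors described in this commit message can be illustrated with a minimal sketch. The Session, Item, and Message types, the demoSession helper, and the 4-characters-per-token estimate are all hypothetical stand-ins for illustration, not the project's real definitions; only the method names GetAllMessages and OwnMessages come from the PR.

```go
package main

import (
	"fmt"
	"strings"
)

// Message is a simplified stand-in for a chat message (hypothetical type).
type Message struct {
	Role    string
	Content string
}

// Item is either a direct message or a nested sub-session (assumed layout).
type Item struct {
	Message    *Message
	SubSession *Session
}

// Session holds an ordered list of items (hypothetical type).
type Session struct {
	Items []Item
}

// GetAllMessages recurses into sub-sessions, so a child's output is included.
func (s *Session) GetAllMessages() []Message {
	var out []Message
	for _, it := range s.Items {
		switch {
		case it.Message != nil:
			out = append(out, *it.Message)
		case it.SubSession != nil:
			out = append(out, it.SubSession.GetAllMessages()...)
		}
	}
	return out
}

// OwnMessages returns only the session's direct messages, without recursion.
func (s *Session) OwnMessages() []Message {
	var out []Message
	for _, it := range s.Items {
		if it.Message != nil {
			out = append(out, *it.Message)
		}
	}
	return out
}

// estimateTokens uses a crude 4-characters-per-token heuristic (an assumption).
func estimateTokens(msgs []Message) int {
	total := 0
	for _, m := range msgs {
		total += len(m.Content) / 4
	}
	return total
}

// demoSession builds a tiny parent conversation with one large child session.
func demoSession() *Session {
	child := &Session{Items: []Item{
		{Message: &Message{Role: "assistant", Content: strings.Repeat("x", 40000)}},
	}}
	return &Session{Items: []Item{
		{Message: &Message{Role: "user", Content: "Refactor the parser"}},
		{SubSession: child},
		{Message: &Message{Role: "tool", Content: "done"}},
	}}
}

func main() {
	p := demoSession()
	fmt.Println("own tokens:", estimateTokens(p.OwnMessages()))
	fmt.Println("all tokens:", estimateTokens(p.GetAllMessages()))
}
```

In this sketch the recursive count is dominated by the child session's output, while the parent's own messages stay tiny, which is the mismatch that produced the phantom trigger.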
The compactor used fixed absolute budgets: MaxSummaryTokens (16k) was subtracted from the window when sizing the summarizer's input, and maxKeepTokens (20k) sized the verbatim-kept tail. Since ead9745 made compaction activate for models whose window resolves from provider_opts.context_size, both constants can exceed the entire window: contextAvailable went to zero, FirstIndexInBudget dropped every conversation message, and the summarizer received only its own prompts. It then fabricated an 'I see no conversation history' non-summary that replaced the real session history.

Scale both budgets to the window (min(16k, limit/4) for the summary cap, min(20k, limit/5) for the kept tail) so the kept tail plus the summary always land well under the compaction threshold, and use the scaled cap for the summary call's max_tokens so small-window providers don't reject the request.

As a safety net, RunLLM now no-ops when not a single conversation message fits the summarization budget (e.g. one giant tool result) instead of running the summarizer on an empty conversation and wiping the history with the result.

ComputeFirstKeptEntry gains a contextLimit parameter so hook-supplied summaries share the same kept-tail policy; a non-positive limit falls back to the unscaled budget.

Related to docker#2871
Assisted-By: docker-agent
Signed-off-by: Djordje Lukic <djordje.lukic@docker.com>
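The scaling rule in this commit message can be worked through with a short sketch. The function names and the exact constant values (16,000 and 20,000 for "16k" and "20k") are assumptions for illustration. The non-positive fallback is stated in the source only for the kept-tail budget; applying it to the summary cap as well is also an assumption.

```go
package main

import "fmt"

const (
	maxSummaryTokens = 16_000 // "16k" in the PR; the exact value is assumed
	maxKeepTokens    = 20_000 // "20k" in the PR; the exact value is assumed
)

// minInt returns the smaller of two ints.
func minInt(a, b int) int {
	if a < b {
		return a
	}
	return b
}

// scaledSummaryCap returns min(16k, limit/4).
// Falling back to the unscaled cap for a non-positive limit is an assumption.
func scaledSummaryCap(contextLimit int) int {
	if contextLimit <= 0 {
		return maxSummaryTokens
	}
	return minInt(maxSummaryTokens, contextLimit/4)
}

// scaledKeepBudget returns min(20k, limit/5); a non-positive limit
// falls back to the unscaled budget, as the commit message describes.
func scaledKeepBudget(contextLimit int) int {
	if contextLimit <= 0 {
		return maxKeepTokens
	}
	return minInt(maxKeepTokens, contextLimit/5)
}

func main() {
	for _, limit := range []int{8_000, 16_000, 200_000} {
		s, k := scaledSummaryCap(limit), scaledKeepBudget(limit)
		fmt.Printf("limit=%d summary=%d keep=%d combined=%d\n", limit, s, k, s+k)
	}
}
```

Because the two fractions add up to 1/4 + 1/5 = 9/20, the summary cap plus the kept tail never exceed 45% of a positive window under this rule, whereas the fixed 16k and 20k budgets could each exceed a small window on their own.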
dgageot
approved these changes
Jun 9, 2026
docker-agent
reviewed
Jun 9, 2026
docker-agent
left a comment
Assessment: 🟢 APPROVE
This PR correctly addresses two compounding compaction bugs:
- Phantom token trigger: switching from GetAllMessages() (which recurses into sub-sessions) to OwnMessages() (which does not) ensures sub-agent token counts no longer falsely trigger parent-session compaction.
- Fixed budget overflow: scaling MaxSummaryTokens and maxKeepTokens proportionally to the context window (limit/4 and limit/5) prevents the summarizer from consuming the entire budget on small-window models, and the len(messages) <= 2 no-op guard correctly prevents a fabricated non-summary from replacing real session history.
Verification summary:
- The ApplyCompaction path only appends to s.Messages, so the sess.OwnMessages()[messageCountBefore:] slice in compactIfNeeded cannot panic (length is monotonically non-decreasing in a single-goroutine call chain).
- OwnMessages() excluding system-role items is intentional and consistent with GetAllMessages(); the invariant system messages in GetMessages() are built dynamically and were never stored in session items.
- All four new tests (TestCompactIfNeeded_IgnoresSubSessionTokens, TestCompactIfNeeded_TriggersOnOwnMessages, TestRunLLM_SmallContextWindow, TestRunLLM_NoConversationFits_NoOps) directly target the described regression scenarios.
No confirmed or likely bugs found in the changed code.
aheritier
added a commit
that referenced
this pull request
Jun 10, 2026
…windows
After the fix in #3042, the summary and keep-tail token budgets used during session compaction scale proportionally to provider_opts.context_size instead of using absolute 16k/20k constants. Small-context-window models (≤ ~16k) no longer have their history wiped during compaction.
Ref: #3042
Fixes #2871
Problem
After the first session compaction in a multi-agent run, the agent halts mid-task and replies as if it has no conversation history ("I understand you're looking for a session summary, but ... no previous conversation history visible").
Root cause
Two compounding bugs, surfaced by ead9745 / 8dba51f (2026-05-18) which expanded when compaction activates — four days before the issue was filed:
- Phantom trigger in multi-agent runs: compactIfNeeded estimated newly-added tokens via sess.GetAllMessages(), which recurses into sub-sessions. The content produced by a transfer_task child was attributed to the parent session even though it never enters the parent's prompt (GetMessages skips sub-session items). The phantom tokens triggered compaction of a parent conversation that was actually tiny; with everything fitting the keep budget, the split resolved to the "compact everything, keep nothing" sentinel, wiping the user's task and the in-flight tool exchange. The agent's next prompt was literally just "Session Summary: ...", which models read as the user asking for a summary. This also explains the "first compaction only" symptom: the first compaction fires while the parent history is still tiny; after re-prompting, later compactions keep a real tail.
- Fixed budgets break small context windows: MaxSummaryTokens (16k) and maxKeepTokens (20k) are absolute constants. For models whose window resolves from provider_opts.context_size and is ≤ ~16k, the summarizer's input budget went to zero: it received only its own prompts, fabricated a "no history" non-summary, and that text replaced the entire session history.
Fix
- Session.OwnMessages() (no sub-session recursion) now drives the compaction trigger's token accounting, so sub-agent work no longer causes phantom parent compactions.
- The summary and kept-tail budgets now scale with the context window (min(16k, limit/4) / min(20k, limit/5)); the scaled cap is also used for the summary call's max_tokens.
- RunLLM no-ops when no conversation message fits the summarization budget, instead of running the summarizer on an empty conversation and wiping history with the result.
- ComputeFirstKeptEntry gains a contextLimit parameter so hook-supplied summaries share the same kept-tail policy.
Tests
- TestCompactIfNeeded_IgnoresSubSessionTokens: regression test, verified to fail against the old trigger code.
- TestCompactIfNeeded_TriggersOnOwnMessages: large own tool results still trigger.
- TestRunLLM_SmallContextWindow: summarizer receives real conversation on an 8k window and a tail is kept.
- TestRunLLM_NoConversationFits_NoOps: empty summarizer input no-ops instead of wiping history.
- task build, task test, task lint all pass (only pre-existing, environment-dependent pkg/sandbox.TestExtraWorkspace failure, which also fails on clean main).
Note for reviewers
One residual hazard left untouched (documented contract defended in 1e9512e): a legitimately triggered compaction whose whole conversation fits the keep budget (possible with image-heavy histories — token estimates ignore images) still drops the tail via the "compact everything" sentinel. Happy to follow up with a "keep the last user turn on threshold/overflow compaction" change if desired.
Assisted-By: docker-agent