feat: agent/team SSE reconnection and resume #6849
Open
kausmeows commented on Mar 3, 2026
Summary
Adds SSE-based reconnection support for agent runs when using `background=True, stream=True`. When a frontend user refreshes the page or loses network during an agent's SSE stream, they can now reconnect via a new `/resume` endpoint and pick up where they left off, catching up on missed events and continuing to receive live events if the agent is still running.

This follows the same pattern as workflows, where `background=True` + `stream=True` uses WebSocket-based reconnection. For agents, reconnection is done over SSE instead.

Key changes:

- `background=True, stream=True` now supported in `arun_dispatch`: previously rejected with `ValueError`, now routes to the new `_arun_background_stream`
- Follows the `_arun_background` pattern for non-streaming background runs
- The agent runs in an `asyncio.Task` (inside `_arun_background_stream`) so it survives client disconnections
- Events are buffered in `EventsBuffer` with a sequential `event_index`, following the same pattern as `workflow.py`
- `SSESubscriberManager`: manages `asyncio.Queue` subscribers for live event forwarding to resumed clients
- `POST /agents/{agent_id}/runs/{run_id}/resume` endpoint: three reconnection paths (live subscription, buffer replay, DB fallback)
- `format_sse_event_with_index()` helper: injects `event_index` and `run_id` into SSE payloads without modifying core dataclasses

Architecture
Problem
When `background=True` is used, the agent should keep running even if the client disconnects. Previously, `agent.arun()` ran inside the `StreamingResponse` async generator: when the HTTP connection closed (page refresh, network loss), Starlette cancelled the generator, killing the agent mid-execution. All events were lost, with no way to reconnect.

Additionally, `arun_dispatch` rejected `background=True, stream=True` with a `ValueError`, so there was no way to combine background execution with streaming.

Solution: Decoupled Producer/Consumer (`background=True` only)
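As a rough illustration of the decoupling this section describes, the sketch below runs a stand-in "agent" in its own `asyncio.Task` while a thin reader consumes a queue. It is not the code from this PR: `agent_events`, `run_in_background`, `client_stream`, and `demo` are hypothetical names, and a plain list stands in for the event buffer. It only shows that stopping the reader does not stop the producer.

```python
import asyncio


async def agent_events(n: int):
    """Stand-in for the agent's event stream (hypothetical)."""
    for i in range(n):
        await asyncio.sleep(0.01)
        yield f"event-{i}"


async def run_in_background(n: int, buffer: list, live: asyncio.Queue) -> None:
    """Producer: runs in its own task and owns the agent stream."""
    async for event in agent_events(n):
        buffer.append(event)   # kept for later catch-up
        await live.put(event)  # forwarded to the connected client
    await live.put(None)       # sentinel: the run has finished


async def client_stream(live: asyncio.Queue):
    """Consumer: thin queue reader, the only part tied to the connection."""
    while (event := await live.get()) is not None:
        yield event


async def demo() -> tuple[list, list]:
    buffer: list = []
    live: asyncio.Queue = asyncio.Queue()
    producer = asyncio.create_task(run_in_background(5, buffer, live))

    received = []
    stream = client_stream(live)
    async for event in stream:
        received.append(event)
        if len(received) == 2:
            break              # simulate the client disconnecting
    await stream.aclose()

    await producer             # the producer still runs to completion
    return received, buffer


if __name__ == "__main__":
    got, buffered = asyncio.run(demo())
    print(got)       # events the client saw before disconnecting
    print(buffered)  # every event the run produced
```

In the inline (`background=False`) case the agent stream would be iterated directly inside the response generator, so closing that generator would end the run as well.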
When `background=True, stream=True`, the new `_arun_background_stream` in `_run.py`:

- Persists the run in its session before streaming starts (`aread_or_create_session` + `upsert_run` + `asave_session`)
- Starts an `asyncio.Task` that runs `_arun_stream` and publishes events to:
  - `EventsBuffer`: for catch-up replay on reconnection (lazy import from `agno.os.managers`, same as `workflow.py`)
  - `SSESubscriberManager` queues: for live forwarding to `/resume` clients
  - an `asyncio.Queue`: read by the original `StreamingResponse` generator
- Formats each event with `format_sse_event_with_index` (lazy import from `agno.os.utils`)

When the original client disconnects, only the thin queue-reader is cancelled. The background task keeps running in the event loop until the agent completes.

When `background=False` (default), the original direct-yield streamer is used: the agent runs inline inside the generator, and client disconnection cancels the agent. No buffering, no `event_index` injection. Behavior is identical to before this PR.

Streaming Path Selection
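| `background` | `stream` | Behavior |
| --- | --- | --- |
| `false` | `true` | Direct-yield streamer; client disconnection cancels the agent. |
| `false` | `false` | |
| `true` | `true` | `_arun_background_stream`; `/resume` available. |
| `true` | `false` | `_arun_background` (non-streaming background run). |

Components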
1. Background Stream Execution (`_run.py`: `_arun_background_stream`)

New function in `_run.py` that handles the `background=True, stream=True` path. Similar to how `_arun_background` handles `background=True, stream=False`, but for streaming:

- Starts an `asyncio.Task` that runs `_arun_stream`
- Buffers events in `event_buffer` and publishes to `sse_subscriber_manager` (lazy imports from `agno.os.managers`, matching the workflow pattern where `workflow.py` imports from `agno.os.managers` for `event_buffer` and `websocket_manager`)
- Formats events with `format_sse_event_with_index` (lazy import from `agno.os.utils`)
- Hands the formatted events to the original client through an `asyncio.Queue`
- Reads `run_response.status` for the final status (set by `_arun_stream` / `acleanup_and_store`)

The router's `agent_resumable_response_streamer` is now a thin wrapper that just calls `agent.arun(background=True, stream=True)` and yields the SSE strings.

2. Event Buffering (`managers.py`: `EventsBuffer`, pre-existing)

Every event is stored with a sequential `event_index` (0, 1, 2, ...). This is the same `EventsBuffer` class that workflows already use.

- `add_event(run_id, event)` -> returns `event_index`
- `get_events(run_id, last_event_index=N)` -> returns events after index N
- `set_run_completed(run_id, status)` -> marks the run done, triggers cleanup after 30 min
- `get_run_status(run_id)` -> returns `running`, `completed`, `error`, etc.

3. SSE Subscriber Manager (`managers.py`: new)

When a `/resume` client connects while the agent is still running, it registers an `asyncio.Queue`. The producer pushes every event to all registered queues. A `None` sentinel signals completion.

4. `event_index` Injection (`utils.py`: `format_sse_event_with_index`)

Resumable SSE events include an `event_index` field in their JSON payload. The helper is used by `_arun_background_stream` in `_run.py` to format events. The core `BaseAgentRunEvent` dataclass is not modified.

5. `/resume` Endpoint (`router.py`)

Three reconnection paths:

- Live subscription: the run is still in progress, so the client catches up on missed events and then receives live ones
- Buffer replay: missed events are replayed from the buffer
- DB fallback: via `agent.aget_run_output()`

Race condition handling: After subscribing but before entering the queue loop, the buffer status is re-checked. If the run completed during catch-up, the remaining events are replayed from the buffer instead of waiting on an empty queue (the sentinel was pushed before the subscription existed).
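The catch-up, subscribe, and re-check sequence can be sketched roughly as below. This is a simplified illustration, not the PR's implementation: `ToyBuffer`, `to_sse`, `produce`, `resume`, and `demo` are hypothetical names, the buffer tracks a single run, the SSE frame layout is an assumption, and the DB fallback path is left out.

```python
import asyncio
import json
from typing import AsyncIterator


class ToyBuffer:
    """Hypothetical stand-in for EventsBuffer, tracking a single run."""

    def __init__(self) -> None:
        self.events: list[dict] = []
        self.status = "running"

    def add_event(self, event: dict) -> int:
        self.events.append(event)
        return len(self.events) - 1  # sequential event_index: 0, 1, 2, ...

    def get_events(self, last_event_index: int = -1) -> list[tuple[int, dict]]:
        return list(enumerate(self.events))[last_event_index + 1:]


def to_sse(event: dict, index: int, run_id: str) -> str:
    """Copy the event and add event_index/run_id; the event is not mutated."""
    payload = {**event, "event_index": index, "run_id": run_id}
    return f"event: {event['event']}\ndata: {json.dumps(payload)}\n\n"


async def produce(buffer: ToyBuffer, subscribers: list, count: int) -> None:
    """Producer: buffer each event, then push it to every subscriber queue."""
    for _ in range(count):
        await asyncio.sleep(0)
        event = {"event": "RunContent", "content": f"chunk-{len(buffer.events)}"}
        index = buffer.add_event(event)
        for queue in subscribers:
            queue.put_nowait((index, event))
    buffer.status = "completed"
    for queue in subscribers:
        queue.put_nowait(None)  # sentinel: run finished


async def resume(buffer: ToyBuffer, subscribers: list, run_id: str,
                 last_event_index: int = -1) -> AsyncIterator[str]:
    if buffer.status != "running":
        # Buffer replay: the run is over, the buffer holds everything left.
        for index, event in buffer.get_events(last_event_index):
            yield to_sse(event, index, run_id)
        return
    # Live path, step 1: catch up. Each yield may let the producer run.
    for index, event in buffer.get_events(last_event_index):
        last_event_index = index
        yield to_sse(event, index, run_id)
    # Step 2: subscribe, then immediately re-check status and buffer.
    queue: asyncio.Queue = asyncio.Queue()
    subscribers.append(queue)
    try:
        finished = buffer.status != "running"  # sentinel already missed?
        for index, event in buffer.get_events(last_event_index):
            last_event_index = index
            yield to_sse(event, index, run_id)
        if finished:
            return  # do not wait on a queue that will never get a sentinel
        # Step 3: follow live events until the sentinel arrives.
        while (item := await queue.get()) is not None:
            index, event = item
            if index > last_event_index:
                last_event_index = index
                yield to_sse(event, index, run_id)
    finally:
        subscribers.remove(queue)


async def demo() -> list[int]:
    buffer, subscribers = ToyBuffer(), []
    for i in range(3):  # events produced before the client reconnects
        buffer.add_event({"event": "RunContent", "content": f"chunk-{i}"})
    producer = asyncio.create_task(produce(buffer, subscribers, 3))
    seen = []
    # The client already has event 0 and resumes from there.
    async for frame in resume(buffer, subscribers, "run-1", last_event_index=0):
        seen.append(json.loads(frame.split("data: ", 1)[1])["event_index"])
        await asyncio.sleep(0)  # let the producer interleave
    await producer
    return seen
```

Capturing the status immediately after subscribing, before any further `yield`, is the point of the re-check: anything added before that moment is in the buffer snapshot, and anything added after it arrives through the queue.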
Meta Events
The `/resume` stream may include these meta events before actual data:

- `catch_up`
- `replay`
- `subscribed`
- `error`

Files Changed
| File | Change |
| --- | --- |
| `libs/agno/agno/os/routers/agents/router.py` | `_resume_stream_generator`, new `/resume` endpoint |
| `libs/agno/agno/os/managers.py` | `SSESubscriberManager` class + global instance |
| `libs/agno/agno/os/utils.py` | `format_sse_event_with_index()` helper |
| `cookbook/05_agent_os/client/10_sse_reconnect.py` | |

Type of change
Checklist
`./scripts/format.sh` and `./scripts/validate.sh`