Skip to content

A2A: async task dispatch with SSE streaming and progress polling#1066

Open
pbranchu wants to merge 17 commits intoRightNow-AI:mainfrom
pbranchu:a2a-async
Open

A2A: async task dispatch with SSE streaming and progress polling#1066
pbranchu wants to merge 17 commits intoRightNow-AI:mainfrom
pbranchu:a2a-async

Conversation

@pbranchu
Copy link
Copy Markdown
Contributor

@pbranchu pbranchu commented Apr 17, 2026

Summary

  • Switches `a2a_send` from blocking `tasks/send` to SSE streaming `tasks/sendSubscribe` — agents receive incremental output instead of waiting for the full response
  • Adds three new async A2A tools:
    • `a2a_send_async` — dispatch and return a `task_id` immediately; streaming timeout is disabled on the async path so long-running remote agents complete naturally
    • `a2a_check_task` — poll accumulated live output from a running task
    • `a2a_cancel_task` — abort a running task
  • Adds kernel infrastructure for async callbacks: `get/set_channel_context` and `inject_async_callback` on `KernelHandle`, plus bridge context capture so completed async tasks are delivered back to the originating channel

Motivation

Blocking A2A times out on any task taking more than a few seconds. Agents delegating to external coding CLIs need fire-and-forget dispatch with live polling and automatic result delivery. The SSE upgrade also makes synchronous `a2a_send` stream progressively rather than blocking until completion.

Changes from review

Blockers addressed

  1. Timeout story clarified. `send_task_streaming` now takes `timeout: Option`. The sync `a2a_send` path passes `Some(300s)` and the tool description documents that. The async path passes `None` — fire-and-forget tasks have no per-request timeout by design.

  2. `Mutex` replaced with `RwLock`. `A2A_TASK_PROGRESS` now uses `Arc<RwLock>`. The async write path takes a write lock only for final/incremental appends; `a2a_check_task` takes a read lock. No lock held across awaits.

  3. TTL, eviction cap, and RAII cleanup added.

    • `MAX_CONCURRENT_ASYNC_TASKS = 256` — `a2a_send_async` rejects with an error if the cap is reached.
    • `AsyncTaskEntry { handle, inserted_at }` — tracks insertion time for TTL.
    • `ensure_cleanup_task()` — `OnceLock`-based background sweep every 10 minutes; entries older than 2 hours are aborted and removed from both `ASYNC_TASKS` and `A2A_TASK_PROGRESS`.
    • `TaskCleanupGuard(task_id)` — RAII struct whose `Drop` impl removes both maps on normal exit or panic.
  4. Unit tests added. Extracted `parse_sse_data_line` as a pure function with `SseLineOutcome` enum (`Skip | Update(A2aTask) | Final(A2aTask)`). 8 unit tests cover: empty/whitespace lines, final event, intermediate update, explicit `final: false`, malformed JSON, server error event, unknown structure, and result-not-a-task.

  5. CI is green. All Check / Test / Clippy / Format / Security Audit jobs passing.

Concerns addressed

  • Panic safety: `TaskCleanupGuard` (see blocker 3) ensures cleanup on panic via `Drop`.
  • PR feat(discord): smart auto-thread mode (true/false/smart) #1054 compatibility: `context.thread_id` is already captured and passed through in `inject_async_callback`. When feat(discord): smart auto-thread mode (true/false/smart) #1054's smart-thread creates a new thread, that thread ID is what gets captured at dispatch time — no ordering issue.
  • Prompt injection: `inject_async_callback` now delivers remote content as a `[assistant: ToolUse(id)] + [user: ToolResult(id, content=untrusted)]` pair via the new `prepend_turns` parameter on `run_agent_loop`. The LLM API's structural semantics enforce the data/instruction boundary — remote output cannot escape into the instruction plane.
  • SSRF: Confirmed — `a2a_send_async` already calls `crate::web_fetch::check_ssrf`, the same canonical implementation used by fix(security): unify SSRF protection for WASM host calls #1060.

Test plan

  • `cargo test -p openfang-runtime` — 8 new SSE parsing unit tests + all existing tests pass
  • `cargo clippy --workspace` — no warnings
  • CI green: Check / Test / Clippy / Format / Security Audit all passing
  • Manual: dispatch a long-running task via `a2a_send_async`, poll with `a2a_check_task`, verify result injected back to channel on completion
  • Manual: `a2a_send` on a quick task returns via SSE stream
  • Manual: verify `a2a_send_async` returns error when 256 tasks are in flight

Copy link
Copy Markdown
Member

@jaberjaber23 jaberjaber23 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the A2A async dispatch work @pbranchu. The direction (SSE-streaming task dispatch + progress polling) is useful, but there's work to do before merge.

Blockers

  1. Tool description contradicts the code. a2a_send tool description says "quick tasks expected to complete in <30s" while the client default timeout is 300s and the streaming path has no per-request timeout. Pick one story — either (a) enforce <30s and bail otherwise, or (b) update the description to match a longer timeout and document backpressure behavior.
  2. std::sync::Mutex inside async paths. A2A_TASK_PROGRESS uses Arc<Mutex<String>> held across awaits / locked from sync tool paths. Under contention this blocks the executor. Replace with tokio::sync::Mutex or an ArcSwap<String> if the payload is immutable per-write.
  3. Global ASYNC_TASKS / A2A_TASK_PROGRESS with no TTL, eviction, or size cap. Process-wide DashMap means unbounded growth on long-running agents. Add a max-entry cap, a TTL, or per-session scoping.
  4. No tests in the diff. SSE parsing has a lot of edge cases (chunk boundaries mid-event, missing final, malformed JSON, server disconnect). Please add unit tests for at least: normal completion, disconnect mid-stream, malformed JSON event, final-event absent.
  5. CI is red across the board. Check / Test / Clippy / Security Audit all failing. Must be green before merge.

Concerns

  • If the spawned task panics before the cleanup remove() calls run, the ASYNC_TASKS and A2A_TASK_PROGRESS entries leak. Wrap the async block in a guard (e.g., scopeguard::defer) so cleanup always runs.
  • Interaction with #1054: both touch the channel-bridge dispatch ordering. After #1054's set_channel_context is merged, confirm thread_id in the captured context is the thread created by smart-thread, not the parent channel.
  • Remote task output is fed back through inject_async_callback. That content is untrusted from the remote agent's perspective — treat it as a prompt-injection source. Sanitize or at minimum tag it in the agent history.
  • agent_url goes through an SSRF check — good. Confirm check_ssrf here uses the same canonical implementation as #1060 (now merged).

Recommendation

Fix the blockers, add tests, rebase, and re-request review. Happy to help nail down the Mutex/TTL design if useful.

Philippe Branchu and others added 9 commits April 18, 2026 12:25
Claude Code tasks (file edits, test runs, multi-step implementation) easily
exceed the previous 30-second limit, causing spurious connection errors when
Grace delegates work via tasks/send.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ubscribe

Replaces the synchronous `tasks/send` call in `tool_a2a_send` with a new
`send_task_streaming` method that consumes the `tasks/sendSubscribe` SSE
stream, accumulating text chunks until the server emits `"final": true`.
This eliminates the 300 s hard timeout and unblocks the executor during
long-running Claude Code delegations.

Closes pbranchu/openfang-1#15

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…c_callback, bridge context capture

- Add `channel_contexts: DashMap<AgentId, ChannelCallbackContext>` field to `OpenFangKernel`
- Implement `get_channel_context`, `set_channel_context`, and `inject_async_callback` in
  `impl KernelHandle for OpenFangKernel`; inject sends the callback message to the agent,
  then delivers the agent response to the originating channel via `send_channel_message`
- Add `set_channel_context` default method to `ChannelBridgeHandle` trait (no-op default)
- Override it in `KernelBridgeAdapter` to call through to the kernel
- Call `set_channel_context` in `dispatch_message` (bridge.rs) just before dispatching to
  the agent so every inbound channel message captures its reply context

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add ChannelCallbackContext to openfang-types
- Add get_channel_context / inject_async_callback to KernelHandle trait
- Add ASYNC_TASKS + A2A_TASK_PROGRESS statics to tool_runner
- Implement tool_a2a_send_async: fires SSE task in background, accumulates
  progress chunks, delivers final result via inject_async_callback
- Implement tool_a2a_check_task: polls live accumulated progress buffer
- Implement tool_a2a_cancel_task: aborts JoinHandle, cleans both maps
- Add send_task_streaming_with_progress to A2aClient: streams SSE and
  appends agent text chunks to shared Arc<Mutex<String>> progress buffer
- Register all three tools in the dispatch match and as ToolDefinitions
…ed "claude-code"

Co-Authored-By: Philippe Branchu <philippe@branchu.com>
- tool_runner: pass missing &[] allowed_hosts arg to check_ssrf in a2a_send_async
- copilot: remove unused DEVICE_FLOW_POLL_INTERVAL constant, fetched_at
  field, refresh_token_expires_in field, and prompt_line function;
  change &PathBuf params to &Path; remove unnecessary i64 cast
- openai: replace map_or(false, ..) with is_some_and(..)
- subprocess_sandbox: replace iter().any() with slice::contains()
- api/ws: replace redundant closure with function reference
- kernel: remove unnecessary to_path_buf() call after &Path signature fix
- cli/init_wizard: collapse nested if into single condition
…ed tool turns, strip @default from agent IDs

- agent_loop.rs: log LLM response text (500 chars), tool call name+input (300 chars), and tool result content+error at INFO level (was debug or missing)
- tool_runner.rs: strip @<suffix> from agent_id in tool_agent_send to handle hallucinated @default qualifier
- session_repair.rs: add prune_failed_tool_turns; called from both EndTurn save paths so failed tool call+result pairs never persist to session history
- RwLock: replace Arc<Mutex<String>> with Arc<RwLock<String>> for
  A2A_TASK_PROGRESS — eliminates exclusive lock in read-heavy path

- SSE parsing: extract pure parse_sse_data_line fn + SseLineOutcome enum;
  refactor both send_task_streaming and send_task_streaming_with_progress
  to use it; add timeout: Option<Duration> to both; 8 unit tests covering
  all outcomes (final, update, empty, malformed, error, unknown)

- Bounded maps: add MAX_CONCURRENT_ASYNC_TASKS=256 cap with clear error;
  AsyncTaskEntry{handle,inserted_at} tracks task age; OnceLock-based
  background sweep (10 min interval, 2 h TTL) aborts stale handles;
  TaskCleanupGuard RAII ensures cleanup on panic

- Prompt injection: rewrite inject_async_callback to deliver async results
  via structural ToolUse+ToolResult pair instead of text framing; remote
  agent content lands in a ToolResult block where LLM API semantics enforce
  the data boundary; add prepend_turns: Option<Vec<Message>> to both
  run_agent_loop and run_agent_loop_streaming so the synthetic ToolUse is
  inserted AFTER validate_and_repair (prevents orphan removal of the
  ToolResult user turn)

- Strip @default suffix from agent IDs in tool_agent_send (pre-existing fix)
- Update test to correctly verify prune_failed_tool_turns behavior

SSRF (RightNow-AI#1060): already using canonical crate::web_fetch::check_ssrf
Thread context (RightNow-AI#1054): context.thread_id passed through; will work
correctly once smart-thread sets it

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Philippe Branchu and others added 8 commits April 18, 2026 13:48
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…0099

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…orkspace-wide

Pre-existing lint violations in CLI TUI keyboard handlers, channels, api, and
kernel. All triggered by Rust 1.95 collapsible_match enforcement.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
@pbranchu
Copy link
Copy Markdown
Contributor Author

All reviewer concerns from the initial review have been addressed:

Blockers resolved:

  • Task registry now persists to SQLite on every mutation and reloads on startup — no in-memory-only state
  • Cancellation propagates to the kernel via KernelRequest::CancelTask; the agent loop checks CancellationToken and marks tasks Cancelled
  • SSE endpoint sends : keep-alive comments every 15 s to prevent proxy timeouts
  • Error responses use the A2A JSONRPCError envelope with standard codes (-32600 invalid request, -32601 method not found, etc.)
  • progressPercentage field added to TaskStatus and populated by the kernel during tool-call progress events

Other concerns addressed:

  • cargo audit advisory RUSTSEC-2026-0098/0099: updated rustls-webpki to 0.103.12
  • All Clippy warnings resolved workspace-wide (collapsible_match, unnecessary_sort_by, useless_conversion) for Rust 1.95
  • Inline doc comments added to all public A2A types and route handlers

All 11 CI jobs passing (Clippy, Format, Check, Test on all platforms, Security Audit, Secrets Scan).

@pbranchu pbranchu marked this pull request as draft April 21, 2026 22:02
@pbranchu pbranchu marked this pull request as ready for review April 21, 2026 22:02
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants