BYO agent silently drops an incoming prompt when the agent→controller /api/tasks call fails (transient transport error) #2086

Description

@tjorourke

Summary

A BYO agent silently drops an incoming prompt when the agent→controller HTTP call to the task store (POST/GET /api/tasks) hits a transient transport error. The user's message is accepted (the UI shows it as sent) but the agent never runs — no LLM/tool/A2A activity, no task or events created in the controller, no reply, and no error surfaced anywhere the user can see. The agent only recovers after a pod restart.

Affected versions / environment

  • kagent-langgraph 0.9.6 (reproduced here)
  • a2a-sdk 0.3.23
  • Packaged under Solo Enterprise for kagent 0.4.3
  • BYO agent (type: BYO), KAGENT_URL → kagent-controller
  • Istio ambient mesh (HBONE) on the agent→controller hop; EKS

Symptom

  1. Send a prompt to a BYO agent (e.g. from the kagent UI). The UI shows the message as sent.
  2. The agent does nothing: no graph execution, no downstream LLM/MCP/A2A calls, no task or events recorded in the controller, no response. The session just looks stuck.
  3. The agent container logs show the incoming request being received, but no subsequent work — the run is abandoned before the agent graph executes.
  4. Restarting the agent pod restores normal behaviour (until the condition recurs).

Where it occurs

On every incoming prompt, before the agent graph runs, the A2A runtime calls the controller-backed task store (KAgentTaskStore): GET /api/tasks/{id} followed by POST /api/tasks. When that hop raises a transient httpx.TransportError — e.g. an idle keep-alive connection reset by the mesh, or the controller pod being rescheduled — the error propagates out of the incoming-request handling and the prompt is lost.
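The sequence above can be sketched with a small simulation that shows one possible mitigation: retrying the task-store calls on a transient transport error. This is an illustration under stated assumptions, not the kagent implementation or a proposed patch. `TransientTransportError` is a hypothetical stand-in for `httpx.TransportError`, and `FlakyTaskStore`, `call_with_retry`, and `handle_prompt` are made-up names. The fake store fails its first GET (as a reset idle connection would) and then behaves normally.

```python
import asyncio


class TransientTransportError(Exception):
    """Hypothetical stand-in for httpx.TransportError in this sketch."""


async def call_with_retry(op, *, retry_on=(TransientTransportError,),
                          attempts=3, base_delay=0.05):
    """Await the coroutine factory `op`, retrying only on transient errors."""
    for attempt in range(1, attempts + 1):
        try:
            return await op()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: let the caller surface the failure
            # Exponential backoff between attempts.
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


class FlakyTaskStore:
    """Hypothetical fake of the controller-backed store.

    The first GET fails as if the idle connection had been reset.
    """

    def __init__(self):
        self.calls = 0
        self.saved = {}

    async def get(self, task_id):
        self.calls += 1
        if self.calls == 1:
            raise TransientTransportError("connection reset by peer")
        return self.saved.get(task_id)

    async def save(self, task_id, task):
        self.saved[task_id] = task


async def handle_prompt(store, task_id, prompt):
    # GET then POST, in the order described in the report, before the graph runs.
    existing = await call_with_retry(lambda: store.get(task_id))
    task = existing or {"id": task_id, "prompt": prompt}
    await call_with_retry(lambda: store.save(task_id, task))
    return f"graph ran for {task_id}"


def demo():
    store = FlakyTaskStore()
    return asyncio.run(handle_prompt(store, "t1", "hello")), store.calls
```

Without the retry wrapper, the first `get` would raise out of `handle_prompt` and the prompt would never reach the graph. Retrying the save step is only safe if repeating POST /api/tasks is harmless, which this report does not establish.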

Scope — not framework-specific

Reproduced with kagent-langgraph 0.9.6. The same construct — a controller httpx.AsyncClient built in the adapter's _a2a.py and wrapped in the shared KAgentTaskStore, invoked on every incoming prompt — is also present in the kagent-adk adapter (python/.../kagent/adk/_a2a.py). So this affects BYO agents using the controller-backed task store regardless of framework adapter; the common component is KAgentTaskStore / the controller client.

Impact / severity

Silent loss of a user request on the primary interaction path, with no error surfaced to the user and recovery only via a pod restart. In a service-mesh deployment the triggering condition (idle connection reset / controller reschedule) is routine, so it recurs. In our environment it reproduces reliably — the first prompt after the agent→controller connection has been idle is dropped every time.
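The silent part of the failure could be addressed separately from the transport error itself. Here is a minimal sketch, assuming the request handler is able to return a failed result to the caller: it converts the task-store exception into a visible error instead of abandoning the run. All names (`TransientTransportError`, `broken_task_store_get`, `handle_prompt`, the result dict shape) are hypothetical and are not taken from kagent or a2a-sdk.

```python
import asyncio


class TransientTransportError(Exception):
    """Hypothetical stand-in for httpx.TransportError in this sketch."""


async def broken_task_store_get(task_id):
    # Simulates the agent-to-controller hop failing on a stale connection.
    raise TransientTransportError("connection reset by peer")


async def handle_prompt(task_id, prompt, store_get=broken_task_store_get):
    """Return a result dict; a task-store failure becomes a visible error."""
    try:
        await store_get(task_id)
    except TransientTransportError as exc:
        # Report the failure to the caller instead of dropping the prompt.
        return {
            "task_id": task_id,
            "state": "failed",
            "error": f"task store unavailable: {exc}",
        }
    return {"task_id": task_id, "state": "completed", "error": None}


def demo():
    return asyncio.run(handle_prompt("t1", "hello"))
```

With this shape the user would at least see a failed state and could resend the message, rather than watching a session that looks stuck.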
