Summary
A BYO agent silently drops an incoming prompt when the agent→controller HTTP call to the task store (POST/GET /api/tasks) hits a transient transport error. The user's message is accepted (the UI shows it as sent) but the agent never runs — no LLM/tool/A2A activity, no task or events created in the controller, no reply, and no error surfaced anywhere the user can see. The agent only recovers after a pod restart.
Affected versions / environment
- kagent-langgraph 0.9.6 (reproduced here)
- a2a-sdk 0.3.23
- Packaged under Solo Enterprise for kagent 0.4.3
- BYO agent (type: BYO), KAGENT_URL → kagent-controller
- Istio ambient mesh (HBONE) on the agent→controller hop; EKS
Symptom
- Send a prompt to a BYO agent (e.g. from the kagent UI). The UI shows the message as sent.
- The agent does nothing: no graph execution, no downstream LLM/MCP/A2A calls, no task or events recorded in the controller, no response. The session just looks stuck.
- The agent container logs show the incoming request being received, but no subsequent work — the run is abandoned before the agent graph executes.
- Restarting the agent pod restores normal behaviour (until the condition recurs).
Where it occurs
On every incoming prompt, before the agent graph runs, the A2A runtime calls the controller-backed task store (KAgentTaskStore): GET /api/tasks/{id} followed by POST /api/tasks. When that hop raises a transient httpx.TransportError — e.g. an idle keep-alive connection reset by the mesh, or the controller pod being rescheduled — the error propagates out of the incoming-request handling and the prompt is lost.
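The ordering described above can be illustrated with a small, stdlib-only simulation. It is a sketch of the failure mode, not the actual runtime code. `TransportError` is a stand-in for `httpx.TransportError`, and `FlakyTaskStore`, `handle_prompt`, and `graph_runs` are hypothetical names invented for illustration. Because the task-store round trip comes first, an exception there means the graph step is never reached for that prompt.

```python
import asyncio


class TransportError(Exception):
    """Stand-in for httpx.TransportError so the sketch runs without httpx."""


class FlakyTaskStore:
    """Hypothetical store whose first call fails, like a reset idle connection."""

    def __init__(self):
        self.calls = 0
        self.tasks = {}

    async def get(self, task_id):
        self.calls += 1
        if self.calls == 1:
            raise TransportError("connection reset by peer")
        return self.tasks.get(task_id)

    async def save(self, task_id, task):
        self.tasks[task_id] = task


async def handle_prompt(store, task_id, prompt, graph_runs):
    # The task-store round trip happens first, before the graph.
    existing = await store.get(task_id)
    await store.save(task_id, {"prompt": prompt, "previous": existing})
    # Stands in for executing the agent graph.
    graph_runs.append(prompt)


async def demo():
    store = FlakyTaskStore()
    graph_runs, outcomes = [], []
    for prompt in ["first", "second"]:
        try:
            await handle_prompt(store, "t1", prompt, graph_runs)
            outcomes.append("ran")
        except TransportError:
            # Here the failure is at least recorded; in the reported bug
            # nothing is surfaced to the user at this point.
            outcomes.append("dropped")
    return graph_runs, outcomes


graph_runs, outcomes = asyncio.run(demo())
print(graph_runs, outcomes)
```

In this simulation only the second prompt reaches the graph step, which mirrors the "first prompt after idle is dropped" behaviour reported here.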
Scope — not framework-specific
Reproduced with kagent-langgraph 0.9.6. The same construct — a controller httpx.AsyncClient built in the adapter's _a2a.py and wrapped in the shared KAgentTaskStore, invoked on every incoming prompt — is also present in the kagent-adk adapter (python/.../kagent/adk/_a2a.py). So this affects BYO agents using the controller-backed task store regardless of framework adapter; the common component is KAgentTaskStore / the controller client.
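Since the shared component is the task store, one possible mitigation is a retry layer at that level, so every adapter would benefit. The following is a minimal, stdlib-only sketch under stated assumptions, not the kagent implementation. `with_retry`, `RetryingTaskStore`, and `FlakyStore` are hypothetical names, `TransportError` stands in for `httpx.TransportError`, and the `get`/`save` signatures are simplified for illustration. It also assumes that repeating the task-store calls is safe, which would need to be confirmed for POST /api/tasks.

```python
import asyncio


class TransportError(Exception):
    """Stand-in for httpx.TransportError (keeps the sketch stdlib-only)."""


async def with_retry(call, *, retry_on=(TransportError,), attempts=3, base_delay=0.05):
    """Await call() and retry transient errors with exponential backoff."""
    for attempt in range(1, attempts + 1):
        try:
            return await call()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the error instead of hiding it
            await asyncio.sleep(base_delay * 2 ** (attempt - 1))


class RetryingTaskStore:
    """Hypothetical wrapper; `inner` is any object with async get/save."""

    def __init__(self, inner, **retry_opts):
        self._inner = inner
        self._retry_opts = retry_opts

    async def get(self, task_id):
        return await with_retry(lambda: self._inner.get(task_id), **self._retry_opts)

    async def save(self, task):
        return await with_retry(lambda: self._inner.save(task), **self._retry_opts)


class FlakyStore:
    """Test double: fails the first `failures` get() calls, then behaves."""

    def __init__(self, failures):
        self.failures = failures
        self.saved = {}

    async def get(self, task_id):
        if self.failures > 0:
            self.failures -= 1
            raise TransportError("connection reset")
        return self.saved.get(task_id)

    async def save(self, task):
        self.saved[task["id"]] = task


async def demo():
    store = RetryingTaskStore(FlakyStore(failures=2), base_delay=0.001)
    await store.save({"id": "t1", "state": "submitted"})
    return await store.get("t1")


result = asyncio.run(demo())
print(result)
```

Re-raising after the last attempt is deliberate: if the controller stays unreachable, the caller can still report a failure rather than dropping the prompt silently.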
Impact / severity
Silent loss of a user request on the primary interaction path, with no error surfaced to the user and recovery only via a pod restart. In a service-mesh deployment the triggering condition (idle connection reset / controller reschedule) is routine, so it recurs. In our environment it reproduces reliably — the first prompt after the agent→controller connection has been idle is dropped every time.