[Bug] ACP Cost Tracking Bug

### Summary

Analyzying the traces of [this](https://openhands-ai.slack.com/archives/C09QGUDQVTL/p1773037302157719) Gaia Evaluation, it comes up that 68 out of 157 completed instances (43%) report exactly $0.00 cost despite performing significant work.

Precisely, these 68 zero-cost instances:
- Resolved 51 tasks successfully
- Made 1,783 ACP tool calls (avg 26.2/instance) -- actually more than the tracked instances (avg 16.6/instance)
- 1,446 fetch calls (web search/browse)
- 293 execute calls (terminal/python)
- 36 Agent (subagent) calls
- Produced 26,263 chars of log output on average
- Have costs: [] (empty array) in their metrics -- no cost entries at all
- Report prompt_tokens: 0, completion_tokens: 0 in accumulated_token_usage -- zero tokens despite multi-turn conversations

What tracked instances look like for the 89 instances with cost tracking report:
- Cost model: "acp-managed" with a single cost entry per conversation turn
- Avg ~364 prompt tokens, ~6,583 completion tokens, ~232,912 cache_read tokens per instance
- Total: $54.60 across 1,475 ACP calls

### Possible Root cause
The bug is a race condition in notification draining combined with ACP's lack of turn-completion signaling.

In acp_agent.py:68-76, _drain_notifications() is a best-effort hack:

```
  async def _drain_notifications() -> None:
      await asyncio.sleep(0)
      await asyncio.sleep(_NOTIFICATION_DRAIN_DELAY)  # default 0.1s
```

  The flow is:
  1. conn.prompt() sends user message to ACP server and waits for PromptResponse
  2. _drain_notifications() sleeps 100ms hoping UsageUpdate notifications arrive
  3. _record_usage() reads the cost/token data that (hopefully) arrived

The problem: UsageUpdate notifications are sent as async session updates by the ACP server. They arrive via the same JSON-RPC connection but as separate notifications. If the ACP server sends the UsageUpdate after the PromptResponse and the 100ms drain window is too short (due to network latency, server load, or the server simply sending it late), the UsageUpdate never gets processed for that turn.

The pattern of "all-or-nothing" (entire instances with zero cost, not individual turns) suggests the issue is per-session, not per-turn. This points to a second factor:

  - UsageUpdate.cost is Optional — the ACP server (Claude Code) may simply not send cost data at all for some sessions. The field is: cost: Optional[Cost] = None
  - The code at line 180 checks if update.cost is not None — if the ACP server sends UsageUpdate with cost=None, nothing gets recorded
  - Similarly, PromptResponse.usage is Optional (marked as UNSTABLE in the schema) — if the server doesn't populate it, _record_usage skips token recording entirely (line 344: if response is not None and
  response.usage is not None)

  The two contributing factors

  1. ACP server doesn't always send cost/usage data: Some Claude Code ACP server instances may not report cost in UsageUpdate or usage in PromptResponse — both are Optional fields. This explains the all-or-nothing
   pattern per instance.
  2. Race condition in drain: Even when cost data IS sent, the 100ms drain window may miss late-arriving UsageUpdate notifications. The code explicitly acknowledges this with the TODO referencing
  https://github.com/agentclientprotocol/agent-client-protocol/issues/554.

### Possible Solution
 Two fixes, both should be applied:

**Fix 1 — Use PromptResponse.usage for cost estimation when UsageUpdate is missing (in _record_usage):**

Currently _record_usage only records tokens from PromptResponse.usage but never estimates cost from it. When UsageUpdate.cost is missing (the costs array is empty), you should compute a cost estimate from the token counts in PromptResponse.usage using known model pricing. This is the fallback that was apparently removed in commit 6e99789d.

**Fix 2 — Increase drain delay or make it adaptive:**

Change _NOTIFICATION_DRAIN_DELAY default from 0.1 to something like 0.5 or 1.0, or better yet, implement a polling loop that waits until a UsageUpdate with cost data arrives (with a timeout). Something like:
```python
  async def _drain_notifications(client, timeout=2.0) -> None:
      deadline = asyncio.get_event_loop().time() + timeout
      while asyncio.get_event_loop().time() < deadline:
          await asyncio.sleep(0.1)
          if client._last_cost > 0:  # or check if UsageUpdate was received this turn
              break
```

**Fix 3 (Recommended) — Record cost from UsageUpdate AND fire the stats callback immediately:**

Currently at line 180-184, cost is added to metrics but the stats callback only fires later in _record_usage. If _record_usage is called before the UsageUpdate arrives (race condition), the cost is lost from the callback perspective. Move the stats callback firing into the session_update handler too, or record the cost delta in _record_usage based on whatever accumulated by that point.

The simplest high-confidence fix is restoring the cost fallback estimation from token usage in _record_usage — if after processing a turn, metrics.costs is still empty but PromptResponse.usage has token counts, estimate the cost from tokens. This handles both the "server never sent cost" and "drain missed the UsageUpdate" cases.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[Bug] ACP Cost Tracking Bug #2375

Summary

Possible Root cause

Possible Solution

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

[Bug] ACP Cost Tracking Bug #2375

Description

Summary

Possible Root cause

Possible Solution

Metadata

Metadata

Assignees

Labels

Type

Projects

Milestone

Relationships

Development

Issue actions