
Bug: Streaming token metrics are undercounted and have a missing 'model_name' label #1626

@LukeAVanDrie

Description

What happened:

For streaming responses, Prometheus metrics for token counts were recorded with an empty model_name label, while the target_model_name label was correctly populated. This corrupts observability data, making it impossible to filter metrics by the public-facing model name.

Additionally, the token counting logic itself was brittle: it parsed only the final message in a stream for the usage block, so token counts reported in an earlier message were missed.
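
For illustration, an OpenAI-style SSE stream may carry the usage block in a chunk before the terminal [DONE] frame (hypothetical excerpt):

```
data: {"choices":[{"delta":{"content":"Hello"}}]}

data: {"choices":[],"usage":{"prompt_tokens":7,"completion_tokens":12,"total_tokens":19}}

data: [DONE]
```

A parser that inspects only the final message never sees the usage block in a stream shaped like this.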

What you expected to happen:

Metrics for streaming responses should be recorded with all labels, including model_name and target_model_name, correctly populated from the request context. The token counting logic should also be robust and accumulate usage data from all messages in the stream.

How to reproduce it (as minimally and precisely as possible):

This was discovered when refactoring the hermetic integration tests.

  1. Send a streaming request (e.g., a chat completion request) where the model name is not present in the top-level JSON body (see the sketch after these steps).
  2. Ensure the request includes the x-gateway-api-inference-objective-key header.
  3. Observe the inference_objective_input_tokens_bucket metric after the request completes.
  4. The metric will be present, but the model_name label will be empty (model_name="").
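
For reference, a minimal Go sketch of such a request, assuming a placeholder gateway address and objective name (the request body deliberately omits a top-level model field, per step 1):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical gateway address and objective name; the header key is
	// the one named in the steps above. Note: no top-level "model" field.
	body := `{"messages":[{"role":"user","content":"hi"}],"stream":true}`
	req, err := http.NewRequest(http.MethodPost,
		"http://gateway.example.com/v1/chat/completions", strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	req.Header.Set("x-gateway-api-inference-objective-key", "my-objective")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()
	io.Copy(io.Discard, resp.Body) // drain the SSE stream to completion
	fmt.Println("status:", resp.Status)
}
```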

Anything else we need to know?:

Root Cause Analysis:

Two related issues were discovered:

  1. The director.HandleRequest function incorrectly overwrites RequestContext.IncomingModelName (which is correctly set from the objective header) with a value parsed from the request body's model field. For requests like chat completions, this field doesn't exist at the top level, so IncomingModelName is reset to an empty string. This corrupted context persists for the life of the stream and is used when the final response metrics are recorded (see the first sketch after this list).
  2. The token counting logic in HandleResponseBodyModelStreaming checked only the final [DONE] message for a usage block. It did not accumulate token counts from earlier messages in the stream, making it possible to miss metrics entirely (see the second sketch after this list).
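
A minimal sketch of the guard that addresses issue 1. The field name follows the issue text; the reduced requestContext type and setModelFromBody helper are hypothetical stand-ins for the director's real request handling:

```go
package main

import "fmt"

// requestContext is a hypothetical reduction of the extension's
// RequestContext to the one field relevant here.
type requestContext struct {
	IncomingModelName string
}

// setModelFromBody only overwrites the header-derived model name when the
// body actually carries one, so chat completion bodies without a top-level
// "model" field no longer blank out the context for the life of the stream.
func setModelFromBody(ctx *requestContext, bodyModel string) {
	if bodyModel != "" {
		ctx.IncomingModelName = bodyModel
	}
}

func main() {
	ctx := &requestContext{IncomingModelName: "my-objective"} // set from the objective header
	setModelFromBody(ctx, "")                                 // chat completion: no top-level model
	fmt.Println(ctx.IncomingModelName)                        // still "my-objective"
}
```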
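
And a sketch of the accumulation behavior for issue 2, assuming OpenAI-style chunks. Whether per-chunk usage values should be summed or the last reported figure taken depends on how the backend reports usage, so treat this as illustrative rather than the extension's actual implementation:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// usage mirrors the OpenAI-style usage block, which may appear in any
// streamed chunk, not just the last one before [DONE].
type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

type chunk struct {
	Usage *usage `json:"usage"`
}

// accumulateUsage scans every SSE "data:" line for a usage block instead of
// checking only the final [DONE] message, accumulating whatever it finds.
func accumulateUsage(body string) usage {
	var total usage
	for _, line := range strings.Split(body, "\n") {
		payload, ok := strings.CutPrefix(line, "data: ")
		if !ok || strings.TrimSpace(payload) == "[DONE]" {
			continue
		}
		var c chunk
		if err := json.Unmarshal([]byte(payload), &c); err != nil || c.Usage == nil {
			continue
		}
		total.PromptTokens += c.Usage.PromptTokens
		total.CompletionTokens += c.Usage.CompletionTokens
		total.TotalTokens += c.Usage.TotalTokens
	}
	return total
}

func main() {
	stream := "data: {\"usage\":{\"prompt_tokens\":7,\"completion_tokens\":12,\"total_tokens\":19}}\n" +
		"data: [DONE]\n"
	fmt.Printf("%+v\n", accumulateUsage(stream)) // usage found before [DONE] is not missed
}
```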

Production Risk / Impact:

The production risk is high. This bug leads to corrupted and unusable observability data for common streaming use cases. Metrics for streaming token usage will be missing the model_name label, making it impossible to accurately filter, aggregate, or alert on a per-model basis, potentially breaking monitoring, billing, and capacity planning.

Environment:

  • Discovered during hermetic integration test refactoring.

Labels

  • kind/bug: Categorizes issue or PR as related to a bug.
  • triage/accepted: Indicates an issue or PR is ready to be actively worked on.
