What happened:
For streaming responses, Prometheus metrics for token counts were recorded with an empty `model_name` label, while the `target_model_name` label was correctly populated. This corrupts observability data, making it impossible to filter metrics by the public-facing model name.
Additionally, the token counting logic itself was brittle. It only parsed the final message in a stream for the `usage` block, meaning if token counts appeared in an earlier message, they would be missed.
What you expected to happen:
Metrics for streaming responses should be recorded with all labels, including `model_name` and `target_model_name`, correctly populated from the request context. The token counting logic should also be robust and accumulate usage data from all messages in the stream.
How to reproduce it (as minimally and precisely as possible):
This was discovered when refactoring the hermetic integration tests. (An illustrative client sketch follows the steps below.)
- Send a streaming request (e.g., a chat completion request) where the model name is not present in the top-level JSON body.
- Ensure the request includes the `x-gateway-api-inference-objective-key` header.
- Observe the `inference_objective_input_tokens_bucket` metric after the request completes.
- The metric will be present, but the `model_name` label will be empty (`model_name=""`).
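
As a rough illustration of the steps above, here is a minimal Go client sketch. The gateway URL, objective key value, and request body shape are assumptions for illustration only; they are not taken from the project's tests:

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"strings"
)

func main() {
	// Hypothetical gateway endpoint; adjust to your deployment.
	url := "http://localhost:8080/v1/chat/completions"

	// Streaming chat completion body with no top-level "model" field,
	// which is what triggers the empty model_name label described above.
	body := `{"messages":[{"role":"user","content":"hi"}],` +
		`"stream":true,"stream_options":{"include_usage":true}}`

	req, err := http.NewRequest(http.MethodPost, url, strings.NewReader(body))
	if err != nil {
		panic(err)
	}
	req.Header.Set("Content-Type", "application/json")
	// The objective header that should populate the model_name label.
	req.Header.Set("x-gateway-api-inference-objective-key", "my-objective")

	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		panic(err)
	}
	defer resp.Body.Close()

	// Drain the SSE stream so the response completes and metrics are recorded.
	scanner := bufio.NewScanner(resp.Body)
	for scanner.Scan() {
		fmt.Println(scanner.Text())
	}

	// Per the report, scraping metrics afterwards shows something like:
	//   inference_objective_input_tokens_bucket{model_name="",target_model_name="...",...}
	// when model_name should have been populated from the objective header.
}
```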
Anything else we need to know?:
Root Cause Analysis:
Two related issues were discovered:
- The `director.HandleRequest` function incorrectly overwrites `RequestContext.IncomingModelName` (which is correctly set from the objective header) with a value parsed from the request body's `model` field. For requests like chat completions, this field doesn't exist at the top level, causing `IncomingModelName` to be reset to an empty string. This corrupted context persists for the life of the stream and is used when the final response metrics are recorded.
- The token counting logic in `HandleResponseBodyModelStreaming` only checked the final `[DONE]` message for a `usage` block. It did not accumulate token counts from earlier messages in the stream, making it possible to miss metrics entirely. (A sketch of both fixes follows this list.)
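
A minimal sketch of both fixes, assuming OpenAI-style SSE framing and JSON shapes. The helper names (`resolveIncomingModelName`, `scanUsage`) are hypothetical and do not mirror the actual director or streaming-handler code:

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
)

// resolveIncomingModelName illustrates a fix for the first issue: prefer the
// header-derived name and only let a non-empty top-level "model" field in the
// body override it, instead of overwriting unconditionally.
func resolveIncomingModelName(fromHeader string, body []byte) string {
	var parsed struct {
		Model string `json:"model"`
	}
	if err := json.Unmarshal(body, &parsed); err == nil && parsed.Model != "" {
		return parsed.Model
	}
	return fromHeader
}

// usage mirrors an OpenAI-style usage block carried in stream messages.
type usage struct {
	PromptTokens     int `json:"prompt_tokens"`
	CompletionTokens int `json:"completion_tokens"`
	TotalTokens      int `json:"total_tokens"`
}

// scanUsage illustrates a fix for the second issue: it inspects every "data:"
// line in a streamed body chunk, not just the final [DONE] message, and keeps
// the most recent usage block seen. (Requires Go 1.20+ for bytes.CutPrefix.)
func scanUsage(chunk []byte, totals *usage) {
	for _, line := range bytes.Split(chunk, []byte("\n")) {
		data, ok := bytes.CutPrefix(bytes.TrimSpace(line), []byte("data: "))
		if !ok || bytes.Equal(data, []byte("[DONE]")) {
			continue
		}
		var msg struct {
			Usage *usage `json:"usage"`
		}
		if err := json.Unmarshal(data, &msg); err == nil && msg.Usage != nil {
			*totals = *msg.Usage
		}
	}
}

func main() {
	// The usage block arrives before [DONE]; logic that only parses the
	// final message would miss it.
	chunk := []byte("data: {\"usage\":{\"prompt_tokens\":12,\"completion_tokens\":34,\"total_tokens\":46}}\n\ndata: [DONE]\n")
	var totals usage
	scanUsage(chunk, &totals)
	fmt.Println(resolveIncomingModelName("my-objective", []byte(`{"messages":[]}`)), totals)
	// Output: my-objective {12 34 46}
}
```

Keeping the latest usage block (rather than summing) matches upstreams that report cumulative totals in a late chunk; a stream that emits per-message deltas would need to sum instead, so the right accumulation strategy depends on the upstream's semantics.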
Production Risk / Impact:
The production risk is high. This bug leads to corrupted and unusable observability data for common streaming use cases. Metrics for streaming token usage will have a missing `model_name` label, making it impossible to accurately filter, aggregate, or alert on a per-model basis, potentially breaking monitoring, billing, and capacity planning.
Environment:
- Discovered during hermetic integration test refactoring.