
fix(langchain): Extract token usage from message.usage_metadata for streaming responses#127

Open
NikitaVoitov wants to merge 2 commits into signalfx:main from NikitaVoitov:fix/token-usage-streaming

Conversation

@NikitaVoitov

Summary

Fixes token usage extraction to support streaming mode by checking message.usage_metadata in addition to llm_output. This enables accurate token tracking for OpenAI, Anthropic, and other providers when streaming is enabled.

Fixes #126

The Bug (Before)

# Streaming mode span
Span: chat gpt-4o-mini
Attributes:
  gen_ai.request.model: gpt-4o-mini
  # NO gen_ai.usage.input_tokens
  # NO gen_ai.usage.output_tokens

The Fix (After)

# Streaming mode span
Span: chat gpt-4o-mini
Attributes:
  gen_ai.request.model: gpt-4o-mini
  gen_ai.usage.input_tokens: 40
  gen_ai.usage.output_tokens: 55

Changes

callback_handler.py

Added two helper functions and updated token extraction logic:

1. New _extract_token_usage_from_generations() helper:

def _extract_token_usage_from_generations(
    generations: list[list[Any]] | None,
) -> tuple[int | None, int | None]:
    """Extract token counts from message.usage_metadata (streaming format)."""
    if not generations:
        return None, None
    
    for generation_list in generations:
        for generation in generation_list:
            if not hasattr(generation, "message"):
                continue
            message = generation.message
            usage_meta = getattr(message, "usage_metadata", None)
            if not isinstance(usage_meta, dict) or not usage_meta:
                continue
            
            # Standard keys first, then OpenAI-style fallback
            input_tokens = usage_meta.get("input_tokens") or usage_meta.get("prompt_tokens")
            output_tokens = usage_meta.get("output_tokens") or usage_meta.get("completion_tokens")
            
            if input_tokens and output_tokens:
                return input_tokens, output_tokens
    
    return None, None
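A quick sanity check of this helper's behavior. This is a sketch, not the PR's test code: `SimpleNamespace` objects stand in for LangChain's `ChatGeneration`/`AIMessage`, and the helper's loop is reproduced inline so the snippet runs without the instrumentation package installed.

```python
from types import SimpleNamespace

# SimpleNamespace stands in for LangChain's ChatGeneration/AIMessage
# (an assumption for illustration; real callbacks receive the actual classes).
message = SimpleNamespace(usage_metadata={"input_tokens": 40, "output_tokens": 55})
generations = [[SimpleNamespace(message=message)]]

def extract(generations):
    # Mirrors _extract_token_usage_from_generations: the first generation
    # carrying a non-empty usage_metadata dict wins.
    for generation_list in generations or []:
        for generation in generation_list:
            meta = getattr(getattr(generation, "message", None), "usage_metadata", None)
            if isinstance(meta, dict) and meta:
                inp = meta.get("input_tokens") or meta.get("prompt_tokens")
                out = meta.get("output_tokens") or meta.get("completion_tokens")
                if inp and out:
                    return inp, out
    return None, None

print(extract(generations))  # (40, 55)
print(extract(None))         # (None, None)
```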

2. New _extract_token_usage_from_llm_output() helper:

def _extract_token_usage_from_llm_output(
    llm_output: dict[str, Any] | None,
    existing_input: int | None = None,
    existing_output: int | None = None,
) -> tuple[int | None, int | None]:
    """Extract token usage from llm_output (non-streaming format)."""
    if not llm_output:
        return existing_input, existing_output
    
    usage = llm_output.get("usage") or llm_output.get("token_usage") or {}
    
    input_tokens = existing_input
    if input_tokens is None:
        input_tokens = usage.get("prompt_tokens") or usage.get("input_tokens")
    
    output_tokens = existing_output
    if output_tokens is None:
        output_tokens = usage.get("completion_tokens") or usage.get("output_tokens")
    
    return input_tokens, output_tokens
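The non-streaming path can be exercised the same way. Again a standalone sketch: the lookup logic is reproduced inline, and the dict shape follows what OpenAI-style providers place in `llm_output`.

```python
# Inline reproduction of _extract_token_usage_from_llm_output's lookup,
# so the snippet runs without the instrumentation package installed.
def extract_from_llm_output(llm_output, existing_input=None, existing_output=None):
    if not llm_output:
        return existing_input, existing_output
    usage = llm_output.get("usage") or llm_output.get("token_usage") or {}
    if existing_input is None:
        existing_input = usage.get("prompt_tokens") or usage.get("input_tokens")
    if existing_output is None:
        existing_output = usage.get("completion_tokens") or usage.get("output_tokens")
    return existing_input, existing_output

llm_output = {"token_usage": {"prompt_tokens": 31, "completion_tokens": 5}}
print(extract_from_llm_output(llm_output))            # (31, 5)
print(extract_from_llm_output(llm_output, 40, None))  # (40, 5) -- existing value kept
```

Note that counts already found (e.g. from `usage_metadata`) are never overwritten; the fallback only fills in values that are still `None`.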

3. Updated on_llm_end() with priority-based extraction:

# Before - only checked llm_output
llm_output = getattr(response, "llm_output", {}) or {}
usage = llm_output.get("usage") or llm_output.get("token_usage") or {}
inv.input_tokens = usage.get("prompt_tokens")
inv.output_tokens = usage.get("completion_tokens")

# After - checks both sources with priority
# Priority: message.usage_metadata (streaming) > llm_output (non-streaming)
input_tokens, output_tokens = _extract_token_usage_from_generations(generations)

# Fallback to llm_output for non-streaming responses
if input_tokens is None or output_tokens is None:
    llm_output = getattr(response, "llm_output", {}) or {}
    input_tokens, output_tokens = _extract_token_usage_from_llm_output(
        llm_output, input_tokens, output_tokens
    )

inv.input_tokens = input_tokens
inv.output_tokens = output_tokens

Token Source Priority

| Priority | Source | Mode | Keys checked |
| --- | --- | --- | --- |
| 1 (highest) | `message.usage_metadata` | Streaming | `input_tokens`, `output_tokens` |
| 2 (fallback) | `llm_output.token_usage` | Non-streaming | `prompt_tokens`, `completion_tokens` |
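To see the priority in action, here is an end-to-end sketch with a mock response carrying both sources. The mock objects and token values are illustrative; the combined logic mirrors the updated `on_llm_end()` (with the extraction loops inlined so the snippet runs standalone).

```python
from types import SimpleNamespace

# Mock LLMResult-like object carrying BOTH sources: streaming usage_metadata
# (40/57) and non-streaming llm_output (31/5). Values are illustrative.
message = SimpleNamespace(usage_metadata={"input_tokens": 40, "output_tokens": 57})
response = SimpleNamespace(
    generations=[[SimpleNamespace(message=message)]],
    llm_output={"token_usage": {"prompt_tokens": 31, "completion_tokens": 5}},
)

def resolve_tokens(response):
    # Priority 1: message.usage_metadata (streaming).
    inp = out = None
    for gen_list in response.generations or []:
        for gen in gen_list:
            meta = getattr(getattr(gen, "message", None), "usage_metadata", None)
            if isinstance(meta, dict) and meta:
                inp = meta.get("input_tokens") or meta.get("prompt_tokens")
                out = meta.get("output_tokens") or meta.get("completion_tokens")
    # Priority 2: llm_output (non-streaming), only for counts still missing.
    # (The "usage" key fallback is omitted here for brevity.)
    if inp is None or out is None:
        usage = (getattr(response, "llm_output", {}) or {}).get("token_usage") or {}
        inp = inp if inp is not None else usage.get("prompt_tokens")
        out = out if out is not None else usage.get("completion_tokens")
    return inp, out

print(resolve_tokens(response))  # (40, 57) -- usage_metadata wins over llm_output

# Without usage_metadata, the llm_output fallback supplies the counts.
no_meta = SimpleNamespace(
    generations=[[SimpleNamespace(message=SimpleNamespace())]],
    llm_output={"token_usage": {"prompt_tokens": 31, "completion_tokens": 5}},
)
print(resolve_tokens(no_meta))  # (31, 5)
```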

Testing

  • test_token_usage_extraction_streaming_mode - verifies message.usage_metadata extraction
  • test_token_usage_extraction_non_streaming_mode - verifies llm_output extraction
  • test_token_usage_streaming_priority - verifies streaming source takes precedence
  • All existing tests pass

Evidence

Live test showing the fix works:

| Test | `response.usage_metadata` | Trace `gen_ai.usage.*` |
| --- | --- | --- |
| Non-streaming | input_tokens: 31, output_tokens: 5 | 31, 5 |
| Streaming (before fix) | input_tokens: 40, output_tokens: 55 | MISSING |
| Streaming (after fix) | input_tokens: 40, output_tokens: 57 | 40, 57 |

Trace Evidence:

Files Changed

| File | Changes |
| --- | --- |
| instrumentation-genai/opentelemetry-instrumentation-langchain/src/opentelemetry/instrumentation/langchain/callback_handler.py | Added `_extract_token_usage_from_generations()` and `_extract_token_usage_from_llm_output()`; updated `on_llm_end()` |
| instrumentation-genai/opentelemetry-instrumentation-langchain/tests/test_callback_handler_agent.py | Added 3 tests for streaming token extraction |

…treaming responses

Token usage attributes (gen_ai.usage.input_tokens, gen_ai.usage.output_tokens) were
missing when LLM streaming was enabled because the code only checked llm_output.token_usage.

In streaming mode, LangChain puts token counts in message.usage_metadata instead.
This fix adds priority-based extraction:
1. First check message.usage_metadata (streaming mode)
2. Fallback to llm_output.token_usage (non-streaming mode)

Adds two helper functions:
- _extract_token_usage_from_generations(): extracts from usage_metadata
- _extract_token_usage_from_llm_output(): extracts from llm_output

Tests added:
- test_token_usage_extraction_streaming_mode
- test_token_usage_extraction_non_streaming_mode
- test_token_usage_streaming_priority

Affects: OpenAI, Anthropic, Google, and other providers using streaming with usage_metadata
Before fix (trace_streaming_before_fix.json):
- Trace ID: 77432872a967d4321701ce1f22032d8c
- gen_ai.usage.input_tokens: MISSING
- gen_ai.usage.output_tokens: MISSING

After fix (trace_streaming_after_fix.json):
- Trace ID: 303595c0d1031acdae9bacd46083d87b
- gen_ai.usage.input_tokens: 40
- gen_ai.usage.output_tokens: 57
@NikitaVoitov NikitaVoitov requested review from a team as code owners January 13, 2026 13:58
@github-actions


Thank you for your submission, we really appreciate it. Like many open-source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution. You can sign the CLA by posting a Pull Request comment in the format below.


I have read the CLA Document and I hereby sign the CLA


You can retrigger this bot by commenting recheck in this Pull Request. Posted by the CLA Assistant Lite bot.

Contributor

@zhirafovod left a comment


@NikitaVoitov, thank you for creating the PR!

Can you add the real app which you used to get these traces? I am specifically trying to understand when `usage.get("prompt_tokens") or usage.get("input_tokens")` can be the use case?



Development

Successfully merging this pull request may close these issues.

[Bug] gen_ai.usage.input_tokens and output_tokens missing when LLM streaming is used
