Gemini OpenAI API overcounts tokens in streaming mode #5122

@raghotham

Description

System Info

Using Gemini Inference Provider

Information

  • The official example scripts
  • My own modified scripts

🐛 Describe the bug

Bug is described here: https://discuss.ai.google.dev/t/endpoint-https-generativelanguage-googleapis-com-v1beta-openai-chat-completions-is-not-compliant-with-api-specs/127400

Workarounds suggested:

  • For Custom Implementations: Modify your stream-processing loop so that it does not accumulate the usage field across chunks. Instead, capture the usage data only from the very last chunk (or from the chunk whose finish_reason is not null).

  • For Effect-TS Users: This specific bug was fixed in version 3.14.1 of @effect/ai-openai. You should update to the latest version to handle "arbitrary length StreamChunkParts."

  • For OpenAI SDK Users: Some developers have found that setting stream_options: { include_usage: false } and instead using a separate token_count metadata call is safer until Google aligns its endpoint with the spec.
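The first workaround above can be sketched as follows. This is a minimal illustration, not the project's actual implementation: the `Chunk` and `Usage` dataclasses are hypothetical stand-ins for the OpenAI SDK's streaming chunk shape, and the simulated stream mimics the buggy behavior where every chunk carries cumulative usage.

```python
from dataclasses import dataclass
from typing import Iterable, Optional

@dataclass
class Usage:
    prompt_tokens: int
    completion_tokens: int
    total_tokens: int

@dataclass
class Chunk:
    content: str
    usage: Optional[Usage]  # the buggy endpoint populates this on every chunk

def consume_stream(chunks: Iterable[Chunk]) -> tuple[str, Optional[Usage]]:
    """Accumulate text, but take usage only from the last chunk that has it."""
    text = []
    final_usage: Optional[Usage] = None
    for chunk in chunks:
        text.append(chunk.content)
        if chunk.usage is not None:
            final_usage = chunk.usage  # overwrite, never add
    return "".join(text), final_usage

# Simulated stream where every chunk carries cumulative usage; naively
# summing the usage fields would report 4 + 9 + 15 = 28 total tokens.
stream = [
    Chunk("Hel", Usage(3, 1, 4)),
    Chunk("lo ", Usage(7, 2, 9)),
    Chunk("world", Usage(12, 3, 15)),
]
text, usage = consume_stream(stream)
print(text)                # Hello world
print(usage.total_tokens)  # 15
```

Because the final usage value simply overwrites earlier ones, the loop is correct both for endpoints that (per the spec) send usage only in the last chunk and for Gemini's current behavior of sending it in every chunk.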

Error logs

n/a

Expected behavior

Usage tokens should be returned only in the last chunk; the endpoint should not report usage in every chunk or overcount by accumulating it across chunks.

Metadata

Assignees: No one assigned
Labels: bug (Something isn't working)
Type: No type
Projects: No projects
Milestone: No milestone
Relationships: None yet
Development: No branches or pull requests
