Usage of max_tokens is ambiguous, deprecated, and causing errors. #262

@khaotik

Description

Hello, I'm having an issue while trying to set up this project with the gemini-2.5-flash model.

TL;DR: The max_tokens parameter in the OpenAI chat completion API is deprecated by OpenAI and is causing errors due to its ambiguous interpretation.


I am using the latest git clone of this project.
After a successful run solving a simple problem, I tried launching the server with:

export OPENAI_API_KEY=<omitted-gemini-api-key>
python optillm.py --base-url https://generativelanguage.googleapis.com/v1beta/openai/ --model gemini-2.5-flash

Then I gave it a slightly more sophisticated question via a localhost API call:

import openai

api = openai.OpenAI(api_key='', base_url='http://127.0.0.1:8000/v1')

question = "What is a simple closed-form recurrence relation for resultant polynomial $P_m(z) = Res_x(x^p-1, z - x^m + x)$ with regard to positive integer $m$, modulo polynomial ring $x^p-1$ where $p$ is a prime number? This recurrence relation should contain enough information to compute $P_m(z)$ for any $m$"

resp = api.chat.completions.create(
    model='gemini-2.5-flash',
    messages=[{
        "role": "user",
        "content": question
    }],
    extra_body={'optillm_approach': 'deepthink'}
)

Upon running, I see a lot of errors such as:

2025-10-24 04:19:16,577 - ERROR - None: Error processing request: 'NoneType' object has no attribute 'strip'

Upon deeper investigation, I've found that Gemini treats max_tokens as the count of both reasoning tokens and output tokens. There are a few places in this proxy where a moderate amount of reasoning tokens is required but only a short output. An example is here.

I'm guessing some LLM providers treat max_tokens as visible output tokens only. However, after a bit of experimentation, I believe Gemini's OAI endpoint treats max_tokens as the sum of reasoning tokens and output tokens. Whenever a provider uses the latter interpretation, the output is likely to be truncated by long reasoning, as the sketch below illustrates.
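To make the failure mode concrete, here is a minimal sketch against Gemini's OpenAI-compatible endpoint (the small max_tokens value and the prompt are hypothetical choices of mine to force the issue): when hidden reasoning consumes the whole max_tokens budget, message.content comes back as None, and any downstream .strip() call raises exactly the AttributeError seen in the log above.

import openai

# Sketch, assuming Gemini's OAI endpoint counts reasoning tokens
# against max_tokens, as described above.
client = openai.OpenAI(
    api_key='<omitted-gemini-api-key>',
    base_url='https://generativelanguage.googleapis.com/v1beta/openai/',
)

resp = client.chat.completions.create(
    model='gemini-2.5-flash',
    messages=[{'role': 'user', 'content': 'Prove that sqrt(2) is irrational.'}],
    max_tokens=128,  # hypothetical small budget; reasoning alone may exhaust it
)

choice = resp.choices[0]
print(choice.finish_reason)    # likely 'length': generation was cut off
print(choice.message.content)  # may be None: every token went to hidden reasoning
# choice.message.content.strip()  # -> AttributeError: 'NoneType' object has no attribute 'strip'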

The OpenAI docs say:

max_tokens (Deprecated; integer or null; Optional)

The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via API.

This value is now deprecated in favor of max_completion_tokens, and is not compatible with o-series models.
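A possible fix, sketched below under the assumption that the proxy can translate the parameter in one place (the helper name create_completion is mine, not from this codebase; openai.BadRequestError is the openai-python v1 exception for HTTP 400): always send max_completion_tokens, and fall back to the deprecated max_tokens only for providers that reject the newer name.

import openai

def create_completion(client: openai.OpenAI, **kwargs):
    # Sketch: translate the deprecated max_tokens into max_completion_tokens.
    if 'max_tokens' in kwargs:
        kwargs['max_completion_tokens'] = kwargs.pop('max_tokens')
    try:
        return client.chat.completions.create(**kwargs)
    except openai.BadRequestError:
        # Some providers may not accept max_completion_tokens yet;
        # retry once with the deprecated name.
        kwargs['max_tokens'] = kwargs.pop('max_completion_tokens')
        return client.chat.completions.create(**kwargs)

Note that, as I understand it, max_completion_tokens on OpenAI's side also counts reasoning tokens, so renaming the parameter alone may not be enough: the short-output call sites in this proxy likely also need a larger budget whenever a reasoning model is in use.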
