Usage of max_tokens is ambiguous, deprecated, and causing errors. #262

@khaotik

Description

Hello, I'm having an issue while trying to set up this project with the gemini-2.5-flash model.

TL;DR: The max_tokens parameter in the OpenAI chat completion API is deprecated by OpenAI and is causing errors due to its ambiguous interpretation.


I am using the latest git clone of this project.
After a successful run solving a simple problem, I tried launching the server with:

export OPENAI_API_KEY=<omitted-gemini-api-key>
python optillm.py --base-url https://generativelanguage.googleapis.com/v1beta/openai/ --model gemini-2.5-flash

Then I gave it a slightly more sophisticated question via a localhost API call:

import openai

api = openai.OpenAI(api_key='', base_url='http://127.0.0.1:8000/v1')

question = "What is a simple closed-form recurrence relation for resultant polynomial $P_m(z) = Res_x(x^p-1, z - x^m + x)$ with regard to positive integer $m$, modulo polynomial ring $x^p-1$ where $p$ is a prime number? This recurrence relation should contain enough information to compute $P_m(z)$ for any $m$"

resp = api.chat.completions.create(
    model='gemini-2.5-flash',
    messages=[{
        "role": "user",
        "content": question
    }],
    extra_body={'optillm_approach': 'deepthink'}
)

Upon running, I see a lot of errors such as:

2025-10-24 04:19:16,577 - ERROR - None: Error processing request: 'NoneType' object has no attribute 'strip'

Upon deeper investigation, I've found that Gemini treats max_tokens as the count of both reasoning tokens and output tokens. There are a few places in this proxy where a moderate amount of reasoning tokens is required but only a short output. An example is here.

I'm guessing some LLM providers treat max_tokens as visible output tokens only. However, after a bit of experimentation, I believe Gemini's OAI endpoint treats max_tokens as the sum of reasoning tokens and output tokens. Whenever a provider uses the latter interpretation, the output is likely to be truncated by long reasoning, as the sketch below illustrates.
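To make the failure mode concrete, here is a minimal sketch against Gemini's OpenAI-compatible endpoint (the small max_tokens value and the prompt are hypothetical choices of mine to force the issue): when hidden reasoning consumes the whole max_tokens budget, message.content comes back as None, and any downstream .strip() call raises exactly the AttributeError seen in the log above.

import openai

# Sketch, assuming Gemini's OAI endpoint counts reasoning tokens
# against max_tokens, as described above.
client = openai.OpenAI(
    api_key='<omitted-gemini-api-key>',
    base_url='https://generativelanguage.googleapis.com/v1beta/openai/',
)

resp = client.chat.completions.create(
    model='gemini-2.5-flash',
    messages=[{'role': 'user', 'content': 'Prove that sqrt(2) is irrational.'}],
    max_tokens=128,  # hypothetical small budget; reasoning alone may exhaust it
)

choice = resp.choices[0]
print(choice.finish_reason)    # likely 'length': generation was cut off
print(choice.message.content)  # may be None: every token went to hidden reasoning
# choice.message.content.strip()  # -> AttributeError: 'NoneType' object has no attribute 'strip'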

The OpenAI docs say:

max_tokens (Deprecated; integer or null; Optional)

The maximum number of tokens that can be generated in the chat completion. This value can be used to control costs for text generated via API.

This value is now deprecated in favor of max_completion_tokens, and is not compatible with o-series models.
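A possible fix, sketched below under the assumption that the proxy can translate the parameter in one place (the helper name create_completion is mine, not from this codebase; openai.BadRequestError is the openai-python v1 exception for HTTP 400): always send max_completion_tokens, and fall back to the deprecated max_tokens only for providers that reject the newer name.

import openai

def create_completion(client: openai.OpenAI, **kwargs):
    # Sketch: translate the deprecated max_tokens into max_completion_tokens.
    if 'max_tokens' in kwargs:
        kwargs['max_completion_tokens'] = kwargs.pop('max_tokens')
    try:
        return client.chat.completions.create(**kwargs)
    except openai.BadRequestError:
        # Some providers may not accept max_completion_tokens yet;
        # retry once with the deprecated name.
        kwargs['max_tokens'] = kwargs.pop('max_completion_tokens')
        return client.chat.completions.create(**kwargs)

Note that, as I understand it, max_completion_tokens on OpenAI's side also counts reasoning tokens, so renaming the parameter alone may not be enough: the short-output call sites in this proxy likely also need a larger budget whenever a reasoning model is in use.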
