
[Bug]: frequency_penalty Parameter Not Working in TensorRT-LLM RC1.1.0rc5 #9364

@0xd8b

Description


System Info

TensorRT-LLM Version: RC1.1.0rc5

Model: Qwen/Qwen3-14B (same issue occurs with other models)

Command: trtllm-serve /data/Qwen3-14B/ --port 8000 --host 0.0.0.0 --kv_cache_free_gpu_memory_fraction 0.9 --extra_llm_api_options default_config.yaml
default_config.yaml:
enable_iter_req_stats: True
return_perf_metrics: True
enable_chunked_prefill: True
enable_iter_perf_stats: True
guided_decoding_backend: xgrammar

API Client: OpenAI Python Client

Base Image: TensorRT-LLM_rc1.1.0rc5

GPU: H20

Who can help?

@juney-nvidia @Tracin @laikhtewari

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. Use the code example below
2. Set frequency_penalty = 2.0
3. Observe repeated vocabulary in the model output
4. Compare outputs across frequency_penalty values (0, 1.0, 2.0) and note that there are no significant differences (a harness automating this comparison is sketched after the example)

import openai
import httpx

client = openai.OpenAI(
    base_url="http://localhost:9823/v1",  # adjust if your server listens on a different port
    api_key="",
    http_client=httpx.Client(verify=False)
)

response = client.chat.completions.create(
    model="Qwen3-14B",
    messages=[
        {"role": "system", "content": "Translate from English into Ukrainian."},
        {"role": "user", "content": "<p>As per Bijié Wǎng, Bitcoin price continues to face downward pressure...</p>"}
    ],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}
    },
    frequency_penalty=2.0,  # ⚠️ This parameter is not working
    stream_options={"include_usage": False},
    temperature=0,
    top_p=1,
    stream=True
)

# Consume the stream and print the text so the repetition is observable
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
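
For reference, a small harness along these lines can automate the comparison in step 4. It re-issues the same request non-streaming for each penalty value; the server URL, model name, and prompt mirror the example above and may need adjusting for your setup.

import openai
import httpx

client = openai.OpenAI(
    base_url="http://localhost:9823/v1",
    api_key="",
    http_client=httpx.Client(verify=False)
)

messages = [
    {"role": "system", "content": "Translate from English into Ukrainian."},
    {"role": "user", "content": "<p>As per Bijié Wǎng, Bitcoin price continues to face downward pressure...</p>"}
]

for penalty in (0.0, 1.0, 2.0):
    resp = client.chat.completions.create(
        model="Qwen3-14B",
        messages=messages,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        frequency_penalty=penalty,
        temperature=0,
        top_p=1,
    )
    text = resp.choices[0].message.content
    # With temperature=0, any difference between runs can only come from the
    # penalty; byte-identical outputs across values suggest it is ignored.
    print(f"--- frequency_penalty={penalty} ---")
    print(text)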

Expected behavior

  • Setting frequency_penalty=2.0 should significantly reduce repeated vocabulary
  • Higher penalty values should more strongly discourage the model from reusing tokens that have already appeared
  • Vocabulary diversity in the output text should be noticeably improved
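
For context, the OpenAI-style frequency penalty subtracts penalty × count from the logit of every token that has already been generated, so a value of 2.0 suppresses repeats heavily. A minimal sketch of the arithmetic (illustrative only, not TRT-LLM's implementation):

from collections import Counter

def apply_frequency_penalty(logits, generated_token_ids, penalty):
    """OpenAI-style frequency penalty: logits[t] -= penalty * count(t)."""
    counts = Counter(generated_token_ids)
    penalized = list(logits)
    for token_id, count in counts.items():
        penalized[token_id] -= penalty * count
    return penalized

# Example: with penalty=2.0, a token already generated 3 times loses 6.0
# from its logit, so further repeats become sharply less likely.
print(apply_frequency_penalty([5.0, 5.0, 5.0], [1, 1, 1, 2], 2.0))
# -> [5.0, -1.0, 3.0]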

Actual behavior

  • Output remains largely identical regardless of the frequency_penalty value (0, 1.0, 2.0)
  • Repeated vocabulary continues to appear frequently
  • Adjusting the parameter has no noticeable impact on the output

Additional notes

  • The issue persists in both streaming and non-streaming modes
  • The same code works as expected with other inference frameworks (e.g., vLLM, SGLang)
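
One quick way to rule out the Python client is to post the raw JSON and confirm the behavior is the same; a sketch against the standard OpenAI-compatible endpoint (prompt shortened for brevity):

import requests

payload = {
    "model": "Qwen3-14B",
    "messages": [
        {"role": "system", "content": "Translate from English into Ukrainian."},
        {"role": "user", "content": "Bitcoin price continues to face downward pressure."}
    ],
    "frequency_penalty": 2.0,
    "temperature": 0,
    "top_p": 1
}

# If the output still matches the penalty-free run, the parameter is being
# dropped or ignored server-side rather than by the client library.
r = requests.post("http://localhost:9823/v1/chat/completions", json=payload)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])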

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

Decoding/Sampling<NV>, bug
