
[Bug]: frequency_penalty Parameter Not Working in TensorRT-LLM RC1.1.0rc5 #9364

@0xd8b

Description


System Info

TensorRT-LLM Version: RC1.1.0rc5

Model: Qwen/Qwen3-14B (same issue occurs with other models)

Command: trtllm-serve /data/Qwen3-14B/ --port 8000 --host 0.0.0.0 --kv_cache_free_gpu_memory_fraction 0.9 --extra_llm_api_options default_config.yaml
default_config.yaml:
enable_iter_req_stats: True
return_perf_metrics: True
enable_chunked_prefill: True
enable_iter_perf_stats: True
guided_decoding_backend: xgrammar

API Client: OpenAI Python Client

Base Image: TensorRT-LLM_rc1.1.0rc5

GPU: H20

Who can help?

@juney-nvidia @Tracin @laikhtewari

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

1. Use the code example below
2. Set frequency_penalty = 2.0
3. Observe repeated vocabulary in the model output
4. Compare outputs across frequency_penalty values (0, 1.0, 2.0) and note that there are no significant differences (a harness automating this comparison is sketched after the example)

import openai
import httpx

client = openai.OpenAI(
    base_url="http://localhost:9823/v1",  # adjust if your server listens on a different port
    api_key="",
    http_client=httpx.Client(verify=False)
)

response = client.chat.completions.create(
    model="Qwen3-14B",
    messages=[
        {"role": "system", "content": "Translate from English into Ukrainian."},
        {"role": "user", "content": "<p>As per Bijié Wǎng, Bitcoin price continues to face downward pressure...</p>"}
    ],
    extra_body={
        "chat_template_kwargs": {"enable_thinking": False}
    },
    frequency_penalty=2.0,  # ⚠️ This parameter is not working
    stream_options={"include_usage": False},
    temperature=0,
    top_p=1,
    stream=True
)

# Consume the stream and print the text so the repetition is observable
for chunk in response:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()
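
For reference, a small harness along these lines can automate the comparison in step 4. It re-issues the same request non-streaming for each penalty value; the server URL, model name, and prompt mirror the example above and may need adjusting for your setup.

import openai
import httpx

client = openai.OpenAI(
    base_url="http://localhost:9823/v1",
    api_key="",
    http_client=httpx.Client(verify=False)
)

messages = [
    {"role": "system", "content": "Translate from English into Ukrainian."},
    {"role": "user", "content": "<p>As per Bijié Wǎng, Bitcoin price continues to face downward pressure...</p>"}
]

for penalty in (0.0, 1.0, 2.0):
    resp = client.chat.completions.create(
        model="Qwen3-14B",
        messages=messages,
        extra_body={"chat_template_kwargs": {"enable_thinking": False}},
        frequency_penalty=penalty,
        temperature=0,
        top_p=1,
    )
    text = resp.choices[0].message.content
    # With temperature=0, any difference between runs can only come from the
    # penalty; byte-identical outputs across values suggest it is ignored.
    print(f"--- frequency_penalty={penalty} ---")
    print(text)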

Expected behavior

  • Setting frequency_penalty=2.0 should significantly reduce repeated vocabulary
  • Higher penalty values should more strongly discourage the model from reusing tokens that have already appeared
  • Vocabulary diversity in the output text should be noticeably improved
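
For context, the OpenAI-style frequency penalty subtracts penalty × count from the logit of every token that has already been generated, so a value of 2.0 suppresses repeats heavily. A minimal sketch of the arithmetic (illustrative only, not TRT-LLM's implementation):

from collections import Counter

def apply_frequency_penalty(logits, generated_token_ids, penalty):
    """OpenAI-style frequency penalty: logits[t] -= penalty * count(t)."""
    counts = Counter(generated_token_ids)
    penalized = list(logits)
    for token_id, count in counts.items():
        penalized[token_id] -= penalty * count
    return penalized

# Example: with penalty=2.0, a token already generated 3 times loses 6.0
# from its logit, so further repeats become sharply less likely.
print(apply_frequency_penalty([5.0, 5.0, 5.0], [1, 1, 1, 2], 2.0))
# -> [5.0, -1.0, 3.0]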

Actual behavior

  • Output remains largely identical regardless of the frequency_penalty value (0, 1.0, 2.0)
  • Repeated vocabulary continues to appear frequently
  • Adjusting the parameter has no noticeable impact on the output

Additional notes

  • The issue persists in both streaming and non-streaming modes
  • The same code works as expected with other inference frameworks (e.g., vLLM, SGLang)
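
One quick way to rule out the Python client is to post the raw JSON and confirm the behavior is the same; a sketch against the standard OpenAI-compatible endpoint (prompt shortened for brevity):

import requests

payload = {
    "model": "Qwen3-14B",
    "messages": [
        {"role": "system", "content": "Translate from English into Ukrainian."},
        {"role": "user", "content": "Bitcoin price continues to face downward pressure."}
    ],
    "frequency_penalty": 2.0,
    "temperature": 0,
    "top_p": 1
}

# If the output still matches the penalty-free run, the parameter is being
# dropped or ignored server-side rather than by the client library.
r = requests.post("http://localhost:9823/v1/chat/completions", json=payload)
r.raise_for_status()
print(r.json()["choices"][0]["message"]["content"])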

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Labels

Decoding/Sampling<NV>, bug
