[Bug]: Repetition, Frequency, and Presence Penalty Not Applied in TensorRT-LLM Completions API #9442

@akakakakakaa

Description

System Info

  • NVIDIA H100 80GB
  • Docker images tested: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5, nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

When using nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5 or nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3, the repetition_penalty, frequency_penalty, and presence_penalty parameters in the Completions API do not have any effect.

Steps to Reproduce:

Using the OpenAI Python client (AsyncOpenAI):

from openai import AsyncOpenAI, DefaultAioHttpClient

# Standard OpenAI Completions parameters.
openai_payload = {
    "model": model_input_payload["model"],
    "stream": model_input_payload["stream"],
    "n": model_input_payload["n"],
    "prompt": model_input_payload["prompt"],
    "max_tokens": model_input_payload["max_tokens"],
    "temperature": model_input_payload["temperature"],
    "top_p": model_input_payload["top_p"],
    "frequency_penalty": 2,
    "presence_penalty": 2,
    "seed": None,
}

# Non-OpenAI sampling parameters, sent through the client's extra_body.
extra_body = {
    "detokenize": model_input_payload["detokenize"],
    "min_tokens": model_input_payload["min_tokens"],
    "stop_token_ids": model_input_payload["stop_token_ids"],
    "repetition_penalty": 100,
    "top_k": model_input_payload["top_k"],
}

client = AsyncOpenAI(
    base_url=f"{daemon.protocol}://{daemon.domain}:{model.port}/v1",
    api_key="dummy-key",
    http_client=DefaultAioHttpClient(),
)

# Must be awaited from inside an async function (for example, one run with asyncio.run).
resp = await client.completions.create(**openai_payload, extra_body=extra_body)

Using requests.post:

import requests

url = f"{daemon.protocol}://{daemon.domain}:{model.port}/v1/completions"
headers = {
    "Authorization": "Bearer dummy-key",
    "Content-Type": "application/json",
}
# Same parameters as above, merged into a single JSON body.
payload = {**openai_payload, **extra_body}

resp = requests.post(url, json=payload, headers=headers)
resp.raise_for_status()
resp_data = resp.json()

Expected behavior

The penalties should influence token selection according to the usual OpenAI-style behavior.
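For reference, the expected effect can be sketched on a toy logit vector. This is a minimal illustration, not TensorRT-LLM's implementation: the additive frequency and presence terms follow the formula in the OpenAI API documentation, the multiplicative repetition penalty follows the common CTRL/Hugging Face convention, and the function name `apply_penalties`, the order in which the three penalties are combined, and the choice to count only generated tokens are assumptions made for this example.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, frequency_penalty=0.0,
                    presence_penalty=0.0, repetition_penalty=1.0):
    """Return a penalized copy of `logits` (a list indexed by token id).

    Hypothetical helper for illustration; a real engine may combine the
    penalties in a different order or also count prompt tokens.
    """
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        value = out[token_id]
        # Repetition penalty (multiplicative): shrink positive logits and
        # push negative logits further down.
        if value > 0:
            value /= repetition_penalty
        else:
            value *= repetition_penalty
        # Frequency penalty (additive): scales with how often the token appeared.
        value -= count * frequency_penalty
        # Presence penalty (additive): applied once to any token seen at least once.
        value -= presence_penalty
        out[token_id] = value
    return out


logits = [2.0, 1.0, -1.0]
penalized = apply_penalties(logits, generated_ids=[0, 0, 2],
                            frequency_penalty=0.5, presence_penalty=1.0,
                            repetition_penalty=2.0)
print(penalized)
```

With these inputs the most likely token shifts from token 0 to token 1, which is the kind of visible change the report expects from non-default penalty values.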

Actual behavior

The output does not change regardless of the values of repetition_penalty, frequency_penalty, or presence_penalty.

Setting extreme values (e.g., repetition_penalty=100, frequency_penalty=2) does not affect the generated output at all.
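One way to make this observation easy to verify is an A/B comparison that sends the same prompt twice, once without penalties and once with the extreme values above. The sketch below uses only the standard library; the helper names (`build_ab_payloads`, `complete`, `penalties_ignored`) are hypothetical, and it assumes the server accepts `temperature=0` and a fixed `seed` so that any difference between the two outputs can be attributed to the penalties.

```python
import json
import urllib.request


def build_ab_payloads(model, prompt, max_tokens=64):
    """Build a baseline payload and an otherwise identical penalized one."""
    base = {
        "model": model,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": 0,  # assumption: the server treats this as greedy decoding
        "seed": 0,         # assumption: a fixed seed keeps runs comparable
    }
    penalized = {
        **base,
        "frequency_penalty": 2,
        "presence_penalty": 2,
        "repetition_penalty": 100,
    }
    return base, penalized


def complete(url, payload, api_key="dummy-key"):
    """POST a Completions request and return the first choice's text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)["choices"][0]["text"]


def penalties_ignored(url, model, prompt):
    """Return True when both requests produce identical text."""
    base, penalized = build_ab_payloads(model, prompt)
    return complete(url, base) == complete(url, penalized)
```

Calling `penalties_ignored` with the `/v1/completions` URL of the running server and a prompt that tends to produce repetitive text would return `True` if the behavior described in this report is present.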

Additional notes

Nothing

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Decoding/Sampling<NV>: Token sampling algorithms in TRTLLM for text gen (top-k, top-p, beam).
  • bug: Something isn't working
  • stale
  • waiting for feedback
