[Bug]: Repetition, Frequency, and Presence Penalty Not Applied in TensorRT-LLM Completions API #9442

### System Info

- NVIDIA H100 80GB
- Docker images tested: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5, nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3


### Who can help?

_No response_

### Information

- [ ] The official example scripts
- [ ] My own modified scripts

### Tasks

- [ ] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

With `nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5` or `nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3`, the `repetition_penalty`, `frequency_penalty`, and `presence_penalty` parameters of the Completions API (`/v1/completions`) have no effect on the generated output.

Steps to Reproduce:

Using the OpenAI Python client (AsyncOpenAI):
```python
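from openai import AsyncOpenAI, DefaultAioHttpClient

# model_input_payload, daemon, and model come from the calling application (not shown).
openai_payload = {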
    "model": model_input_payload["model"],
    "stream": model_input_payload["stream"],
    "n": model_input_payload["n"],
    "prompt": model_input_payload["prompt"],
    "max_tokens": model_input_payload["max_tokens"],
    "temperature": model_input_payload["temperature"],
    "top_p": model_input_payload["top_p"],
    "frequency_penalty": 2,
    "presence_penalty": 2,
    "seed": None,
}

extra_body = {
    "detokenize": model_input_payload["detokenize"],
    "min_tokens": model_input_payload["min_tokens"],
    "stop_token_ids": model_input_payload["stop_token_ids"],
    "repetition_penalty": 100,
    "top_k": model_input_payload["top_k"],
}

client = AsyncOpenAI(
    base_url=f"{daemon.protocol}://{daemon.domain}:{model.port}/v1",
    api_key="dummy-key",
    http_client=DefaultAioHttpClient(),
)

# Must run inside an async function (or an environment with top-level await).
resp = await client.completions.create(**openai_payload, extra_body=extra_body)
```

Using requests.post:
```python
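import requests

# Same endpoint and fields as above; the extra_body fields are merged into the JSON body.
url = f"{daemon.protocol}://{daemon.domain}:{model.port}/v1/completions"
headers = {
    "Authorization": "Bearer dummy-key",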
    "Content-Type": "application/json",
}
payload = {**openai_payload, **extra_body}

resp = requests.post(url, json=payload, headers=headers)
resp.raise_for_status()
resp_data = resp.json()
```

### Expected behavior

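The penalties should influence token selection as they do in OpenAI-style APIs: `frequency_penalty` and `presence_penalty` should lower the logits of tokens that have already appeared, and `repetition_penalty` should further discourage repeated tokens.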
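
For reference, here is a minimal sketch of how these three penalties are commonly applied to logits before sampling. It follows the additive formula from the OpenAI documentation for `frequency_penalty` and `presence_penalty` and the multiplicative convention popularized by CTRL and Hugging Face for `repetition_penalty`. The function name `apply_penalties`, the order of operations, and the choice to count only generated tokens are assumptions made for illustration; this is not TensorRT-LLM's implementation.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of `logits` (list of floats indexed by token id).

    Hypothetical helper for illustration; real engines may also count prompt
    tokens or apply the penalties in a different order.
    """
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        value = out[token_id]
        # Multiplicative repetition penalty: shrink positive logits,
        # push negative logits further down.
        if repetition_penalty != 1.0:
            if value > 0:
                value = value / repetition_penalty
            else:
                value = value * repetition_penalty
        # Additive penalties: frequency scales with the count,
        # presence is a flat cost for any token seen at least once.
        value -= frequency_penalty * count
        value -= presence_penalty
        out[token_id] = value
    return out


logits = [2.0, 1.0, -1.0, 0.5]
penalized = apply_penalties(logits, generated_ids=[0, 0, 2],
                            repetition_penalty=2.0,
                            frequency_penalty=0.5, presence_penalty=1.0)
print(penalized)  # [-1.0, 1.0, -3.5, 0.5]
```

Under this scheme, values such as `repetition_penalty=100` or `frequency_penalty=2` would be expected to change the ranking of already-generated tokens noticeably.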


### Actual behavior

The output does not change regardless of the values of `repetition_penalty`, `frequency_penalty`, or `presence_penalty`.

Even extreme values (e.g., `repetition_penalty=100`, `frequency_penalty=2`) leave the generated output unchanged.
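
One way to confirm this is an A/B check: send the same prompt twice, once with neutral penalties and once with the extreme values above, and compare the completions. The sketch below assumes the server accepts `temperature=0` and a fixed `seed` for deterministic decoding; `penalty_variants`, `penalties_have_effect`, and the `send` callable are hypothetical names introduced for illustration.

```python
import copy


def penalty_variants(base_payload):
    """Build a neutral payload and one with extreme penalties.

    Both use temperature 0 and the same seed (assumed to be supported) so
    that any output difference can be attributed to the penalties alone.
    """
    baseline = copy.deepcopy(base_payload)
    baseline.update({"temperature": 0, "seed": 0,
                     "frequency_penalty": 0, "presence_penalty": 0,
                     "repetition_penalty": 1.0})
    penalized = copy.deepcopy(baseline)
    penalized.update({"frequency_penalty": 2, "presence_penalty": 2,
                      "repetition_penalty": 100})
    return baseline, penalized


def penalties_have_effect(send, base_payload):
    """Return True if the two variants yield different completion text.

    `send` is a hypothetical callable: payload dict -> parsed JSON response
    in the OpenAI completions format.
    """
    baseline, penalized = penalty_variants(base_payload)
    text_a = send(baseline)["choices"][0]["text"]
    text_b = send(penalized)["choices"][0]["text"]
    return text_a != text_b
```

With the `requests` reproduction above, `send` could be `lambda p: requests.post(url, json=p, headers=headers).json()`. On the affected images the check would be expected to return `False`.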


### Additional notes

Nothing

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.
