Closed as not planned
Labels
Decoding/Sampling, bug, stale, waiting for feedback
Description
System Info
- NVIDIA H100 80GB
- Docker images tested: nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5, nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3
Who can help?
No response
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
When using nvcr.io/nvidia/tensorrt-llm/release:1.1.0rc5 or nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc3, the repetition_penalty, frequency_penalty, and presence_penalty parameters in the Completions API do not have any effect.
Steps to Reproduce:
Using the OpenAI Python client (AsyncOpenAI):
from openai import AsyncOpenAI, DefaultAioHttpClient

# model_input_payload, daemon, and model come from the caller's own setup.
openai_payload = {
    "model": model_input_payload["model"],
    "stream": model_input_payload["stream"],
    "n": model_input_payload["n"],
    "prompt": model_input_payload["prompt"],
    "max_tokens": model_input_payload["max_tokens"],
    "temperature": model_input_payload["temperature"],
    "top_p": model_input_payload["top_p"],
    "frequency_penalty": 2,
    "presence_penalty": 2,
    "seed": None,
}
# Parameters outside the OpenAI schema are passed through extra_body.
extra_body = {
    "detokenize": model_input_payload["detokenize"],
    "min_tokens": model_input_payload["min_tokens"],
    "stop_token_ids": model_input_payload["stop_token_ids"],
    "repetition_penalty": 100,
    "top_k": model_input_payload["top_k"],
}
client = AsyncOpenAI(
    base_url=f"{daemon.protocol}://{daemon.domain}:{model.port}/v1",
    api_key="dummy-key",
    http_client=DefaultAioHttpClient(),
)
# Must be awaited from inside an async function.
resp = await client.completions.create(**openai_payload, extra_body=extra_body)

Using requests.post:
import requests

url = f"{daemon.protocol}://{daemon.domain}:{model.port}/v1/completions"
headers = {
    "Authorization": "Bearer dummy-key",
    "Content-Type": "application/json",
}
# Same fields as above, merged into a single JSON body.
payload = {**openai_payload, **extra_body}
resp = requests.post(url, json=payload, headers=headers)
resp.raise_for_status()
resp_data = resp.json()

Expected behavior
The penalties should influence token selection according to the usual OpenAI-style behavior.
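For reference, here is a minimal sketch of how these three penalties are commonly described as acting on logits. It is not TensorRT-LLM's implementation: the function name `apply_penalties` is hypothetical, the additive frequency/presence form follows the formula in the OpenAI documentation, and the multiplicative repetition penalty follows the widely used CTRL-style rule. Which tokens count as "seen" (prompt, generated, or both) and the order in which the penalties are applied vary by implementation; the sketch assumes only the generated tokens count.

```python
from collections import Counter


def apply_penalties(logits, generated_ids, repetition_penalty=1.0,
                    frequency_penalty=0.0, presence_penalty=0.0):
    """Return a penalized copy of `logits` (list of floats indexed by token id).

    Illustrative only; real servers may differ in ordering and token scope.
    """
    counts = Counter(generated_ids)
    out = list(logits)
    for token_id, count in counts.items():
        value = out[token_id]
        # CTRL-style repetition penalty: shrink positive logits,
        # push negative logits further down.
        if value > 0:
            value /= repetition_penalty
        else:
            value *= repetition_penalty
        # OpenAI-style additive penalties: one scales with the count,
        # the other is applied once to any token that has appeared.
        value -= frequency_penalty * count
        value -= presence_penalty
        out[token_id] = value
    return out


logits = [2.0, 1.0, -1.0]
penalized = apply_penalties(logits, generated_ids=[0, 0, 2],
                            repetition_penalty=2.0,
                            frequency_penalty=0.5,
                            presence_penalty=1.0)
print(penalized)  # [-1.0, 1.0, -3.5]
```

In this toy case the most likely token moves from id 0 to id 1, which is the kind of visible change in output one would expect from values as large as those in the reproduction.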
Actual behavior
The output does not change regardless of the values of repetition_penalty, frequency_penalty, or presence_penalty.
Setting extreme values (e.g., repetition_penalty=100, frequency_penalty=2) does not affect the generated output at all.
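One way to check this claim in a controlled manner is to send the same prompt twice, once with neutral penalties and once with extreme ones, and compare the text. The sketch below is an assumption-laden helper, not part of the original report: `make_variants`, `penalties_have_effect`, and the `send` callable are hypothetical names, and it assumes the server treats `temperature=0` as greedy decoding so that any difference comes from the penalties rather than sampling noise.

```python
import copy

# Neutral values should leave logits untouched; extreme values mirror the report.
NEUTRAL = {"frequency_penalty": 0.0, "presence_penalty": 0.0, "repetition_penalty": 1.0}
EXTREME = {"frequency_penalty": 2.0, "presence_penalty": 2.0, "repetition_penalty": 100.0}


def make_variants(base_payload):
    """Build a neutral and an extreme-penalty payload from one base payload."""
    variants = {}
    for name, penalties in (("neutral", NEUTRAL), ("extreme", EXTREME)):
        payload = copy.deepcopy(base_payload)  # leave the caller's dict untouched
        payload.update(penalties)
        # Assumed to select greedy decoding on the server.
        payload["temperature"] = 0.0
        variants[name] = payload
    return variants


def penalties_have_effect(send, base_payload):
    """Return (changed, texts); `send` maps a payload dict to generated text."""
    variants = make_variants(base_payload)
    texts = {name: send(payload) for name, payload in variants.items()}
    return texts["neutral"] != texts["extreme"], texts
```

With the requests-based reproduction above, `send` could be a small wrapper that posts the payload to `/v1/completions` and returns `resp.json()["choices"][0]["text"]`. A result of `changed == False` on a prompt that naturally repeats tokens would support the behavior described here.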
Additional notes
Nothing
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.