Description
What would you like to be added:
When generating tokens, respect the ignore_eos parameter.
Why is this needed:
llm-d-inference-sim/pkg/common/utils.go, line 176 at commit 639b40e:

maxTokens := int(*maxCompletionTokens)
Currently, when generating tokens, max_tokens (or max_completion_tokens) is used to cap the length of the generated output, but the ignore_eos parameter is not taken into account.
When evaluating model performance on production request datasets, it's crucial to force models to generate a precise number of tokens for a fair comparison. Setting ignore_eos to true ensures that the output will be exactly max_tokens (or max_completion_tokens) long, mimicking the behavior of real-world inference services more accurately.
Therefore, we need to modify the token generation logic to properly handle the ignore_eos parameter. This will allow for consistent and reproducible evaluations by ensuring the number of output tokens is equal to the specified maximum length when ignore_eos is true.
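A minimal sketch of the intended stop condition, assuming a simplified model of the simulator's generation loop: the function name shouldStop and its signature are hypothetical, not the repo's actual API, but they illustrate how ignore_eos would interact with the max_tokens cap.

```go
package main

import "fmt"

// shouldStop reports whether token generation should end.
// Hypothetical sketch: generation stops at the max_tokens /
// max_completion_tokens cap, or at an EOS token — unless ignore_eos
// is set, in which case EOS is ignored and generation always runs
// to exactly maxTokens tokens.
func shouldStop(generated, maxTokens int, sawEOS, ignoreEOS bool) bool {
	if generated >= maxTokens {
		return true // hard cap from max_tokens / max_completion_tokens
	}
	if sawEOS && !ignoreEOS {
		return true // natural stop at EOS only when ignore_eos is false
	}
	return false
}

func main() {
	// With ignore_eos=true, an EOS at token 5 of 10 does not stop generation.
	fmt.Println(shouldStop(5, 10, true, true)) // false
	// Without ignore_eos, the same EOS stops generation early.
	fmt.Println(shouldStop(5, 10, true, false)) // true
	// The maximum-length cap always applies, even with ignore_eos.
	fmt.Println(shouldStop(10, 10, false, true)) // true
}
```

With this check in place, setting ignore_eos to true makes the output length deterministic (exactly maxTokens tokens), which is what reproducible benchmarking requires.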