[Bug]: MS Phi-4 Multimodal requires stop_words_list handling to prevent unnecessary token generation #7306

### System Info

Device: Jetson Orin AGX (64GB)

OS / SDK: JetPack 6.2.1, Ubuntu 22.04

TensorRT-LLM: v0.21

Model: MS Phi-4 Multimodal

### Who can help?

_No response_

### Information

- [x] The official example scripts
- [ ] My own modified scripts

### Tasks

- [x] An officially supported task in the `examples` folder (such as GLUE/SQuAD, ...)
- [ ] My own task or dataset (give details below)

### Reproduction

Run the Phi-4-multimodal example with the `--run_profiling` option.

### Expected behavior

The model generates only the response to my prompt and then stops.

### actual behavior

Regardless of the input prompt, the model always generates exactly `max_new_tokens` tokens.
After the expected response, irregular text is appended; in the example below, the same response is repeated until the token limit is reached.
Below is one example.
'The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. The overall scene captures the beauty and modernity of Singapore\'s skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. The overall scene captures the beauty and modernity of Singapore\'s skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. 
The overall scene captures the beauty and modernity of Singapore\'s skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. The overall scene captures the beauty and modernity of Singapore\'s skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye'

### additional notes

Root Cause

In MS Phi-4 Multimodal, when generation should stop, the model emits the `<|USER|>` token (ID 200020) instead of the standard EOS token (`eos_id=199999`).

The `MultimodalModelRunner` class currently does not treat this token as a stop condition, so generation continues until `max_new_tokens` is reached and unnecessary tokens are produced.
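The effect can be illustrated with a toy decoding loop. This is not TensorRT-LLM code: `decode_until_stop` and the sample token stream are hypothetical, and only the two token IDs come from this report. It shows why checking `eos_id` alone runs to the token limit when the model emits 200020 instead.

```python
from typing import Iterable, List, Sequence

EOS_ID = 199999   # standard EOS token ID, as reported above
STOP_ID = 200020  # token the model actually emits, as reported above


def decode_until_stop(stream: Iterable[int], max_new_tokens: int,
                      stop_ids: Sequence[int]) -> List[int]:
    """Collect tokens until a stop ID appears or max_new_tokens is reached."""
    out: List[int] = []
    for token in stream:
        if len(out) >= max_new_tokens or token in stop_ids:
            break
        out.append(token)
    return out


# Hypothetical stream: a 3-token answer, the stop token, then the answer again.
stream = [11, 12, 13, STOP_ID, 11, 12, 13, STOP_ID, 11, 12]

# Checking only EOS never matches, so the loop runs to max_new_tokens.
eos_only = decode_until_stop(stream, max_new_tokens=8, stop_ids=[EOS_ID])

# Adding 200020 as a stop ID ends generation after the first answer.
with_stop = decode_until_stop(stream, max_new_tokens=8,
                              stop_ids=[EOS_ID, STOP_ID])
```

With these inputs, `eos_only` holds 8 tokens (the limit), while `with_stop` holds only the 3-token answer.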

Proposed Fix

In the `MultimodalModelRunner` class, when the model is MS Phi-4 Multimodal, pass `stop_words_list=[[[200020]]]` to the `self.model.generate` call so that generation stops as soon as `<|USER|>` is produced.
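A minimal sketch of how the nested argument could be built for an arbitrary batch size. `build_stop_words_list` is a hypothetical helper, and the layout it assumes (one list of stop sequences per batch entry, each sequence a list of token IDs) matches the `[[[200020]]]` value above for a batch of one; some runner paths may expect a different, pre-converted format, so this should be checked against the runner in use.

```python
from typing import List, Sequence

PHI4_MM_STOP_ID = 200020  # stop token ID reported in this issue


def build_stop_words_list(stop_sequences: Sequence[Sequence[int]],
                          batch_size: int) -> List[List[List[int]]]:
    """Repeat the same stop sequences for every request in the batch.

    Assumed layout: [batch entry][stop sequence][token ID].
    """
    return [[list(seq) for seq in stop_sequences] for _ in range(batch_size)]


stop_words_list = build_stop_words_list([[PHI4_MM_STOP_ID]], batch_size=1)

# Assumed call site inside MultimodalModelRunner (other arguments unchanged):
#   self.model.generate(..., stop_words_list=stop_words_list)
```

Building the list from the batch size avoids hard-coding a single-request shape if the example is run with more than one prompt.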

### Before submitting a new issue...

- [x] Make sure you already searched for relevant issues, and checked the [documentation](https://nvidia.github.io/TensorRT-LLM/) and [examples](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples) for answers to frequently asked questions.
