[Bug]: MS Phi-4 Multimodal requires stop_words_list handling to prevent unnecessary token generation #7306

@moku

System Info

Device: Jetson Orin AGX (64GB)

OS / SDK: JetPack 6.2.1, Ubuntu 22.04

TensorRT-LLM: v0.21

Model: MS Phi-4 Multimodal

Who can help?

No response

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Run the Phi-4-multimodal example with the --run_profiling option.

Expected behavior

The model generates only the response to my prompt and then stops.

Actual behavior

Regardless of the input prompt, the model always generates exactly max_new_tokens tokens.
After the expected response, irregular text is appended; in the example below, the same answer repeats until the token limit cuts it off mid-sentence.
Below is one example:
'The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. The overall scene captures the beauty and modernity of Singapore's skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. The overall scene captures the beauty and modernity of Singapore's skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. 
The overall scene captures the beauty and modernity of Singapore's skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye of the Storm" and the "Infinity Pool," illuminated against a backdrop of a beautiful sunset. In the foreground, there is a majestic stone sculpture of the Chinese dragon, known as the "Lion Dance," which is a symbol of good luck and prosperity. The image also features a tranquil water scene with a fountain, creating a serene and picturesque atmosphere. The overall scene captures the beauty and modernity of Singapore's skyline at dusk.The image shows a stunning view of the Marina Bay Sands in Singapore, with the iconic Marina Bay Sands hotel and its two iconic towers, known as the "Eye'

Additional notes

Root Cause

In MS Phi-4 Multimodal, when the generation should stop, the model returns the <|USER|> token (200020) instead of the standard EOS token (eos_id=199999).

Currently, the MultimodalModelRunner class does not handle this case, so generation continues until max_new_tokens is reached, causing unnecessary tokens to be produced.
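The effect can be illustrated with a toy decode loop. This is only a sketch and not TensorRT-LLM code: `toy_generate` and the token stream are hypothetical, and only the two token ids come from this report. It shows that a loop which checks only the EOS id runs to the token limit, while one that also checks a stop id ends after the answer.

```python
EOS_ID = 199999         # standard EOS id, as reported in this issue
USER_TOKEN_ID = 200020  # <|USER|> id, as reported in this issue


def toy_generate(next_tokens, max_new_tokens, end_id, stop_ids=()):
    """Toy decode loop (hypothetical, for illustration only)."""
    output = []
    for token in next_tokens:
        if len(output) >= max_new_tokens:
            break  # token budget exhausted
        if token == end_id or token in stop_ids:
            break  # a recognized terminator ends generation
        output.append(token)
    return output


# Hypothetical stream: a 3-token answer, then <|USER|>, then the answer again.
stream = [11, 12, 13, USER_TOKEN_ID, 11, 12, 13, USER_TOKEN_ID, 11, 12]

# Only EOS is checked: <|USER|> is treated as ordinary output and the loop
# keeps going until max_new_tokens is reached.
without_stop = toy_generate(stream, max_new_tokens=8, end_id=EOS_ID)

# <|USER|> is also treated as a terminator: only the answer is returned.
with_stop = toy_generate(stream, max_new_tokens=8, end_id=EOS_ID,
                         stop_ids={USER_TOKEN_ID})
```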

Proposed Fix

In the MultimodalModelRunner class, when using MS Phi-4 Multimodal, add the argument stop_words_list=[[[200020]]] to the self.model.generate call so generation stops correctly when <|USER|> is encountered.
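A minimal sketch of how the extra argument could be selected per model. The helper name `extra_generate_kwargs` and the `"phi-4-multimodal"` model type string are assumptions made for illustration; only the `stop_words_list=[[[200020]]]` value is taken from the proposal above, and how the runner actually identifies the model may differ.

```python
PHI4_MM_USER_TOKEN_ID = 200020  # <|USER|> id, as reported in this issue


def extra_generate_kwargs(model_type):
    """Return extra keyword arguments for the generate call.

    Hypothetical helper: the model type string below is an assumption,
    not a confirmed identifier from the runner.
    """
    if model_type == "phi-4-multimodal":
        # Same nesting as in the proposed fix.
        return {"stop_words_list": [[[PHI4_MM_USER_TOKEN_ID]]]}
    return {}


# Intended use inside the runner (not executed here):
#   self.model.generate(..., **extra_generate_kwargs(model_type))
phi4_kwargs = extra_generate_kwargs("phi-4-multimodal")
other_kwargs = extra_generate_kwargs("some-other-model")
```

Returning an empty dict for other models leaves their generate calls unchanged, so the special case stays confined to Phi-4 Multimodal.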

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.

Metadata

Labels

  • Inference runtime<NV>: General operational aspects of TRTLLM execution not in other categories.
  • Investigating
  • Multimodal: Label for issues & PRs regarding Multimodal related objects
  • bug: Something isn't working
  • triaged: Issue has been triaged by maintainers
  • waiting for feedback
