Description
System Info
System Information:
- OS: Ubuntu
- Python version: 3.12
- CUDA version: 12.8
- GPU model(s): 4090
- Driver version: 570.124.04
- TensorRT-LLM version: 1.1.0rc1
Detailed output:

```
Python 3.12.3
Name: tensorrt_llm
Version: 1.1.0rc1
Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
Home-page: https://github.com/NVIDIA/TensorRT-LLM
Author: NVIDIA Corporation
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, aenum, backoff, blake3, blobfile, build, click, click_option_group, colored, cuda-python, datasets, diffusers, einops, etcd3, evaluate, fastapi, flashinfer-python, h5py, jsonschema, lark, llguidance, matplotlib, meson, mpi4py, mpmath, ninja, numpy, nvidia-cuda-nvrtc-cu12, nvidia-ml-py, nvidia-modelopt, nvidia-nccl-cu12, nvtx, omegaconf, onnx, onnx_graphsurgeon, openai, opencv-python-headless, optimum, ordered-set, pandas, peft, pillow, polygraphy, prometheus_client, prometheus_fastapi_instrumentator, protobuf, psutil, pulp, pydantic, pydantic-settings, pynvml, pyzmq, sentencepiece, setuptools, soundfile, StrEnum, tensorrt, tiktoken, torch, torchvision, transformers, triton, uvicorn, wheel, xgrammar
```
How would you like to use TensorRT-LLM
Question 1
I want to add multimodal embeddings to Qwen3, so I intended to reuse the existing convention: whenever an input id is >= vocab_size, take the corresponding data from the multimodal embeddings. However, as soon as input_ids contains ids larger than vocab_size, an error occurs somewhere (I have not found the cause; it seems execution never even reaches my forward function).
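For what it's worth, one plausible cause: ids >= vocab_size reaching `embed_tokens` (or a later lookup) may raise an out-of-range index error inside the worker process, which then surfaces only as an opaque error in the proxy thread. The usual pattern is to clamp such ids to an in-vocab placeholder before the embedding lookup, then overwrite those rows with the multimodal embeddings. A pure-Python sketch of the bookkeeping (`VOCAB_SIZE`, the placeholder id, and the assumption that multimodal embeddings arrive in the same order as the out-of-vocab ids are all illustrative; in torch this maps to `index_put_` / `masked_scatter_`):

```python
VOCAB_SIZE = 8  # toy vocab size for illustration; use model_config's real value


def prepare_ids(input_ids, vocab_size=VOCAB_SIZE, placeholder=0):
    """Return (safe_ids, mm_positions): ids clamped into vocab range, plus the
    positions whose embeddings must come from the multimodal side.
    Never feed ids >= vocab_size into the embedding table itself."""
    mm_positions = [i for i, t in enumerate(input_ids) if t >= vocab_size]
    safe_ids = [placeholder if t >= vocab_size else t for t in input_ids]
    return safe_ids, mm_positions


def fuse_embeddings(text_embeds, mm_positions, mm_embeds):
    """Overwrite the placeholder rows with the multimodal embeddings,
    assuming mm_embeds is ordered to match mm_positions."""
    fused = list(text_embeds)
    for pos, emb in zip(mm_positions, mm_embeds):
        fused[pos] = emb
    return fused
```

With this split, `embed_tokens` only ever sees in-range ids, and the fused result replaces `inputs_embeds` before the decoder layers run.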
```
[08/26/2025-14:10:33] [TRT-LLM] [E] Error in thread proxy_dispatch_result_thread: 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/utils.py", line 268, in run
    if not task(**self.kwargs):
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 191, in dispatch_result_task
    process_res(i)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 172, in process_res
    queue = self._results[client_id].queue
            ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 2
```
It also seems that I cannot perform any operation on input_ids (not even a copy).
```python
class MyQwen3Model(Qwen3Model):

    def __init__(self, model_config: ModelConfig[Qwen3Config]):
        super().__init__(model_config)
        logger.info("MyQwen3Model init..........................")

    def forward(
        self,
        attn_metadata: AttentionMetadata,
        input_ids: Optional[torch.IntTensor] = None,
        position_ids: Optional[torch.IntTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        mrope_config: Optional[Tuple[torch.Tensor, int]] = None,
        spec_metadata: Optional[SpecMetadata] = None,
        **kwargs,
    ) -> torch.Tensor:
        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify exactly one"
            )
        logger.info("forward start")
        # TODO: replace inputs_embeds with the multimodal embeddings
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        if 'multimodal_params' in kwargs:
            multimodal_params: List[MultimodalParams] = kwargs['multimodal_params']
            logger.info(f"multimodal_params: {len(multimodal_params)}")
            for mm_param in multimodal_params:
                logger.info(f"mm_param: {mm_param.multimodal_data}")
                mm_param.multimodal_data = None
        residual = None
        for decoder_layer in self.layers:
            hidden_states, residual = decoder_layer(
                position_ids=position_ids,
                hidden_states=hidden_states,
                attn_metadata=attn_metadata,
                residual=residual,
                mrope_config=mrope_config,
                spec_metadata=spec_metadata,
            )
        hidden_states, _ = self.norm(hidden_states, residual)
        return hidden_states


@register_auto_model("Qwen3ForCausalLM")
class MyQwen3ForCausalLM(SpecDecOneEngineForCausalLM[MyQwen3Model, Qwen3Config]):

    def __init__(
        self,
        model_config: ModelConfig[Qwen3Config],
    ):
        super().__init__(
            MyQwen3Model(model_config),
            model_config,
        )
```

Question 2
One problem I want to solve: with multimodal inputs, the same token_id may carry different embeddings. But with enable_block_reuse, blocks containing these ids may be cached, so a different embedding can no longer be supplied for that token_id. In our scenario, multimodal tokens are few and do not appear at very early positions.
Is there a way to keep enable_block_reuse while preventing certain token_ids (and all following input_ids) from being cached?
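As far as I can tell there is no per-token opt-out, but since reuse operates on whole kv-cache blocks, one mitigation is to treat only the blocks that end strictly before the first multimodal token as safely reusable on a pure token-id match. A toy sketch of that boundary computation (the `tokens_per_block` value and the function itself are illustrative assumptions, not TensorRT-LLM API):

```python
TOKENS_PER_BLOCK = 32  # assumption: kv-cache block size (tokens_per_block)


def reusable_prefix_blocks(mm_positions, seq_len, tokens_per_block=TOKENS_PER_BLOCK):
    """Number of full blocks strictly before the first multimodal token.
    Only these blocks are safe to reuse when the cache key is token ids alone;
    blocks at or after the first multimodal token may hold content-dependent
    embeddings for the same token_id."""
    if not mm_positions:
        return seq_len // tokens_per_block  # no multimodal tokens: all full blocks reusable
    return min(mm_positions) // tokens_per_block
```

Since your multimodal tokens are few and not at very early positions, most of the prompt prefix would still be reusable under this scheme.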
Question 3
In _prepare_tp_inputs of PyTorchModelEngine, it looks like the multimodal_params_list can end up misaligned with the requests, which makes it hard to tell in the model's forward which multimodal entry belongs to which request. (You can distinguish prefill from decode by whether the input_ids length is 1, but not every prefill request carries multimodal data.)
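In the loop quoted below, an entry appears to be appended to multimodal_params_list only when has_content() is true, so the list does not line up positionally with context_requests. A defensive workaround in a custom model is to key multimodal data by request id rather than by list position; a toy sketch (`FakeRequest` is a stand-in for the real request object, using the same attribute names):

```python
class FakeRequest:
    """Minimal stand-in for a scheduled request, for illustration only."""

    def __init__(self, request_id, mm_data=None):
        self.py_request_id = request_id
        self.py_multimodal_data = mm_data


def build_mm_index(context_requests):
    """Map request_id -> multimodal data, skipping requests without any.
    Looking entries up by id avoids relying on positional alignment, which
    breaks whenever some context requests carry no multimodal content."""
    return {
        r.py_request_id: r.py_multimodal_data
        for r in context_requests
        if r.py_multimodal_data is not None
    }
```

Inside forward, each multimodal entry can then be resolved by its request id instead of by its index in the list.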
```python
for request in scheduled_requests.context_requests:
    request_ids.append(request.py_request_id)
    all_prompt_tokens = request.get_tokens(0)
    draft_lens.append(0)
    begin_compute = request.context_current_position
    end_compute = begin_compute + request.context_chunk_size
    prompt_tokens = all_prompt_tokens[begin_compute:end_compute]
    position_ids.extend(
        range(begin_compute, begin_compute + len(prompt_tokens)))
    input_ids.extend(prompt_tokens)
    gather_ids.append(len(input_ids) - 1)
    sequence_lengths.append(len(prompt_tokens))
    prompt_lengths.append(len(prompt_tokens))
    past_seen_token_num = begin_compute
    num_cached_tokens_per_seq.append(past_seen_token_num)

    # Multimodal
    # TODO: enable chunk prefill for multimodal (maybe need to pass prompt_tokens to MultimodalRuntimeData)
    py_multimodal_runtime = MultimodalRuntimeData(
        mm_token_lengths=request.multimodal_lengths,
        mm_token_positions=request.multimodal_positions,
        num_cached_tokens=past_seen_token_num
    ) if request.multimodal_hashes is not None else None

    multimodal_params = MultimodalParams(
        multimodal_data=request.py_multimodal_data,
        multimodal_runtime=py_multimodal_runtime)

    if multimodal_params.has_content():
        multimodal_params.to_device("multimodal_data",
                                    "cuda",
                                    pin_memory=True)
        # re-assign the multimodal_data to the request after to_device for generation requests
        request.py_multimodal_data = multimodal_params.multimodal_data
        multimodal_params_list.append(multimodal_params)
```

Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.