Description
System Info
System Information:
- OS: Ubuntu
- Python version: 3.12
- CUDA version: 12.8
- GPU model(s): 4090
- Driver version: 570.124.04
- TensorRT-LLM version: 1.1.0rc1
Detailed output:

```
Python 3.12.3
Name: tensorrt_llm
Version: 1.1.0rc1
Summary: TensorRT-LLM: A TensorRT Toolbox for Large Language Models
Home-page: https://github.com/NVIDIA/TensorRT-LLM
Author: NVIDIA Corporation
Author-email:
License: Apache License 2.0
Location: /usr/local/lib/python3.12/dist-packages
Requires: accelerate, aenum, backoff, blake3, blobfile, build, click, click_option_group, colored, cuda-python, datasets, diffusers, einops, etcd3, evaluate, fastapi, flashinfer-python, h5py, jsonschema, lark, llguidance, matplotlib, meson, mpi4py, mpmath, ninja, numpy, nvidia-cuda-nvrtc-cu12, nvidia-ml-py, nvidia-modelopt, nvidia-nccl-cu12, nvtx, omegaconf, onnx, onnx_graphsurgeon, openai, opencv-python-headless, optimum, ordered-set, pandas, peft, pillow, polygraphy, prometheus_client, prometheus_fastapi_instrumentator, protobuf, psutil, pulp, pydantic, pydantic-settings, pynvml, pyzmq, sentencepiece, setuptools, soundfile, StrEnum, tensorrt, tiktoken, torch, torchvision, transformers, triton, uvicorn, wheel, xgrammar
```
How would you like to use TensorRT-LLM
Question 1
I want to add multimodal embeddings to Qwen3, so I intended to reuse the existing convention: whenever an input id is >= vocab_size, take the corresponding data from the multimodal embeddings. However, as soon as input_ids contains ids larger than vocab_size, an error occurs somewhere (I have not found the cause; it seems execution never even reaches my forward function).
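For what it's worth, one plausible cause: ids >= vocab_size reaching `embed_tokens` (or a later lookup) may raise an out-of-range index error inside the worker process, which then surfaces only as an opaque error in the proxy thread. The usual pattern is to clamp such ids to an in-vocab placeholder before the embedding lookup, then overwrite those rows with the multimodal embeddings. A pure-Python sketch of the bookkeeping (`VOCAB_SIZE`, the placeholder id, and the assumption that multimodal embeddings arrive in the same order as the out-of-vocab ids are all illustrative; in torch this maps to `index_put_` / `masked_scatter_`):

```python
VOCAB_SIZE = 8  # toy vocab size for illustration; use model_config's real value


def prepare_ids(input_ids, vocab_size=VOCAB_SIZE, placeholder=0):
    """Return (safe_ids, mm_positions): ids clamped into vocab range, plus the
    positions whose embeddings must come from the multimodal side.
    Never feed ids >= vocab_size into the embedding table itself."""
    mm_positions = [i for i, t in enumerate(input_ids) if t >= vocab_size]
    safe_ids = [placeholder if t >= vocab_size else t for t in input_ids]
    return safe_ids, mm_positions


def fuse_embeddings(text_embeds, mm_positions, mm_embeds):
    """Overwrite the placeholder rows with the multimodal embeddings,
    assuming mm_embeds is ordered to match mm_positions."""
    fused = list(text_embeds)
    for pos, emb in zip(mm_positions, mm_embeds):
        fused[pos] = emb
    return fused
```

With this split, `embed_tokens` only ever sees in-range ids, and the fused result replaces `inputs_embeds` before the decoder layers run.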
```
[08/26/2025-14:10:33] [TRT-LLM] [E] Error in thread proxy_dispatch_result_thread: 2
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/llmapi/utils.py", line 268, in run
    if not task(**self.kwargs):
           ^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 191, in dispatch_result_task
    process_res(i)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/proxy.py", line 172, in process_res
    queue = self._results[client_id].queue
            ~~~~~~~~~~~~~^^^^^^^^^^^
KeyError: 2
```
It also seems that I cannot perform any operation on input_ids (not even a copy).
```python
class MyQwen3Model(Qwen3Model):

    def __init__(self, model_config: ModelConfig[Qwen3Config]):
        super().__init__(model_config)
        logger.info("MyQwen3Model init..........................")

    def forward(
        self,
        attn_metadata: AttentionMetadata,
        input_ids: Optional[torch.IntTensor] = None,
        position_ids: Optional[torch.IntTensor] = None,
        inputs_embeds: Optional[torch.FloatTensor] = None,
        mrope_config: Optional[Tuple[torch.Tensor, int]] = None,
        spec_metadata: Optional[SpecMetadata] = None,
        **kwargs,
    ) -> torch.Tensor:
        if (input_ids is None) ^ (inputs_embeds is not None):
            raise ValueError(
                "You cannot specify both input_ids and inputs_embeds at the same time, and must specify exactly one"
            )
        logger.info("forward start")
        # TODO: replace inputs_embeds with the multimodal embeddings
        if inputs_embeds is None:
            inputs_embeds = self.embed_tokens(input_ids)
        hidden_states = inputs_embeds
        if 'multimodal_params' in kwargs:
            multimodal_params: List[MultimodalParams] = kwargs['multimodal_params']
            logger.info(f"multimodal_params: {len(multimodal_params)}")
            for mm_param in multimodal_params:
                logger.info(f"mm_param: {mm_param.multimodal_data}")
                mm_param.multimodal_data = None
        residual = None
        for decoder_layer in self.layers:
            hidden_states, residual = decoder_layer(
                position_ids=position_ids,
                hidden_states=hidden_states,
                attn_metadata=attn_metadata,
                residual=residual,
                mrope_config=mrope_config,
                spec_metadata=spec_metadata,
            )
        hidden_states, _ = self.norm(hidden_states, residual)
        return hidden_states


@register_auto_model("Qwen3ForCausalLM")
class MyQwen3ForCausalLM(SpecDecOneEngineForCausalLM[MyQwen3Model, Qwen3Config]):

    def __init__(
        self,
        model_config: ModelConfig[Qwen3Config],
    ):
        super().__init__(
            MyQwen3Model(model_config),
            model_config,
        )
```

Question 2
One problem I want to solve: with multimodal inputs, the same token_id may carry different embeddings. But with enable_block_reuse, blocks containing these ids may be cached, so a different embedding can no longer be supplied for that token_id. In our scenario, multimodal tokens are few and do not appear at very early positions.
Is there a way to keep enable_block_reuse while preventing certain token_ids (and all following input_ids) from being cached?
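As far as I can tell there is no per-token opt-out, but since reuse operates on whole kv-cache blocks, one mitigation is to treat only the blocks that end strictly before the first multimodal token as safely reusable on a pure token-id match. A toy sketch of that boundary computation (the `tokens_per_block` value and the function itself are illustrative assumptions, not TensorRT-LLM API):

```python
TOKENS_PER_BLOCK = 32  # assumption: kv-cache block size (tokens_per_block)


def reusable_prefix_blocks(mm_positions, seq_len, tokens_per_block=TOKENS_PER_BLOCK):
    """Number of full blocks strictly before the first multimodal token.
    Only these blocks are safe to reuse when the cache key is token ids alone;
    blocks at or after the first multimodal token may hold content-dependent
    embeddings for the same token_id."""
    if not mm_positions:
        return seq_len // tokens_per_block  # no multimodal tokens: all full blocks reusable
    return min(mm_positions) // tokens_per_block
```

Since your multimodal tokens are few and not at very early positions, most of the prompt prefix would still be reusable under this scheme.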
Question 3
In _prepare_tp_inputs of PyTorchModelEngine, it looks like the multimodal_params_list can end up misaligned with the requests, which makes it hard to tell in the model's forward which multimodal entry belongs to which request. (You can distinguish prefill from decode by whether the input_ids length is 1, but not every prefill request carries multimodal data.)
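In the loop quoted below, an entry appears to be appended to multimodal_params_list only when has_content() is true, so the list does not line up positionally with context_requests. A defensive workaround in a custom model is to key multimodal data by request id rather than by list position; a toy sketch (`FakeRequest` is a stand-in for the real request object, using the same attribute names):

```python
class FakeRequest:
    """Minimal stand-in for a scheduled request, for illustration only."""

    def __init__(self, request_id, mm_data=None):
        self.py_request_id = request_id
        self.py_multimodal_data = mm_data


def build_mm_index(context_requests):
    """Map request_id -> multimodal data, skipping requests without any.
    Looking entries up by id avoids relying on positional alignment, which
    breaks whenever some context requests carry no multimodal content."""
    return {
        r.py_request_id: r.py_multimodal_data
        for r in context_requests
        if r.py_multimodal_data is not None
    }
```

Inside forward, each multimodal entry can then be resolved by its request id instead of by its index in the list.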
```python
for request in scheduled_requests.context_requests:
    request_ids.append(request.py_request_id)
    all_prompt_tokens = request.get_tokens(0)
    draft_lens.append(0)
    begin_compute = request.context_current_position
    end_compute = begin_compute + request.context_chunk_size
    prompt_tokens = all_prompt_tokens[begin_compute:end_compute]
    position_ids.extend(
        range(begin_compute, begin_compute + len(prompt_tokens)))
    input_ids.extend(prompt_tokens)
    gather_ids.append(len(input_ids) - 1)
    sequence_lengths.append(len(prompt_tokens))
    prompt_lengths.append(len(prompt_tokens))
    past_seen_token_num = begin_compute
    num_cached_tokens_per_seq.append(past_seen_token_num)

    # Multimodal
    # TODO: enable chunk prefill for multimodal (maybe need to pass prompt_tokens to MultimodalRuntimeData)
    py_multimodal_runtime = MultimodalRuntimeData(
        mm_token_lengths=request.multimodal_lengths,
        mm_token_positions=request.multimodal_positions,
        num_cached_tokens=past_seen_token_num
    ) if request.multimodal_hashes is not None else None

    multimodal_params = MultimodalParams(
        multimodal_data=request.py_multimodal_data,
        multimodal_runtime=py_multimodal_runtime)

    if multimodal_params.has_content():
        multimodal_params.to_device("multimodal_data",
                                    "cuda",
                                    pin_memory=True)
        # re-assign the multimodal_data to the request after to_device for generation requests
        request.py_multimodal_data = multimodal_params.multimodal_data
        multimodal_params_list.append(multimodal_params)
```

Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.