Closed as not planned
Labels: Module:Runtime (Other generic runtime issues that do not fall into other modules)
Description
TensorRT-LLM fails during weight loading for the GLM 4.6 model when MTP speculative decoding is enabled. The error occurs in load_weights_fused_qkv_helper with an assertion failure:
AssertionError: assert all('weight' in weights[i] for i in range(3))
Weight loading reaches 99% (2700/2737 tensors) before failing.
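For context, the failing assertion requires that each of the three per-projection weight dicts (q, k, v) passed to the helper contains a 'weight' entry. The following is a minimal sketch of that condition (a hypothetical simplification, not the actual TensorRT-LLM source), showing how a shard that carries only auxiliary tensors (e.g. a scale without the weight itself) trips it:

```python
# Minimal sketch of the failing check, modeled on the assertion in the
# traceback (tensorrt_llm/_torch/modules/linear.py). Simplified; the real
# helper does more than this.
def check_fused_qkv(weights):
    # weights is a list of three dicts: [q_weights, k_weights, v_weights].
    assert all('weight' in weights[i] for i in range(3)), \
        "each of q/k/v must provide a 'weight' tensor"

# A well-formed input passes: every projection has a 'weight' key.
check_fused_qkv([{'weight': 'q'}, {'weight': 'k'}, {'weight': 'v'}])

# A shard that ships only a scale tensor for one projection (hypothetical
# key name) fails with the same AssertionError seen in the traceback.
try:
    check_fused_qkv([{'weight': 'q'}, {'weight_scale': 'k'}, {'weight': 'v'}])
except AssertionError as e:
    print('AssertionError:', e)
```

This suggests the MTP (num_nextn_predict_layers) weights in the checkpoint are missing plain 'weight' tensors for at least one of the fused q/k/v projections.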
Environment
TensorRT Version: TensorRT-LLM 1.2.0rc5 (nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5)
NVIDIA GPU: NVIDIA B200 (8x GPUs)
NVIDIA Driver Version: 580.95.05
CUDA Version: 12.8 (V12.8.61)
CUDNN Version:
Operating System: Linux
Python Version: 3.12
TensorFlow Version: N/A
PyTorch Version:
Baremetal or Container: Container (Docker)
Relevant Files
Model link: GLM-based model with MTP configuration. When the speculative_config is removed, the model can be served without errors.
Configuration:
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
use_relaxed_acceptance_for_thinking: true
# Other config
arguments_as_json: true
backend: pytorch
cuda_graph_config:
enable_padding: false
enable_chunked_prefill: true
enable_iter_perf_stats: true
guided_decoding_backend: xgrammar
kv_cache_config:
dtype: fp8
enable_block_reuse: true
enable_partial_reuse: false
event_buffer_max_size: 16384
free_gpu_memory_fraction: 0.8
host_cache_size: 100000000000
max_batch_size: 24
max_beam_width: 1
max_input_len: 202752
max_num_tokens: 8192
max_seq_len: 202752
model_level_stop_words:
- <|user|>
- <|observation|>
model_name: baseten-admin/glm-4.6-fp4-mlp
moe_expert_parallel_size: 4
served_model_name: zai-org/GLM-4.6
tensor_parallel_size: 8
tokenizer_limit_length: 202752
tool_call_parser: glm45
Steps To Reproduce
Commands or scripts:
- Configure the GLM model with MTP speculative decoding using the config above
- Initialize the executor with the distributed setup (TP=8, EP=4)
- Weight loading fails at 99% completion
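To narrow down which layer is affected before launching the server, one can scan the checkpoint's tensor names and flag layers where a q/k/v projection has no plain '.weight' tensor. This is a hypothetical diagnostic (the function name, example tensor names, and the assumption that checkpoint keys are available as a flat list of strings are all illustrative, not part of TensorRT-LLM):

```python
# Hypothetical diagnostic: given flat tensor names from a checkpoint
# (e.g. collected via safetensors' keys()), group q/k/v projection
# entries per layer and report layers where any projection lacks a
# plain 'weight' tensor -- the condition the failing assertion checks.
from collections import defaultdict

def find_incomplete_qkv(tensor_names):
    # layer -> projection -> set of suffixes seen ('weight', 'weight_scale', ...)
    layers = defaultdict(lambda: defaultdict(set))
    for name in tensor_names:
        for proj in ('q_proj', 'k_proj', 'v_proj'):
            marker = f'.{proj}.'
            if marker in name:
                layer, suffix = name.split(marker)
                layers[layer][proj].add(suffix)
    return [layer for layer, projs in layers.items()
            if any('weight' not in projs[p]
                   for p in ('q_proj', 'k_proj', 'v_proj'))]

# Illustrative names: an MTP-style layer shipping only a scale for k_proj.
names = [
    'model.layers.0.self_attn.q_proj.weight',
    'model.layers.0.self_attn.k_proj.weight',
    'model.layers.0.self_attn.v_proj.weight',
    'model.mtp.0.self_attn.q_proj.weight',
    'model.mtp.0.self_attn.k_proj.weight_scale',
    'model.mtp.0.self_attn.v_proj.weight',
]
print(find_incomplete_qkv(names))  # ['model.mtp.0.self_attn']
```

Running a scan like this over the real checkpoint would confirm whether the MTP layers are the ones missing 'weight' tensors.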
Full traceback:
Loading weights: 99%|█████████▊| 2700/2737 [00:21<00:00, 123.12it/s]
[TRT-LLM] [E] Failed to initialize executor:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 365, in worker_main
worker: GenerationExecutorWorker = worker_cls(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 63, in __init__
self.setup_engine()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 257, in setup_engine
self.engine = _create_py_executor(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 228, in _create_py_executor
_executor = create_executor(**args)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 338, in create_py_executor
model_engine = PyTorchModelEngine(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 204, in __init__
self.model, moe_load_balancer = self.model_loader.load(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 275, in load
self._call_load_weights(model.load_weights, weights,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 400, in _call_load_weights
load_method(weights, **kargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_glm.py", line 877, in load_weights
_load_weights_impl(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 935, in _load_weights_impl
load_single_module(name, module)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 907, in load_single_module
module.load_weights(weights=module_weights,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 2283, in load_weights
self.quant_method.load_weights(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 324, in load_weights
self.load_weights_fused_qkv_linear(module, weights, **kargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 409, in load_weights_fused_qkv_linear
q_weight, k_weight, v_weight = load_weights_fused_qkv_helper(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 183, in load_weights_fused_qkv_helper
assert all('weight' in weights[i] for i in range(3))
AssertionError