
MTP weight loading failure in TensorRT-LLM 1.2.0rc5 when running GLM model on NVIDIA B200 GPUs #4667

@Wirick

Description

TensorRT-LLM fails during weight loading for the GLM-4.6 model when MTP speculative decoding is enabled. The error occurs in load_weights_fused_qkv_helper with an assertion failure:

AssertionError: assert all('weight' in weights[i] for i in range(3))

Weight loading reaches 99% (2700/2737) before failing.
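
For context, the check that fails in load_weights_fused_qkv_helper expects one weight dict per projection (q, k, v), each containing a 'weight' tensor, before the three are fused; with MTP enabled, at least one attention projection in the extra nextn-predict layers apparently arrives without that key. A simplified sketch of the check, reconstructed from the traceback below:

def check_fused_qkv(weights):
    # weights: [q_dict, k_dict, v_dict]; each dict normally looks like
    # {'weight': tensor, ...}, possibly with extra quantization scales.
    assert all('weight' in weights[i] for i in range(3))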

Environment

TensorRT Version: TensorRT-LLM 1.2.0rc5 (nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5)
NVIDIA GPU: NVIDIA B200 (8x GPUs)
NVIDIA Driver Version: 580.95.05
CUDA Version: 12.8 (V12.8.61)
CUDNN Version:
Operating System: Linux
Python Version: 3.12
Tensorflow Version: N/A
PyTorch Version:
Baremetal or Container: Container (Docker)

Relevant Files

Model link: GLM-based model with MTP configuration. When the speculative_config is removed, the model can be served without errors.

Configuration:

speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
  use_relaxed_acceptance_for_thinking: true

# Other config
arguments_as_json: true
backend: pytorch
cuda_graph_config:
  enable_padding: false
enable_chunked_prefill: true
enable_iter_perf_stats: true
guided_decoding_backend: xgrammar
kv_cache_config:
  dtype: fp8
  enable_block_reuse: true
  enable_partial_reuse: false
  event_buffer_max_size: 16384
  free_gpu_memory_fraction: 0.8
  host_cache_size: 100000000000
max_batch_size: 24
max_beam_width: 1
max_input_len: 202752
max_num_tokens: 8192
max_seq_len: 202752
model_level_stop_words:
- <|user|>
- <|observation|>
model_name: baseten-admin/glm-4.6-fp4-mlp
moe_expert_parallel_size: 4
served_model_name: zai-org/GLM-4.6
tensor_parallel_size: 8
tokenizer_limit_length: 202752
tool_call_parser: glm45
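
For reference, a rough sketch of how the speculative part of this config maps onto the Python LLM API (assuming tensorrt_llm.llmapi.MTPDecodingConfig and KvCacheConfig; parameter names are an approximation of the YAML above, not the exact serving path used here):

from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig, MTPDecodingConfig

llm = LLM(
    model="baseten-admin/glm-4.6-fp4-mlp",   # or a local checkpoint path
    tensor_parallel_size=8,
    moe_expert_parallel_size=4,
    kv_cache_config=KvCacheConfig(dtype="fp8",
                                  enable_block_reuse=True,
                                  free_gpu_memory_fraction=0.8),
    # The failing piece: MTP speculative decoding with 3 draft layers.
    speculative_config=MTPDecodingConfig(
        num_nextn_predict_layers=3,
        use_relaxed_acceptance_for_thinking=True),
)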

Steps To Reproduce

Commands or scripts:

  1. Configure the GLM model with MTP speculative decoding using the config above (a launch sketch follows these steps).
  2. Initialize the executor with the distributed setup.
  3. Weight loading fails at 99% completion.

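A launch along these lines reproduces the setup (hedged sketch: flag spellings follow trtllm-serve, and extra_config.yaml is a hypothetical file holding the YAML above):

trtllm-serve baseten-admin/glm-4.6-fp4-mlp \
  --backend pytorch \
  --tp_size 8 \
  --ep_size 4 \
  --extra_llm_api_options extra_config.yaml
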
Full traceback:

Loading weights:  99%|█████████▊| 2700/2737 [00:21<00:00, 123.12it/s]
[TRT-LLM] [E] Failed to initialize executor:
Traceback (most recent call last):
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 365, in worker_main
    worker: GenerationExecutorWorker = worker_cls(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 63, in __init__
    self.setup_engine()
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 257, in setup_engine
    self.engine = _create_py_executor(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 228, in _create_py_executor
    _executor = create_executor(**args)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 338, in create_py_executor
    model_engine = PyTorchModelEngine(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 204, in __init__
    self.model, moe_load_balancer = self.model_loader.load(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 275, in load
    self._call_load_weights(model.load_weights, weights,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 400, in _call_load_weights
    load_method(weights, **kargs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_glm.py", line 877, in load_weights
    _load_weights_impl(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 935, in _load_weights_impl
    load_single_module(name, module)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 907, in load_single_module
    module.load_weights(weights=module_weights,
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 2283, in load_weights
    self.quant_method.load_weights(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 324, in load_weights
    self.load_weights_fused_qkv_linear(module, weights, **kargs)
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 409, in load_weights_fused_qkv_linear
    q_weight, k_weight, v_weight = load_weights_fused_qkv_helper(
  File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 183, in load_weights_fused_qkv_helper
    assert all('weight' in weights[i] for i in range(3))
AssertionError
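
One way to narrow this down is to list the checkpoint keys for the MTP (nextn-predict) layers and compare them with a regular decoder layer's attention projections; if those layers ship only quantization scales, or name their q/k/v projections differently, the fused-QKV loader would trip exactly this assertion. A hedged sketch using safetensors (the key patterns are guesses and should be adjusted to the actual checkpoint layout):

import glob
from safetensors import safe_open

ckpt_dir = "/models/glm-4.6-fp4-mlp"  # hypothetical local checkpoint directory

for shard in sorted(glob.glob(f"{ckpt_dir}/*.safetensors")):
    with safe_open(shard, framework="pt") as f:
        for key in f.keys():
            # Show attention-projection keys for the speculative (MTP) layers
            # plus one regular layer, to compare naming and dtypes.
            if ("mtp" in key or "nextn" in key or ".layers.0." in key) and \
               any(p in key for p in ("q_proj", "k_proj", "v_proj", "qkv")):
                print(shard.split("/")[-1], key, f.get_slice(key).get_dtype())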

Metadata

Labels: Module:Runtime (other generic runtime issues that do not fall into other modules)