Closed as not planned
Labels: Module:Runtime (Other generic runtime issues that do not fall into other modules)
Description
TensorRT-LLM fails during weight loading for the GLM 4.6 model when MTP speculative decoding is enabled. The error occurs in load_weights_fused_qkv_helper with an assertion failure:
AssertionError: assert all('weight' in weights[i] for i in range(3))
Weight loading reaches 99% (2700/2737 tensors) before failing.
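For context, the failing assertion requires that each of the three per-projection weight dicts (q, k, v) passed to the helper contains a 'weight' entry. The following is a minimal sketch of that condition (a hypothetical simplification, not the actual TensorRT-LLM source), showing how a shard that carries only auxiliary tensors (e.g. a scale without the weight itself) trips it:

```python
# Minimal sketch of the failing check, modeled on the assertion in the
# traceback (tensorrt_llm/_torch/modules/linear.py). Simplified; the real
# helper does more than this.
def check_fused_qkv(weights):
    # weights is a list of three dicts: [q_weights, k_weights, v_weights].
    assert all('weight' in weights[i] for i in range(3)), \
        "each of q/k/v must provide a 'weight' tensor"

# A well-formed input passes: every projection has a 'weight' key.
check_fused_qkv([{'weight': 'q'}, {'weight': 'k'}, {'weight': 'v'}])

# A shard that ships only a scale tensor for one projection (hypothetical
# key name) fails with the same AssertionError seen in the traceback.
try:
    check_fused_qkv([{'weight': 'q'}, {'weight_scale': 'k'}, {'weight': 'v'}])
except AssertionError as e:
    print('AssertionError:', e)
```

This suggests the MTP (num_nextn_predict_layers) weights in the checkpoint are missing plain 'weight' tensors for at least one of the fused q/k/v projections.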
Environment
TensorRT Version: TensorRT-LLM 1.2.0rc5 (nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc5)
NVIDIA GPU: NVIDIA B200 (8x GPUs)
NVIDIA Driver Version: 580.95.05
CUDA Version: 12.8 (V12.8.61)
CUDNN Version:
Operating System: Linux
Python Version: 3.12
TensorFlow Version: N/A
PyTorch Version:
Baremetal or Container: Container (Docker)
Relevant Files
Model link: GLM-based model with MTP configuration. When the speculative_config is removed, the model can be served without errors.
Configuration:
speculative_config:
decoding_type: MTP
num_nextn_predict_layers: 3
use_relaxed_acceptance_for_thinking: true
# Other config
arguments_as_json: true
backend: pytorch
cuda_graph_config:
enable_padding: false
enable_chunked_prefill: true
enable_iter_perf_stats: true
guided_decoding_backend: xgrammar
kv_cache_config:
dtype: fp8
enable_block_reuse: true
enable_partial_reuse: false
event_buffer_max_size: 16384
free_gpu_memory_fraction: 0.8
host_cache_size: 100000000000
max_batch_size: 24
max_beam_width: 1
max_input_len: 202752
max_num_tokens: 8192
max_seq_len: 202752
model_level_stop_words:
- <|user|>
- <|observation|>
model_name: baseten-admin/glm-4.6-fp4-mlp
moe_expert_parallel_size: 4
served_model_name: zai-org/GLM-4.6
tensor_parallel_size: 8
tokenizer_limit_length: 202752
tool_call_parser: glm45
Steps To Reproduce
Commands or scripts:
- Configure the GLM model with MTP speculative decoding using the config above
- Initialize the executor with the distributed setup (TP=8, EP=4)
- Weight loading fails at 99% completion
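To narrow down which layer is affected before launching the server, one can scan the checkpoint's tensor names and flag layers where a q/k/v projection has no plain '.weight' tensor. This is a hypothetical diagnostic (the function name, example tensor names, and the assumption that checkpoint keys are available as a flat list of strings are all illustrative, not part of TensorRT-LLM):

```python
# Hypothetical diagnostic: given flat tensor names from a checkpoint
# (e.g. collected via safetensors' keys()), group q/k/v projection
# entries per layer and report layers where any projection lacks a
# plain 'weight' tensor -- the condition the failing assertion checks.
from collections import defaultdict

def find_incomplete_qkv(tensor_names):
    # layer -> projection -> set of suffixes seen ('weight', 'weight_scale', ...)
    layers = defaultdict(lambda: defaultdict(set))
    for name in tensor_names:
        for proj in ('q_proj', 'k_proj', 'v_proj'):
            marker = f'.{proj}.'
            if marker in name:
                layer, suffix = name.split(marker)
                layers[layer][proj].add(suffix)
    return [layer for layer, projs in layers.items()
            if any('weight' not in projs[p]
                   for p in ('q_proj', 'k_proj', 'v_proj'))]

# Illustrative names: an MTP-style layer shipping only a scale for k_proj.
names = [
    'model.layers.0.self_attn.q_proj.weight',
    'model.layers.0.self_attn.k_proj.weight',
    'model.layers.0.self_attn.v_proj.weight',
    'model.mtp.0.self_attn.q_proj.weight',
    'model.mtp.0.self_attn.k_proj.weight_scale',
    'model.mtp.0.self_attn.v_proj.weight',
]
print(find_incomplete_qkv(names))  # ['model.mtp.0.self_attn']
```

Running a scan like this over the real checkpoint would confirm whether the MTP layers are the ones missing 'weight' tensors.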
Full traceback:
Loading weights: 99%|█████████▊| 2700/2737 [00:21<00:00, 123.12it/s]
[TRT-LLM] [E] Failed to initialize executor:
Traceback (most recent call last):
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 365, in worker_main
worker: GenerationExecutorWorker = worker_cls(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/worker.py", line 63, in __init__
self.setup_engine()
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 257, in setup_engine
self.engine = _create_py_executor(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/executor/base_worker.py", line 228, in _create_py_executor
_executor = create_executor(**args)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/py_executor_creator.py", line 338, in create_py_executor
model_engine = PyTorchModelEngine(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_engine.py", line 204, in __init__
self.model, moe_load_balancer = self.model_loader.load(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 275, in load
self._call_load_weights(model.load_weights, weights,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/pyexecutor/model_loader.py", line 400, in _call_load_weights
load_method(weights, **kargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_glm.py", line 877, in load_weights
_load_weights_impl(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 935, in _load_weights_impl
load_single_module(name, module)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/models/modeling_utils.py", line 907, in load_single_module
module.load_weights(weights=module_weights,
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 2283, in load_weights
self.quant_method.load_weights(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 324, in load_weights
self.load_weights_fused_qkv_linear(module, weights, **kargs)
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 409, in load_weights_fused_qkv_linear
q_weight, k_weight, v_weight = load_weights_fused_qkv_helper(
File "/usr/local/lib/python3.12/dist-packages/tensorrt_llm/_torch/modules/linear.py", line 183, in load_weights_fused_qkv_helper
assert all('weight' in weights[i] for i in range(3))
AssertionError