Description
System Info
When using trtllm-build with --gemm_plugin nvfp4 and managed weights (the default behavior, or explicit), the runtime fails to load the model due to a strict type mismatch between what the engine expects (DataType::kFP4) and what the safetensors container provides (DataType::kINT8 / kUINT8).
Additionally, bypassing the type check reveals that the BufferManager allocates memory from the element count as if each element were 8-bit (1 byte/elem) rather than packed 4-bit (0.5 byte/elem), doubling VRAM usage for the weights.
Reproduction Steps
- Quantize Qwen 2.5/3 to NVFP4 using modelopt.
- Build the engine with managed weights:
trtllm-build --checkpoint_dir ... --gemm_plugin nvfp4 --output_dir ...
(Note: managed weights are generated as .safetensors with I8 or U8 dtype, since safetensors lacks FP4 support.)
- Run trtllm-serve.
Observed Behavior
- Type Mismatch Error:
The runtime crashes immediately in tllmRuntime.cpp:
[TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str()) Weight ... has dtype INT8 but engine expects FP4
- Double Allocation (OOM):
If the type assertion is removed/bypassed, the model loads but consumes ~30GB VRAM for a 30B model (expected ~15GB), indicating the allocator is not handling the packed 4-bit stride correctly for managed weights.
Proposed Fix (Analysis)
The issue is located in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, inside setInputTensorsImpl.
The runtime needs logic to handle NVFP4 packed weights specifically (see the sketch below):
- Accept INT8/UINT8 input tensors when the engine expects FP4.
- Allocate memory using the packed size (0.5 bytes per element), not num_elements * 1 byte.
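A minimal sketch of the check we have in mind is below. isAcceptableWeightDtype and packedFp4SizeInBytes are illustrative names, not existing TensorRT-LLM functions, and we assume the TensorRT headers in use define nvinfer1::DataType::kFP4 (as the engine's error message suggests):

```cpp
// Illustrative sketch only; not the actual setInputTensorsImpl code.
#include <cstddef>
#include <NvInferRuntime.h>

// Accept an INT8/UINT8 weight container when the engine tensor is FP4:
// safetensors has no FP4 dtype, so packed nibbles arrive as bytes.
bool isAcceptableWeightDtype(nvinfer1::DataType weightDtype, nvinfer1::DataType engineDtype)
{
    if (weightDtype == engineDtype)
    {
        return true;
    }
    bool const byteContainer
        = weightDtype == nvinfer1::DataType::kINT8 || weightDtype == nvinfer1::DataType::kUINT8;
    return byteContainer && engineDtype == nvinfer1::DataType::kFP4;
}

// Packed FP4 stores two elements per byte, so the buffer size is 0.5 byte/elem.
std::size_t packedFp4SizeInBytes(std::size_t numFp4Elements)
{
    return (numFp4Elements + 1) / 2;
}
```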
Workaround used:
We successfully ran the model by patching tllmRuntime.cpp to manually use cudaMalloc with size / 2 and wrapping it via ITensor::wrap.
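For reference, a simplified sketch of that workaround is below. allocatePackedFp4 is an illustrative helper, not the actual patch code; the real change lives inside tllmRuntime.cpp and hands the pointer to ITensor::wrap, whose exact overload is omitted here:

```cpp
// Illustrative sketch of the workaround; variable names and the exact
// integration point differ in the real patch.
#include <cstddef>
#include <cuda_runtime_api.h>

// Allocate device memory for packed FP4 weights at 0.5 byte per element
// instead of the 1 byte per element the default managed-weights path assumes.
void* allocatePackedFp4(std::size_t numFp4Elements)
{
    std::size_t const packedBytes = (numFp4Elements + 1) / 2;
    void* devicePtr = nullptr;
    if (cudaMalloc(&devicePtr, packedBytes) != cudaSuccess)
    {
        return nullptr; // the real patch logs and propagates the CUDA error
    }
    // The patch then wraps devicePtr via ITensor::wrap(...) so the runtime sees
    // a tensor with the engine's FP4 shape backed by this half-size buffer.
    return devicePtr;
}
```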
Environment
TRT-LLM Version: v1.2.0rc4 (Docker: release:1.2.0rc4)
GPU: RTX 5090 (Blackwell)
Model: Qwen 3 30B MoE (NVFP4)
Full patch and tutorial available here:
https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
- Quantize model to NVFP4:
python3 examples/quantization/quantize.py \
    --model_dir Qwen/Qwen3-30B-A3B-Instruct \
    --qformat nvfp4 \
    --dtype bfloat16 \
    --output_dir ./Qwen3-30B-NVFP4-Ckpt
- Build engine with managed weights:
trtllm-build \
    --checkpoint_dir ./Qwen3-30B-NVFP4-Ckpt \
    --gemm_plugin nvfp4 \
    --max_batch_size 1 \
    --max_seq_len 4096 \
    --output_dir ./Qwen3-30B-NVFP4-Engine
- Run server:
trtllm-serve serve ./Qwen3-30B-NVFP4-Engine \
    --tokenizer Qwen/Qwen3-30B-A3B-Instruct
Error:
[TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())
(../tensorrt_llm/runtime/tllmRuntime.cpp:822)
Expected behavior
- The runtime should accept INT8/UINT8 packed weights when the engine expects FP4 (this is the standard safetensors representation for 4-bit packed data).
- BufferManager should allocate 0.5 bytes per FP4 element, not 1 byte.
- The model should load with ~15GB of VRAM for a 30B NVFP4 model.
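Rough arithmetic behind these figures (weights only; block scales, KV cache, and activations are extra):

30e9 elements × 4 bits ÷ 8 bits/byte ≈ 15 GB (packed FP4)
30e9 elements × 8 bits ÷ 8 bits/byte ≈ 30 GB (stored as 1 byte/element)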
actual behavior
- Type mismatch assertion fails immediately: the runtime rejects INT8 weights for FP4 engine tensors.
- When the assertion is bypassed, BufferManager allocates 2x the expected memory (30GB instead of 15GB for a 30B model).
- This results in OOM on a 32GB GPU (RTX 5090), where the model should fit comfortably.
additional notes
Environment:
- TRT-LLM: v1.2.0rc4 (Docker: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4)
- GPU: RTX 5090 32GB (Blackwell)
- Driver: 580.xx (CUDA 13.0)
Root cause analysis:
Issue is in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, function setInputTensorsImpl around line 820.
Workaround:
I created a patch that uses direct cudaMalloc to bypass the faulty allocation path.
It achieves 135 tok/s with 24GB VRAM usage.
Full patch: https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.