Description
System Info
When using trtllm-build with --gemm_plugin nvfp4 and managed weights (the default behavior, or explicit), the runtime fails to load the model due to a strict type mismatch between what the engine expects (DataType::kFP4) and what the safetensors container provides (DataType::kINT8 / kUINT8).
Additionally, bypassing the type check reveals that the BufferManager allocates memory from the element count as if each element were 8-bit (1 byte/elem) rather than packed 4-bit (0.5 byte/elem), doubling VRAM usage for the weights.
Reproduction Steps
- Quantize Qwen 2.5/3 to NVFP4 using modelopt.
- Build the engine with managed weights:
trtllm-build --checkpoint_dir ... --gemm_plugin nvfp4 --output_dir ...
(Note: managed weights are generated as .safetensors with I8 or U8 dtype, since safetensors lacks FP4 support.)
- Run trtllm-serve.
Observed Behavior
- Type Mismatch Error:
The runtime crashes immediately in tllmRuntime.cpp:
[TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str()) Weight ... has dtype INT8 but engine expects FP4
- Double Allocation (OOM):
If the type assertion is removed/bypassed, the model loads but consumes ~30GB VRAM for a 30B model (expected ~15GB), indicating the allocator is not handling the packed 4-bit stride correctly for managed weights.
Proposed Fix (Analysis)
The issue is located in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, inside setInputTensorsImpl.
The runtime needs logic to handle NVFP4 packed weights specifically (see the sketch below):
- Accept INT8/UINT8 input tensors when the engine expects FP4.
- Allocate memory using the packed size (0.5 bytes per element), not num_elements * 1 byte.
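A minimal sketch of the check we have in mind is below. isAcceptableWeightDtype and packedFp4SizeInBytes are illustrative names, not existing TensorRT-LLM functions, and we assume the TensorRT headers in use define nvinfer1::DataType::kFP4 (as the engine's error message suggests):

```cpp
// Illustrative sketch only; not the actual setInputTensorsImpl code.
#include <cstddef>
#include <NvInferRuntime.h>

// Accept an INT8/UINT8 weight container when the engine tensor is FP4:
// safetensors has no FP4 dtype, so packed nibbles arrive as bytes.
bool isAcceptableWeightDtype(nvinfer1::DataType weightDtype, nvinfer1::DataType engineDtype)
{
    if (weightDtype == engineDtype)
    {
        return true;
    }
    bool const byteContainer
        = weightDtype == nvinfer1::DataType::kINT8 || weightDtype == nvinfer1::DataType::kUINT8;
    return byteContainer && engineDtype == nvinfer1::DataType::kFP4;
}

// Packed FP4 stores two elements per byte, so the buffer size is 0.5 byte/elem.
std::size_t packedFp4SizeInBytes(std::size_t numFp4Elements)
{
    return (numFp4Elements + 1) / 2;
}
```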
Workaround used:
We successfully ran the model by patching tllmRuntime.cpp to manually use cudaMalloc with size / 2 and wrapping it via ITensor::wrap.
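For reference, a simplified sketch of that workaround is below. allocatePackedFp4 is an illustrative helper, not the actual patch code; the real change lives inside tllmRuntime.cpp and hands the pointer to ITensor::wrap, whose exact overload is omitted here:

```cpp
// Illustrative sketch of the workaround; variable names and the exact
// integration point differ in the real patch.
#include <cstddef>
#include <cuda_runtime_api.h>

// Allocate device memory for packed FP4 weights at 0.5 byte per element
// instead of the 1 byte per element the default managed-weights path assumes.
void* allocatePackedFp4(std::size_t numFp4Elements)
{
    std::size_t const packedBytes = (numFp4Elements + 1) / 2;
    void* devicePtr = nullptr;
    if (cudaMalloc(&devicePtr, packedBytes) != cudaSuccess)
    {
        return nullptr; // the real patch logs and propagates the CUDA error
    }
    // The patch then wraps devicePtr via ITensor::wrap(...) so the runtime sees
    // a tensor with the engine's FP4 shape backed by this half-size buffer.
    return devicePtr;
}
```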
Environment
TRT-LLM Version: v1.2.0rc4 (Docker: release:1.2.0rc4)
GPU: RTX 5090 (Blackwell)
Model: Qwen 3 30B MoE (NVFP4)
Full patch and tutorial available here:
https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
Who can help?
Information
- The official example scripts
- My own modified scripts
Tasks
- An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
- My own task or dataset (give details below)
Reproduction
Steps to reproduce:
- Quantize model to NVFP4:
python3 examples/quantization/quantize.py \
    --model_dir Qwen/Qwen3-30B-A3B-Instruct \
    --qformat nvfp4 \
    --dtype bfloat16 \
    --output_dir ./Qwen3-30B-NVFP4-Ckpt
- Build engine with managed weights:
trtllm-build \
    --checkpoint_dir ./Qwen3-30B-NVFP4-Ckpt \
    --gemm_plugin nvfp4 \
    --max_batch_size 1 \
    --max_seq_len 4096 \
    --output_dir ./Qwen3-30B-NVFP4-Engine
- Run server:
trtllm-serve serve ./Qwen3-30B-NVFP4-Engine \
    --tokenizer Qwen/Qwen3-30B-A3B-Instruct
Error:
[TensorRT-LLM][ERROR] Assertion failed: weight->dtype() == engine.getTensorDataType(name.c_str())
(../tensorrt_llm/runtime/tllmRuntime.cpp:822)
Expected behavior
- The runtime should accept INT8/UINT8 packed weights when the engine expects FP4 (this is the standard safetensors representation for 4-bit packed data).
- BufferManager should allocate 0.5 bytes per FP4 element, not 1 byte.
- The model should load with ~15GB of VRAM for a 30B NVFP4 model.
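Rough arithmetic behind these figures (weights only; block scales, KV cache, and activations are extra):

30e9 elements × 4 bits ÷ 8 bits/byte ≈ 15 GB (packed FP4)
30e9 elements × 8 bits ÷ 8 bits/byte ≈ 30 GB (stored as 1 byte/element)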
actual behavior
- Type mismatch assertion fails immediately: the runtime rejects INT8 weights for FP4 engine tensors.
- When the assertion is bypassed, BufferManager allocates 2x the expected memory (30GB instead of 15GB for a 30B model).
- This results in OOM on a 32GB GPU (RTX 5090), where the model should fit comfortably.
additional notes
Environment:
- TRT-LLM: v1.2.0rc4 (Docker: nvcr.io/nvidia/tensorrt-llm/release:1.2.0rc4)
- GPU: RTX 5090 32GB (Blackwell)
- Driver: 580.xx (CUDA 13.0)
Root cause analysis:
Issue is in cpp/tensorrt_llm/runtime/tllmRuntime.cpp, function setInputTensorsImpl around line 820.
Workaround:
I created a patch that uses direct cudaMalloc to bypass the faulty allocation path.
It achieves 135 tok/s with 24GB VRAM usage.
Full patch: https://github.com/JohnTDI-cpu/trtllm-nvfp4-blackwell-fix
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.