System Info
GB200 and GB300; H20 × 8 or 16
How would you like to use TensorRT-LLM
How can I use TensorRT-LLM to run MoE NVFP4 + attention FP8 inference testing for the DeepSeek R1 671B model?
Are the operations below correct? Are they feasible?
- Run quantization calibration, then build the engine (a checkpoint sanity check is sketched between the two commands):
python quantize.py \
  --model_dir /path/to/your_moe_model \   # path to the MoE model
  --qformat nvfp4 \                       # expert-layer quantization format (NVFP4)
  --attn_qformat fp8 \                    # attention-layer quantization format (FP8)
  --kv_cache_dtype fp8 \                  # FP8 KV cache (optional)
  --moe_num_experts 8 \                   # number of MoE experts (match the model)
  --moe_top_k 2 \                         # top-k experts selected per token
  --calib_size 512 \                      # calibration sample count (512 recommended)
  --output_dir /path/to/nvfp4_fp8_ckpt    # output directory for the quantized weights
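Before running trtllm-build, one way to sanity-check the calibration output is to inspect the config that quantize.py writes next to the weights. A minimal sketch; the exact file layout and field names depend on your TensorRT-LLM version:

# Hedged sketch: pretty-print the checkpoint config and check that the
# quantization section reflects the intended NVFP4/FP8 settings.
python -m json.tool /path/to/nvfp4_fp8_ckpt/config.json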
Then build the engine:
trtllm-build \
  --checkpoint_dir /path/to/nvfp4_fp8_ckpt \   # input quantized checkpoint
  --output_dir /path/to/moe_fp8_engine \       # output engine directory
  --use_fp8_context_fmha enable \              # enable FP8 FMHA for the context phase
  --moe_plugin auto \                          # enable the custom MoE kernels
  --moe_num_experts 8 \                        # match the model's expert count
  --moe_top_k 2 \                              # match the top-k used during calibration
  --dtype bfloat16 \                           # base precision (compatible with low-precision quantization)
  --tp_size 2                                  # tensor parallelism (set as needed)
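After the build, a quick smoke test is possible with the run.py script shipped in the TensorRT-LLM examples. A hedged sketch; flag names vary across releases, and the launch must use one rank per TP shard:

# Hedged sketch: smoke-test the built engine with examples/run.py from
# the TensorRT-LLM repo. -n 2 matches --tp_size 2 used at build time.
mpirun -n 2 python examples/run.py \
  --engine_dir /path/to/moe_fp8_engine \
  --tokenizer_dir /data2/DeepSeek-R1-671B \
  --input_text "Hello, world" \
  --max_output_len 32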
- Start the service and run an inference test (a request example follows the command):
trtllm-serve serve /data2/DeepSeek-R1-671B-NVFP4-FP8 \   # replace with the path to the quantized engine
  --tokenizer /data2/DeepSeek-R1-671B \                  # tokenizer path unchanged; reuse the original model's
  --host 0.0.0.0 \
  --port 8000 \
  --tp_size 8 \
  --ep_size 8 \
  --max_batch_size 64 \
  --max_seq_len 4096 \
  --max_num_tokens 16384 \
  --kv_cache_free_gpu_memory_fraction 0.9 \
  --trust_remote_code \
  --reasoning_parser deepseek-r1 \
  --backend tensorrtllm \                                # key: switch to the tensorrtllm backend for the compiled engine
  --enable_chunked_prefill \
  --quant_mode fp8 \                                     # declare the quantization mode, matching the quantization strategy
  --moe_num_experts 8                                    # MoE expert count, consistent with the quantize/build stages (adjust for your model)
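Once the server reports ready, trtllm-serve exposes an OpenAI-compatible API, so a minimal request test could look like the following. The "model" value is a placeholder; query /v1/models to see what the server actually registers:

# Minimal smoke test against the OpenAI-compatible chat endpoint.
# List the real model name first with: curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-671B-NVFP4-FP8",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "max_tokens": 64
      }'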
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.