
[Usage]: How to use TensorRT-LLM to perform MoE NVFP4 | ATTN FP8 inference testing for the DeepSeek-R1 671B model? #10306

@Alan-D-Chen

Description


System Info

GB200 and GB300; H20 × 8 or 16

How would you like to use TensorRT-LLM

How to use TensorRT-LLM to perform MoE NVFP4 | ATTN FP8 inference testing for the DeepSeek-R1 671B model?

Are the operations below correct? Are they feasible?

  1. Perform quantization calibration + engine building
# --model_dir        path to the MoE model
# --qformat          expert-layer quantization format (NVFP4)
# --attn_qformat     attention-layer quantization format (FP8)
# --kv_cache_dtype   use FP8 for the KV cache (optional)
# --moe_num_experts  number of MoE experts (must match the model)
# --moe_top_k        top-k experts selected per token
# --calib_size       calibration sample count (512 recommended)
# --output_dir       output directory for the quantized weights
python quantize.py \
  --model_dir /path/to/your_moe_model \
  --qformat nvfp4 \
  --attn_qformat fp8 \
  --kv_cache_dtype fp8 \
  --moe_num_experts 8 \
  --moe_top_k 2 \
  --calib_size 512 \
  --output_dir /path/to/nvfp4_fp8_ckpt
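
To confirm the calibration step produced a usable checkpoint, list the output directory. Assuming quantize.py completed normally, it should contain a config.json describing the quantization setup plus the sharded safetensors weights (the file names here are illustrative):

ls /path/to/nvfp4_fp8_ckpt
# expected, roughly: config.json  rank0.safetensors  rank1.safetensors ...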

Then build the engine:

# --checkpoint_dir        quantized checkpoint from the previous step
# --output_dir            engine output directory
# --use_fp8_context_fmha  enable FP8 FMHA for the context phase
# --moe_plugin            enable the custom MoE kernels
# --moe_num_experts       must match the model's expert count
# --moe_top_k             must match the top-k used during calibration
# --dtype                 base precision (compatible with the low-precision quantization)
# --tp_size               tensor-parallel degree (set as needed)
trtllm-build \
  --checkpoint_dir /path/to/nvfp4_fp8_ckpt \
  --output_dir /path/to/moe_fp8_engine \
  --use_fp8_context_fmha enable \
  --moe_plugin auto \
  --moe_num_experts 8 \
  --moe_top_k 2 \
  --dtype bfloat16 \
  --tp_size 2
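
As a quick smoke test of the built engine before standing up a server, the run.py example shipped in the TensorRT-LLM repository can be used. The paths and prompt below are placeholders, a sketch rather than a command verified against this exact setup:

python examples/run.py \
  --engine_dir /path/to/moe_fp8_engine \
  --tokenizer_dir /data2/DeepSeek-R1-671B \
  --max_output_len 64 \
  --input_text "Explain the difference between NVFP4 and FP8."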

  2. Perform service startup + inference testing
# Positional argument       replace with the path to the quantized engine
# --tokenizer              unchanged; reuse the original model's tokenizer
# --backend tensorrtllm    key point: switch to the tensorrtllm backend to match the compiled engine
# --quant_mode fp8         declare the quantization mode, matching the quantization strategy
# --moe_num_experts 8      MoE expert count, consistent with the quantize/build steps (adjust for your model)
trtllm-serve serve /data2/DeepSeek-R1-671B-NVFP4-FP8 \
  --tokenizer /data2/DeepSeek-R1-671B \
  --host 0.0.0.0 \
  --port 8000 \
  --tp_size 8 \
  --ep_size 8 \
  --max_batch_size 64 \
  --max_seq_len 4096 \
  --max_num_tokens 16384 \
  --kv_cache_free_gpu_memory_fraction 0.9 \
  --trust_remote_code \
  --reasoning_parser deepseek-r1 \
  --backend tensorrtllm \
  --enable_chunked_prefill \
  --quant_mode fp8 \
  --moe_num_experts 8
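
Once the server is up, inference testing can go through the OpenAI-compatible chat completions endpoint that trtllm-serve exposes. A minimal request is sketched below; the model name is an assumption and should match whatever the server reports under /v1/models:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-671B-NVFP4-FP8",
        "messages": [{"role": "user", "content": "Why is the sky blue?"}],
        "max_tokens": 64
      }'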

Before submitting a new issue...

  • Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.
