System Info
GB200 and GB300; H20 × 8 or 16
How would you like to use TensorRT-LLM
How can I use TensorRT-LLM to run MoE NVFP4 + attention FP8 inference testing for the DeepSeek R1 671B model?
Are the operations below correct? Are they feasible?
- Run quantization calibration, then build the engine (a checkpoint sanity check is sketched between the two commands):
python quantize.py \
  --model_dir /path/to/your_moe_model \   # path to the MoE model
  --qformat nvfp4 \                       # expert-layer quantization format (NVFP4)
  --attn_qformat fp8 \                    # attention-layer quantization format (FP8)
  --kv_cache_dtype fp8 \                  # FP8 KV cache (optional)
  --moe_num_experts 8 \                   # number of MoE experts (match the model)
  --moe_top_k 2 \                         # top-k experts selected per token
  --calib_size 512 \                      # calibration sample count (512 recommended)
  --output_dir /path/to/nvfp4_fp8_ckpt    # output directory for the quantized weights
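Before running trtllm-build, one way to sanity-check the calibration output is to inspect the config that quantize.py writes next to the weights. A minimal sketch; the exact file layout and field names depend on your TensorRT-LLM version:

# Hedged sketch: pretty-print the checkpoint config and check that the
# quantization section reflects the intended NVFP4/FP8 settings.
python -m json.tool /path/to/nvfp4_fp8_ckpt/config.json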
Then build the engine:
trtllm-build \
  --checkpoint_dir /path/to/nvfp4_fp8_ckpt \   # input quantized checkpoint
  --output_dir /path/to/moe_fp8_engine \       # output engine directory
  --use_fp8_context_fmha enable \              # enable FP8 FMHA for the context phase
  --moe_plugin auto \                          # enable the custom MoE kernels
  --moe_num_experts 8 \                        # match the model's expert count
  --moe_top_k 2 \                              # match the top-k used during calibration
  --dtype bfloat16 \                           # base precision (compatible with low-precision quantization)
  --tp_size 2                                  # tensor parallelism (set as needed)
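After the build, a quick smoke test is possible with the run.py script shipped in the TensorRT-LLM examples. A hedged sketch; flag names vary across releases, and the launch must use one rank per TP shard:

# Hedged sketch: smoke-test the built engine with examples/run.py from
# the TensorRT-LLM repo. -n 2 matches --tp_size 2 used at build time.
mpirun -n 2 python examples/run.py \
  --engine_dir /path/to/moe_fp8_engine \
  --tokenizer_dir /data2/DeepSeek-R1-671B \
  --input_text "Hello, world" \
  --max_output_len 32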
- Start the service and run an inference test (a request example follows the command):
trtllm-serve serve /data2/DeepSeek-R1-671B-NVFP4-FP8 \   # replace with the path to the quantized engine
  --tokenizer /data2/DeepSeek-R1-671B \                  # tokenizer path unchanged; reuse the original model's
  --host 0.0.0.0 \
  --port 8000 \
  --tp_size 8 \
  --ep_size 8 \
  --max_batch_size 64 \
  --max_seq_len 4096 \
  --max_num_tokens 16384 \
  --kv_cache_free_gpu_memory_fraction 0.9 \
  --trust_remote_code \
  --reasoning_parser deepseek-r1 \
  --backend tensorrtllm \                                # key: switch to the tensorrtllm backend for the compiled engine
  --enable_chunked_prefill \
  --quant_mode fp8 \                                     # declare the quantization mode, matching the quantization strategy
  --moe_num_experts 8                                    # MoE expert count, consistent with the quantize/build stages (adjust for your model)
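Once the server reports ready, trtllm-serve exposes an OpenAI-compatible API, so a minimal request test could look like the following. The "model" value is a placeholder; query /v1/models to see what the server actually registers:

# Minimal smoke test against the OpenAI-compatible chat endpoint.
# List the real model name first with: curl http://localhost:8000/v1/models
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "DeepSeek-R1-671B-NVFP4-FP8",
        "messages": [{"role": "user", "content": "What is 2 + 2?"}],
        "max_tokens": 64
      }'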
Before submitting a new issue...
- Make sure you already searched for relevant issues, and checked the documentation and examples for answers to frequently asked questions.