
Commit 8824e8b

Author: bruce.xu

support W4AFP8 quant in V3.1
Signed-off-by: bruce.xu <[email protected]>
1 parent 76e8ce2 commit 8824e8b

File tree

2 files changed (+21 / -4 lines)


examples/deepseek/README.md

Lines changed: 17 additions & 0 deletions
@@ -43,3 +43,20 @@ We provide a one-step-script which will:
 ```bash
 ./quantize_fp8_to_nvfp4.sh --amax_path $FP4_QUANT_PATH --fp4_output_path $HF_FP4_PATH --fp8_hf_path $HF_FP8_CKPT --world_size 8
 ```
+
+#### W4AFP8 for V3 & R1
+
+First, prepare a TensorRT-LLM image, for example:
+```bash
+docker run --rm -it --ipc=host --ulimit memlock=-1 --ulimit stack=67108864 --gpus=all \
+  nvcr.io/nvidia/tensorrt-llm/release
+```
+Then run ModelOpt inside the container as described in the [TensorRT-LLM example](https://github.com/NVIDIA/TensorRT-LLM/blob/main/examples/models/core/deepseek_v3/README.md?plain=1).
+Note that simply using the latest DeepSeek-V3 repository is fine, because commit 1398800 has a dtype bug in the bias proto.
+
+#### W4AFP8 for V3.1
+
+The basic procedure is the same as for V3.
+But note two points (a config sketch follows this diff):
+1. Use config_v3.1.json, or add "scale_fmt": "ue8m0" to config_671B.json. ue8m0 matters because it was used in the training of V3.1.
+2. Set gemm_impl to fp8 (the default is bf16) to enable the ue8m0 quant kernel.
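
For reference, here is a minimal sketch of the config change from point 1. The file names (config_671B.json, config_v3.1.json) come from the README text above; this is only an illustration, not an official script, and all other config keys are left untouched.

```python
# Sketch only: derive a V3.1-style config from an existing config_671B.json.
# Assumes the file exists and is valid JSON; every other key is preserved.
import json

with open("config_671B.json") as f:
    cfg = json.load(f)

# ue8m0 scales were used in the training of V3.1, so the quantization
# config must carry the same scale format.
cfg["scale_fmt"] = "ue8m0"

with open("config_v3.1.json", "w") as f:
    json.dump(cfg, f, indent=2)
```

Remember that gemm_impl also has to be switched from its bf16 default to fp8, as noted in point 2; how that is set depends on the script you run.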

examples/deepseek/ptq.py

Lines changed: 4 additions & 4 deletions
@@ -43,7 +43,7 @@
 import os
 import sys
 from pathlib import Path
-from typing import Literal
+from typing import Optional, Literal
 
 import torch
 import torch.distributed as dist
@@ -79,6 +79,7 @@ def linear(
     bias: torch.Tensor | None = None,
     act_quantizer: TensorQuantizer | None = None,
     weight_quantizer: TensorQuantizer | None = None,
+    scale_fmt: Optional[str] = None,
 ) -> torch.Tensor:
     if weight.element_size() > 1:
         if act_quantizer is not None:
@@ -95,9 +96,7 @@ def linear(
 
         return F.linear(x, weight, bias)
     else:
-        assert weight_quantizer is None
-        assert act_quantizer is None
-        x, scale = act_quant(x, block_size)
+        x, scale = act_quant(x, block_size, scale_fmt)
         y = fp8_gemm(x, scale, weight, weight.scale)
         if bias is not None:
             y += bias
@@ -174,6 +173,7 @@ def forward(self, x: torch.Tensor) -> torch.Tensor:
             self.bias,
             act_quantizer=self.input_quantizer,
             weight_quantizer=self.weight_quantizer,
+            scale_fmt=self.scale_fmt,
         )
         return y
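
To make the ptq.py change easier to follow, here is an illustrative sketch of what a scale_fmt of "ue8m0" can mean for per-block FP8 activation quantization: the per-block scale is rounded up to a power of two so it fits an unsigned 8-bit exponent with zero mantissa bits. This is not the repository's act_quant; the function name, block layout, and clamp values are assumptions made for illustration.

```python
# Illustrative sketch of ue8m0-style per-block FP8 activation quantization.
# Not the repo's act_quant; names and block layout only mirror the diff above.
import torch

FP8_E4M3_MAX = 448.0  # max magnitude representable in float8_e4m3fn


def act_quant_sketch(x: torch.Tensor, block_size: int = 128, scale_fmt: str | None = None):
    # Assumes the last dimension of x is divisible by block_size.
    orig_shape = x.shape
    x = x.reshape(-1, block_size)
    amax = x.abs().amax(dim=-1, keepdim=True).clamp_min(1e-4)
    scale = amax / FP8_E4M3_MAX
    if scale_fmt == "ue8m0":
        # ue8m0: unsigned, 8 exponent bits, 0 mantissa bits -> the scale is a
        # pure power of two; round up so no value in the block overflows FP8.
        scale = torch.exp2(torch.ceil(torch.log2(scale)))
    x_q = (x / scale).clamp(-FP8_E4M3_MAX, FP8_E4M3_MAX).to(torch.float8_e4m3fn)
    return x_q.reshape(orig_shape), scale
```

With scale_fmt left as None the sketch falls back to ordinary per-block FP8 scaling, which roughly corresponds to the behaviour before this commit.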
