
Commit ece3a87

[None][doc] Update doc for NVFP4 KV cache (#9475)
Signed-off-by: Tian Zheng <[email protected]>
Parent: 2f03031


docs/source/features/quantization.md

Lines changed: 57 additions & 30 deletions
@@ -11,6 +11,7 @@ TensorRT LLM offers a variety of quantization recipes to optimize LLM inference.
 * FP8 Block Scaling
 * FP8 Rowwise
 * FP8 KV Cache
+* NVFP4 KV Cache
 * W4A16 GPTQ
 * W4A8 GPTQ
 * W4A16 AWQ
@@ -47,6 +48,20 @@ llm = LLM(model='/path/to/model',
 llm.generate("Hello, my name is")
 ```
 
+#### NVFP4 KV Cache
+
+To enable NVFP4 KV cache, offline quantization with ModelOpt is required. Please follow the section below for instructions.
+After the quantization is done, the NVFP4 KV cache option can be set as follows:
+
+```python
+from tensorrt_llm import LLM
+from tensorrt_llm.llmapi import KvCacheConfig
+llm = LLM(model='/path/to/model',
+          kv_cache_config=KvCacheConfig(dtype='nvfp4'))
+llm.generate("Hello, my name is")
+```
+
+
 ### Offline Quantization with ModelOpt
 
 If a pre-quantized model is not available on the [Hugging Face Hub](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4), you can quantize it offline using ModelOpt.
@@ -56,33 +71,45 @@ Follow this step-by-step guide to quantize a model:
 ```bash
 git clone https://github.com/NVIDIA/Model-Optimizer.git
 cd Model-Optimizer/examples/llm_ptq
-scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --export_fmt hf
+scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8
+```
+
+#### NVFP4 KV Cache
+
+To generate the checkpoint for NVFP4 KV cache:
+
+```bash
+git clone https://github.com/NVIDIA/Model-Optimizer.git
+cd Model-Optimizer/examples/llm_ptq
+scripts/huggingface_example.sh --model <huggingface_model_card> --quant fp8 --kv_cache_quant nvfp4
 ```
 
+Note that TensorRT LLM currently only supports FP8 weight/activation quantization when NVFP4 KV cache is enabled. Therefore, `--quant fp8` is required here.
+
 ## Model Supported Matrix
 
-| Model | NVFP4 | MXFP4 | FP8(per tensor)| FP8(block scaling) | FP8(rowwise) | FP8 KV Cache |W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
-| :------------- | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :-------: | :--------: | :--------: |
-| BERT | . | . | . | . | . | Y | . | . | . | . |
-| DeepSeek-R1 | Y | . | . | Y | . | Y | . | . | . | . |
-| EXAONE | . | . | Y | . | . | Y | Y | Y | . | . |
-| Gemma 3 | . | . | Y | . | . | Y | Y | Y | . | . |
-| GPT-OSS | . | Y | . | . | . | Y | . | . | . | . |
-| LLaMA | Y | . | Y | . | . | Y | . | Y | . | Y |
-| LLaMA-v2 | Y | . | Y | . | . | Y | Y | Y | . | Y |
-| LLaMA 3 | . | . | . | . | Y | Y | Y | . | . | . |
-| LLaMA 4 | Y | . | Y | . | . | Y | . | . | . | . |
-| Mistral | . | . | Y | . | . | Y | . | Y | . | . |
-| Mixtral | Y | . | Y | . | . | Y | . | . | . | . |
-| Phi | . | . | . | . | . | Y | Y | . | . | . |
-| Qwen | . | . | . | . | . | Y | Y | Y | . | Y |
-| Qwen-2/2.5 | Y | . | Y | . | . | Y | Y | Y | . | Y |
-| Qwen-3 | Y | . | Y | . | . | Y | . | Y | . | Y |
-| BLIP2-OPT | . | . | . | . | . | Y | . | . | . | . |
-| BLIP2-T5 | . | . | . | . | . | Y | . | . | . | . |
-| LLaVA | . | . | Y | . | . | Y | . | Y | . | Y |
-| VILA | . | . | Y | . | . | Y | . | Y | . | Y |
-| Nougat | . | . | . | . | . | Y | . | . | . | . |
+| Model | NVFP4 | MXFP4 | FP8(per tensor) | FP8(block scaling) | FP8(rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
+| :------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :-------: | :--------: | :--------: |
+| BERT | . | . | . | . | . | Y | . | . | . | . | . |
+| DeepSeek-R1 | Y | . | . | Y | . | Y | . | . | . | . | . |
+| EXAONE | . | . | Y | . | . | Y | . | Y | Y | . | . |
+| Gemma 3 | . | . | Y | . | . | Y | . | Y | Y | . | . |
+| GPT-OSS | . | Y | . | . | . | Y | . | . | . | . | . |
+| LLaMA | Y | . | Y | . | . | Y | . | . | Y | . | Y |
+| LLaMA-v2 | Y | . | Y | . | . | Y | Y | Y | Y | . | Y |
+| LLaMA 3 | . | . | . | . | Y | Y | Y | Y | . | . | . |
+| LLaMA 4 | Y | . | Y | . | . | Y | . | . | . | . | . |
+| Mistral | . | . | Y | . | . | Y | . | . | Y | . | . |
+| Mixtral | Y | . | Y | . | . | Y | . | . | . | . | . |
+| Phi | . | . | . | . | . | Y | . | Y | . | . | . |
+| Qwen | . | . | . | . | . | Y | . | Y | Y | . | Y |
+| Qwen-2/2.5 | Y | . | Y | . | . | Y | . | Y | Y | . | Y |
+| Qwen-3 | Y | . | Y | . | . | Y | Y | . | Y | . | Y |
+| BLIP2-OPT | . | . | . | . | . | Y | . | . | . | . | . |
+| BLIP2-T5 | . | . | . | . | . | Y | . | . | . | . | . |
+| LLaVA | . | . | Y | . | . | Y | . | . | Y | . | Y |
+| VILA | . | . | Y | . | . | Y | . | . | Y | . | Y |
+| Nougat | . | . | . | . | . | Y | . | . | . | . | . |
 
 
 ```{note}
@@ -93,13 +120,13 @@ The language component decides which quantization methods are supported by a given model.
 
 ## Hardware Support Matrix
 
-| Model | NVFP4 | MXFP4 | FP8(per tensor)| FP8(block scaling) | FP8(rowwise) | FP8 KV Cache |W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
-| :------------- | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :-------: | :--------: | :--------: |
-| Blackwell(sm120) | Y | Y | Y | . | . | Y | . | . | . | . |
-| Blackwell(sm100) | Y | Y | Y | Y | . | Y | . | . | . | . |
-| Hopper | . | . | Y | Y | Y | Y | Y | Y | Y | Y |
-| Ada Lovelace | . | . | Y | . | . | Y | Y | Y | Y | Y |
-| Ampere | . | . | . | . | . | Y | . | Y | . | Y |
+| GPU Architecture | NVFP4 | MXFP4 | FP8(per tensor) | FP8(block scaling) | FP8(rowwise) | FP8 KV Cache | NVFP4 KV Cache | W4A8 AWQ | W4A16 AWQ | W4A8 GPTQ | W4A16 GPTQ |
+| :------------- | :---: | :---: | :---: | :---: | :---: | :---: | :---: | :-------: | :-------: | :--------: | :--------: |
+| Blackwell(sm120) | Y | Y | Y | . | . | Y | . | . | . | . | . |
+| Blackwell(sm100) | Y | Y | Y | Y | . | Y | Y | . | . | . | . |
+| Hopper | . | . | Y | Y | Y | Y | . | Y | Y | Y | Y |
+| Ada Lovelace | . | . | Y | . | . | Y | . | Y | Y | Y | Y |
+| Ampere | . | . | . | . | . | Y | . | . | Y | . | Y |
 ```{note}
 FP8 block wise scaling GEMM kernels for sm100 are using MXFP8 recipe (E4M3 act/weight and UE8M0 act/weight scale), which is slightly different from SM90 FP8 recipe (E4M3 act/weight and FP32 act/weight scale).
 ```
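
To make the scale-format difference in that note concrete, here is a rough NumPy illustration (a sketch, not from the doc), assuming amax-based per-block scaling over the 32-element MX block size: an FP32 scale can take any positive value, while a UE8M0 scale is an unsigned 8-bit exponent and is therefore restricted to powers of two.

```python
import numpy as np

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp32_scale(block: np.ndarray) -> float:
    # SM90-style recipe: the per-block scale is stored as an arbitrary FP32 value.
    return float(np.abs(block).max() / E4M3_MAX)

def ue8m0_scale(block: np.ndarray) -> float:
    # MXFP8-style recipe (sm100): the scale is stored as UE8M0, an unsigned
    # 8-bit exponent, so it must be a power of two; round the FP32 scale up.
    return float(2.0 ** np.ceil(np.log2(fp32_scale(block))))

block = np.random.randn(32).astype(np.float32)  # MX formats use 32-element blocks
print(f"FP32 scale:  {fp32_scale(block):.6f}")
print(f"UE8M0 scale: {ue8m0_scale(block):.6f}")
```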
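
Putting the two NVFP4 KV cache steps from this change together, a minimal end-to-end sketch could look like the following; the checkpoint directory `./model-fp8-nvfp4kv` is a hypothetical ModelOpt export path, not something named in the doc.

```python
# Step 1 (shell, from Model-Optimizer/examples/llm_ptq): produce an FP8 checkpoint
# with NVFP4 KV cache calibration, e.g.
#   scripts/huggingface_example.sh --model <huggingface_model_card> \
#       --quant fp8 --kv_cache_quant nvfp4
#
# Step 2 (Python): load the exported checkpoint and request the NVFP4 KV cache dtype.
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(model='./model-fp8-nvfp4kv',  # hypothetical ModelOpt export directory
          kv_cache_config=KvCacheConfig(dtype='nvfp4'))
print(llm.generate("Hello, my name is"))
```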
