To enable NVFP4 KV cache, offline quantization with ModelOpt is required. Please follow the section below for instructions.

After quantization is done, the NVFP4 KV cache option can be set as follows:
```python
from tensorrt_llm import LLM
from tensorrt_llm.llmapi import KvCacheConfig

llm = LLM(model='/path/to/model',
          kv_cache_config=KvCacheConfig(dtype='nvfp4'))
llm.generate("Hello, my name is")
```
### Offline Quantization with ModelOpt
If a pre-quantized model is not available on the [Hugging Face Hub](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4), you can quantize it offline using ModelOpt.
Follow this step-by-step guide to quantize a model:
Note that currently TRT-LLM only supports FP8 weight/activation quantization when NVFP4 KV cache is enabled. Therefore, `--quant fp8` is required here.
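Since the guide's exact commands are not reproduced here, the following is only an illustrative sketch of an FP8 weight/activation PTQ flow using ModelOpt's Python API (`mtq.quantize` with `FP8_DEFAULT_CFG` and the `export_hf_checkpoint` utility). The model path, calibration prompt, and export directory are placeholders, and the options that also produce NVFP4 KV-cache scales should be taken from the step-by-step guide above.

```python
# Illustrative sketch only: FP8 weight/activation PTQ with ModelOpt's Python API.
# Paths and the calibration prompt are placeholders; this covers only the FP8
# weight/activation part -- follow the guide above for the exact options that
# also produce NVFP4 KV-cache scales.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

model_dir = "/path/to/model"
model = AutoModelForCausalLM.from_pretrained(model_dir, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(model_dir)

def forward_loop(model):
    # Run a small calibration set through the model so ModelOpt can collect
    # activation statistics for the FP8 scaling factors.
    inputs = tokenizer("Hello, my name is", return_tensors="pt").to(model.device)
    model(**inputs)

# FP8_DEFAULT_CFG quantizes weights and activations to FP8 (E4M3).
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)

# Export a quantized Hugging Face-style checkpoint that TRT-LLM can load.
export_hf_checkpoint(model, export_dir="/path/to/quantized-model")
```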
| Blackwell(sm120) | Y | Y | Y | . | . | Y | . | . | . | . | . |
| Blackwell(sm100) | Y | Y | Y | Y | . | Y | Y | . | . | . | . |
| Hopper | . | . | Y | Y | Y | Y | . | Y | Y | Y | Y |
| Ada Lovelace | . | . | Y | . | . | Y | . | Y | Y | Y | Y |
| Ampere | . | . | . | . | . | Y | . | . | Y | . | Y |
```{note}
FP8 block-wise scaling GEMM kernels for sm100 use the MXFP8 recipe (E4M3 act/weight and UE8M0 act/weight scale), which is slightly different from the SM90 FP8 recipe (E4M3 act/weight and FP32 act/weight scale).
```
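To make the difference between the two recipes concrete, here is a minimal sketch contrasting an FP32 per-block scale with a UE8M0 power-of-two scale. The `amax / E4M3_MAX` formula, the rounding direction, and the sample value are illustrative assumptions, not the exact kernel implementation.

```python
# Illustrative sketch: FP32 per-block scale (SM90-style recipe) vs. UE8M0
# power-of-two scale (MXFP8-style recipe on sm100). The scaling formula and
# rounding direction are assumptions for illustration only.
import math

E4M3_MAX = 448.0  # largest finite magnitude representable in FP8 E4M3

def fp32_block_scale(block_amax: float) -> float:
    # SM90 recipe: the per-block scale can be any FP32 value.
    return block_amax / E4M3_MAX

def ue8m0_block_scale(block_amax: float) -> float:
    # MXFP8 recipe: the scale is restricted to a power of two and stored as an
    # unsigned 8-bit exponent (UE8M0); round the exponent up to avoid overflow.
    return 2.0 ** math.ceil(math.log2(block_amax / E4M3_MAX))

block_amax = 3.7  # hypothetical per-block absolute maximum
print(fp32_block_scale(block_amax))   # ~0.00826, an arbitrary FP32 value
print(ue8m0_block_scale(block_amax))  # 0.015625 = 2**-6, the next power of two
```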