CHANGELOG.rst: 3 additions & 1 deletion

@@ -10,14 +10,16 @@ Model Optimizer Changelog (Linux)

 **New Features**

-- New model support in the ``llm_ptq`` example: OpenAI Whisper.
+- New model support in the ``llm_ptq`` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MOE.
 - Blockwise FP8 quantization support in unified model export.
 - Add quantization support to the Transformer Engine Linear module.
 - Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
 - To support distributed checkpoint resume with expert parallelism (EP), ``modelopt_state`` in the Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored differently. The legacy ``modelopt_state`` in distributed checkpoints generated by previous ModelOpt versions can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
 - Add a Triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
 - Add a new API :meth:`mtq.compress <modelopt.torch.quantization.compress>` for compressing model weights after quantization.
 - Add an option to simplify the ONNX model before quantization is performed.
+- Add FP4 KV cache support for unified HF and TensorRT-LLM export.
+- Add speculative decoding support to Multi-Token Prediction (MTP) in Megatron Core models.
 - (Experimental) Improve support for ONNX models with custom TensorRT op:
   - Add support for the ``--calibration_shapes`` flag.
   - Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
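
To make a few of the entries above concrete, here is a minimal sketch of post-training quantization followed by the new weight compression and a unified Hugging Face export. The config name (`NVFP4_DEFAULT_CFG`), the calibration loop, and the exact `mtq.compress` / `export_hf_checkpoint` signatures are illustrative assumptions, not checked against this release:

```python
# Minimal sketch (assumptions noted inline), not a verified end-to-end recipe.
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint  # assumed unified HF export entry point
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-3.1-8B-Instruct"  # placeholder model
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

def forward_loop(m):
    # Calibration: run a few representative prompts through the model.
    batch = tokenizer("Quantization calibration sample.", return_tensors="pt")
    m(**batch)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)  # insert quantizers and calibrate
mtq.compress(model)  # new API from the changelog; assumed to compress quantized weights in place
export_hf_checkpoint(model, export_dir="llama-nvfp4-hf")  # export_dir kwarg is an assumption
```

The `hf_ptq.py` script shown further down wraps this kind of flow with dataset-based calibration and command-line export options.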

README.md: 1 addition & 0 deletions

@@ -18,6 +18,7 @@

 ## Latest News

+- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
 - [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 - [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
 - [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).

examples/llm_ptq/README.md

 > *Calibration by default uses the left `padding_side` for the Hugging Face tokenizer, as it usually leads to lower accuracy loss. The exported tokenizer files restore the default `padding_side`.*
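
As a small illustration of that note (not taken from `hf_ptq.py`; the model name is a placeholder), left padding on a Hugging Face tokenizer looks like this:

```python
# Sketch: left padding for calibration batches with a Hugging Face tokenizer.
# The model name is a placeholder; Llama-family tokenizers need a pad token assigned.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-3.1-8B-Instruct", padding_side="left")
tokenizer.pad_token = tokenizer.eos_token  # reuse EOS as the pad token

batch = tokenizer(
    ["short prompt", "a somewhat longer calibration prompt"],
    padding=True,
    return_tensors="pt",
)
# Padding now sits on the left, so the last positions of every row hold real tokens,
# which is where calibration statistics are collected.
```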

+#### Llama 4
+
+We support FP8 and NVFP4 quantized Llama 4 model Hugging Face checkpoint export using the following command:
+
+```bash
+python hf_ptq.py --pyt_ckpt_path=<llama4 model path> --export_path=<quantized hf checkpoint> --qformat=[fp8|nvfp4] --export_fmt=hf
+```
+
+The quantized checkpoint can be deployed following the TensorRT-LLM instructions.
+
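
As a rough sketch of that deployment step, assuming the exported checkpoint directory can be passed directly to the TensorRT-LLM `LLM` API (see the TensorRT-LLM documentation for the exact supported flow and version):

```python
# Sketch only: serve the ModelOpt-exported checkpoint with the TensorRT-LLM LLM API.
# Assumption: the quantized HF checkpoint directory is accepted directly as `model`.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="<quantized hf checkpoint>")  # directory produced by hf_ptq.py --export_path
params = SamplingParams(max_tokens=64)

outputs = llm.generate(["Summarize NVFP4 quantization in one sentence."], params)
print(outputs[0].outputs[0].text)
```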

 #### For NeMo models like [nemotron](https://huggingface.co/nvidia/nemotron-3-8b-base-4k):

 NeMo PTQ requires the NeMo package installed. It's recommended to start directly from a NeMo container such as `nvcr.io/nvidia/nemo:24.07` or the latest `nvcr.io/nvidia/nemo:dev`.