## Latest News

+- [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
- [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
- [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).
- [2025/01/28] Model Optimizer is now open source!

- [Model Optimizer Overview](#model-optimizer-overview)
- [Installation](#installation--docker)
- [Techniques](#techniques)
-  - [Quantization](#quantization)
+  - [Quantization](#quantization-examples-docs)
  - [Quantized Checkpoints](#quantized-checkpoints)
-  - [Pruning](#pruning)
-  - [Distillation](#distillation)
-  - [Speculative Decoding](#speculative-decoding)
-  - [Sparsity](#sparsity)
+  - [Pruning](#pruning-examples-docs)
+  - [Distillation](#distillation-examples-docs)
+  - [Speculative Decoding](#speculative-decoding-examples-docs)
+  - [Sparsity](#sparsity-examples-docs)
- [Examples](#examples)
- [Support Matrix](#model-support-matrix)
- [Benchmark](#benchmark)

@@ -114,7 +115,7 @@ Below is a short description of the techniques supported by Model Optimizer.

### Quantization \[[examples](./examples/README.md#quantization)\] \[[docs](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/1_quantization.html)\]

-Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4, etc., and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported.
+Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including NVFP4, FP8, INT8, INT4, etc., and supports advanced algorithms such as SmoothQuant, AWQ, SVDQuant, and Double Quantization with easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported.

#### Quantized Checkpoints

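The quantization paragraph in the hunk above mentions easy-to-use Python APIs covering both PTQ and QAT. Below is a minimal sketch of what a PTQ call can look like, assuming the `mtq.quantize(model, config, forward_loop)` entry point and predefined configs described in the linked quantization docs; the `model` and `calib_dataloader` names are hypothetical placeholders, and `FP8_DEFAULT_CFG` is just one example config.

```python
import modelopt.torch.quantization as mtq

# Assumed to exist for this sketch: a PyTorch `model` and a small
# `calib_dataloader` yielding representative inputs for calibration.
def forward_loop(model):
    # Run a few calibration batches so activation ranges can be collected.
    for batch in calib_dataloader:
        model(batch)

# Post-training quantization with a predefined format config; other formats
# (e.g. NVFP4, INT8 SmoothQuant, INT4 AWQ) would follow the same pattern
# with a different config object.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```

Under the same assumptions, QAT uses the same entry point: quantize first, then continue fine-tuning the returned model in a regular training loop.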
@@ -158,7 +159,7 @@ Please find the [detailed performance benchmarks](./examples/benchmark.md).

## Roadmap

-Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108).
+Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/146).

## Release Notes
