
Commit 363ee3c

Author: Chenjie Luo
Commit message: Update readme
1 parent 2bca299

1 file changed: +15 −3 lines


README.md

Lines changed: 15 additions & 3 deletions
```diff
@@ -9,14 +9,16 @@
 [![license](https://img.shields.io/badge/License-MIT-blue)](./LICENSE)
 
 [Examples](#examples) |
+[Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer) |
 [Benchmark Results](./benchmark.md) |
-[Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer) |
+[Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108) |
 [ModelOpt-Windows](./windows/README.md)
 
 </div>
 
 ## Latest News
 
+- \[2024/10/23\] Quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: [8B](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8), [70B](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8), [405B](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8)
 - \[2024/9/10\] [Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/)
 - \[2024/8/28\] [Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
 - \[2024/8/28\] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
```
```diff
@@ -40,6 +42,8 @@
 - [Examples](#examples)
 - [Support Matrix](#support-matrix)
 - [Benchmark](#benchmark)
+- [Quantized Checkpoints](#quantized-checkpoints)
+- [Roadmap](#roadmap)
 - [Release Notes](#release-notes)
 
 ## Model Optimizer Overview
```
```diff
@@ -48,10 +52,10 @@ Minimizing inference costs presents a significant challenge as generative AI mod
 The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization), [sparsity](#sparsity), [distillation](#distillation), and [pruning](#pruning) to compress models.
 It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as input and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized quantized checkpoint.
 Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT).
-Further integrations are planned for [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques.
+ModelOpt is integrated with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques.
 For enterprise users, 8-bit quantization with Stable Diffusion is also available on [NVIDIA NIM](https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/).
 
-Model Optimizer is available for free for all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
+Model Optimizer for both Linux and Windows is available for free to all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.
 
 ## Installation / Docker
 
```
```diff
@@ -127,6 +131,14 @@ by using a more powerful model's learned features to guide a student model's obj
 
 Please find the benchmarks [here](./benchmark.md).
 
+## Quantized Checkpoints
+
+[Quantized checkpoints](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4) in the Hugging Face model hub are ready for [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM) and [vLLM](https://github.com/vllm-project/vllm) deployments. More models coming soon.
+
+## Roadmap
+
+Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108).
+
 ## Release Notes
 
 Please see the Model Optimizer Changelog [here](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/0_changelog.html).
```
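As a usage sketch for the pre-quantized checkpoints added in this commit, the hedged example below loads the FP8 Llama-3.1 8B model with vLLM. The explicit `quantization="modelopt"` argument is an assumption; recent vLLM versions may auto-detect the quantization config stored in the checkpoint.

```python
# A minimal sketch of serving an FP8 ModelOpt checkpoint with vLLM.
# Assumes a vLLM build with ModelOpt checkpoint support on FP8-capable GPUs
# (Hopper/Ada); the quantization flag may be optional if auto-detected.
from vllm import LLM, SamplingParams

llm = LLM(model="nvidia/Llama-3.1-8B-Instruct-FP8", quantization="modelopt")
params = SamplingParams(temperature=0.8, max_tokens=64)
outputs = llm.generate(["What does FP8 quantization trade off?"], params)
print(outputs[0].outputs[0].text)
```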
