
Commit 45cb1ab

Add examples for 0.21.0 release
1 parent: 363ee3c


60 files changed: +3272 additions, -1894 deletions

.dockerignore

Lines changed: 71 additions & 0 deletions
@@ -0,0 +1,71 @@
+docker
+**/.git
+llm_ptq/saved_models*
+
+##### Copied from .gitignore #####
+# Byte-compiled / optimized / DLL files
+**/__pycache__
+**.py[cod]
+**$py.class
+
+# C, CPP extensions
+*.so
+*.so.lock
+**.rendered.*.cpp
+**.rendered.*.o
+
+# Distribution / packaging
+build/
+dist/
+*.egg-info/
+
+# Unit test / coverage reports
+htmlcov/
+.coverage
+.coverage.*
+coverage.xml
+.pytest_cache/
+
+# Sphinx documentation
+docs/_build
+docs/build
+docs/source/reference/generated
+
+# Jupyter Notebook
+**/.ipynb_checkpoints
+
+# Environments
+.env
+.venv
+env/
+venv/
+
+# mypy
+**/.mypy_cache
+
+# Vscode
+.vscode/*
+!.vscode/settings.json
+!.vscode/extensions.json
+
+# Mac stuff
+**/.DS_Store
+
+# Ignore experiment checkpoints
+**.pt
+**.pth.tar
+**.pth
+**.pb
+**.onnx
+**.ckpt
+**.safetensors
+**.bin
+**.pkl
+**.tar.gz
+**.nemo
+
+# Ignore temporary files created by tox
+pyproject.toml.bak
+
+# Ignore git clones for tests
+medusa-vicuna-7b-v1.3/

.gitignore

Lines changed: 22 additions & 139 deletions
@@ -1,176 +1,51 @@
 # Byte-compiled / optimized / DLL files
-__pycache__/
-*.py[cod]
-*$py.class
+**/__pycache__
+**.py[cod]
+**$py.class
 
 # C, CPP extensions
 *.so
-.rendered.*.cpp
-.rendered.*.o
-# Exclude the plugin file
-!libfp8convkernel.so
-
+*.so.lock
+**.rendered.*.cpp
+**.rendered.*.o
 
 # Distribution / packaging
-.Python
 build/
-develop-eggs/
 dist/
-downloads/
-eggs/
-.eggs/
-lib/
-lib64/
-parts/
-sdist/
-var/
-wheels/
-share/python-wheels/
 *.egg-info/
-.installed.cfg
-*.egg
-MANIFEST
-
-# PyInstaller
-# Usually these files are written by a python script from a template
-# before PyInstaller builds the exe, so as to inject date/other infos into it.
-*.manifest
-*.spec
-
-# Installer logs
-pip-log.txt
-pip-delete-this-directory.txt
 
 # Unit test / coverage reports
 htmlcov/
-.tox/
-.nox/
 .coverage
 .coverage.*
-.cache
-nosetests.xml
 coverage.xml
-*.cover
-*.py,cover
-.hypothesis/
 .pytest_cache/
-cover/
-
-# Translations
-*.mo
-*.pot
-
-# Django stuff:
-*.log
-local_settings.py
-db.sqlite3
-db.sqlite3-journal
-
-# Flask stuff:
-instance/
-.webassets-cache
-
-# Scrapy stuff:
-.scrapy
 
 # Sphinx documentation
-docs/_build/
-
-# PyBuilder
-.pybuilder/
-target/
+docs/_build
+docs/build
+docs/source/reference/generated
 
 # Jupyter Notebook
-.ipynb_checkpoints
-
-# IPython
-profile_default/
-ipython_config.py
-
-# pyenv
-# For a library or package, you might want to ignore these files since the code is
-# intended to run in multiple environments; otherwise, check them in:
-# .python-version
-
-# pipenv
-# According to pypa/pipenv#598, it is recommended to include Pipfile.lock in version control.
-# However, in case of collaboration, if having platform-specific dependencies or dependencies
-# having no cross-platform support, pipenv may install dependencies that don't work, or not
-# install all needed dependencies.
-#Pipfile.lock
-
-# poetry
-# Similar to Pipfile.lock, it is generally recommended to include poetry.lock in version control.
-# This is especially recommended for binary packages to ensure reproducibility, and is more
-# commonly ignored for libraries.
-# https://python-poetry.org/docs/basic-usage/#commit-your-poetrylock-file-to-version-control
-#poetry.lock
-
-# pdm
-# Similar to Pipfile.lock, it is generally recommended to include pdm.lock in version control.
-#pdm.lock
-# pdm stores project-wide configurations in .pdm.toml, but it is recommended to not include it
-# in version control.
-# https://pdm.fming.dev/#use-with-ide
-.pdm.toml
-
-# PEP 582; used by e.g. github.com/David-OConnor/pyflow and github.com/pdm-project/pdm
-__pypackages__/
-
-# Celery stuff
-celerybeat-schedule
-celerybeat.pid
-
-# SageMath parsed files
-*.sage.py
+**/.ipynb_checkpoints
 
 # Environments
 .env
 .venv
 env/
 venv/
-ENV/
-env.bak/
-venv.bak/
-
-# Spyder project settings
-.spyderproject
-.spyproject
-
-# Rope project settings
-.ropeproject
-
-# mkdocs documentation
-/site
 
-# mypy
-.mypy_cache/
-.dmypy.json
-dmypy.json
-
-# Pyre type checker
-.pyre/
-
-# pytype static type analyzer
-.pytype/
-
-# Cython debug symbols
-cython_debug/
-
-# PyCharm
-# JetBrains specific template is maintained in a separate JetBrains.gitignore that can
-# be found at https://github.com/github/gitignore/blob/main/Global/JetBrains.gitignore
-# and can be added to the global gitignore or merged into this file. For a more nuclear
-# option (not recommended) you can uncomment the following to ignore the entire idea folder.
-#.idea/
+# Linters
+**/.mypy_cache
+**/.ruff_cache
 
 # Vscode
 .vscode/*
 !.vscode/settings.json
 !.vscode/extensions.json
 
 # Mac stuff
-.DS_Store
+**/.DS_Store
 
 # Ignore experiment checkpoints
 **.pt
@@ -182,3 +57,11 @@ cython_debug/
 **.safetensors
 **.bin
 **.pkl
+**.tar.gz
+**.nemo
+
+# Ignore temporary files created by tox
+pyproject.toml.bak
+
+# Ignore git clones for tests
+medusa-vicuna-7b-v1.3/

README.md

Lines changed: 23 additions & 15 deletions
@@ -9,8 +9,8 @@
 [![license](https://img.shields.io/badge/License-MIT-blue)](./LICENSE)
 
 [Examples](#examples) |
-[Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer) |
-[Benchmark Results](./benchmark.md) |
+[Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer) |
+[Benchmark Results](#benchmark) |
 [Roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues/108) |
 [ModelOpt-Windows](./windows/README.md)
 
@@ -33,23 +33,24 @@
 ## Table of Contents
 
 - [Model Optimizer Overview](#model-optimizer-overview)
-- [Installation](#installation)
+- [Installation](#installation--docker)
 - [Techniques](#techniques)
   - [Quantization](#quantization)
-  - [Sparsity](#sparsity)
   - [Distillation](#distillation)
   - [Pruning](#pruning)
+  - [Sparsity](#sparsity)
 - [Examples](#examples)
-- [Support Matrix](#support-matrix)
+- [Support Matrix](#model-support-matrix)
 - [Benchmark](#benchmark)
 - [Quantized Checkpoints](#quantized-checkpoints)
 - [Roadmap](#roadmap)
 - [Release Notes](#release-notes)
+- [Contributing](#contributing)
 
 ## Model Optimizer Overview
 
 Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size.
-The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization), [sparsity](#sparsity), [distillation](#distillation), and [pruning](#pruning) to compress models.
+The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization), [distillation](#distillation), [pruning](#pruning), and [sparsity](#sparsity) to compress models.
 It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as inputs and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized quantized checkpoint.
 Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT).
 ModelOpt is integrated with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques.
@@ -72,7 +73,7 @@ cd TensorRT-Model-Optimizer
 
 # Build the docker (will be tagged `docker.io/library/modelopt_examples:latest`)
 # You may customize `docker/Dockerfile` to include or exclude certain dependencies you may or may not need.
-bash docker/build.sh
+./docker/build.sh
 
 # Run the docker image
 docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash
@@ -91,18 +92,18 @@ NOTE: Unless specified otherwise, all example READMEs assume they are using the
 
 Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including FP8, INT8, INT4, etc and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both Post-training quantization (PTQ) and Quantization-aware training (QAT) are supported.
 
-### Sparsity
+### Distillation
 
-Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate the inference. Model Optimizer Python APIs to apply weight sparsity to a given model. It also supports [NVIDIA 2:4 sparsity pattern](https://arxiv.org/pdf/2104.08378) and various sparsification methods, such as [NVIDIA ASP](https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) and [SparseGPT](https://arxiv.org/abs/2301.00774).
+Knowledge Distillation allows for increasing the accuracy and/or convergence speed of a desired model architecture
+by using a more powerful model's learned features to guide a student model's objective function into imitating it.
 
 ### Pruning
 
 Pruning is a technique to reduce the model size and accelerate the inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, and Transformer attention heads, MLP, embedding hidden size and number of layers (depth).
 
-### Distillation
+### Sparsity
 
-Knowledge Distillation allows for increasing the accuracy and/or convergence speed of a desired model architecture
-by using a more powerful model's learned features to guide a student model's objective function into imitating it.
+Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate the inference. Model Optimizer Python APIs to apply weight sparsity to a given model. It also supports [NVIDIA 2:4 sparsity pattern](https://arxiv.org/pdf/2104.08378) and various sparsification methods, such as [NVIDIA ASP](https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) and [SparseGPT](https://arxiv.org/abs/2301.00774).
 
 ## Examples
 
@@ -121,15 +122,18 @@ by using a more powerful model's learned features to guide a student model's obj
 - [ONNX PTQ](./onnx_ptq/README.md) shows how to quantize the ONNX models in INT4 or INT8 quantization mode. The examples also include the deployment of quantized ONNX models using TensorRT.
 - [Distillation for LLMs](./llm_distill/README.md) demonstrates how to use Knowledge Distillation, which can increasing the accuracy and/or convergence speed for finetuning / QAT.
 - [Chained Optimizations](./chained_optimizations/README.md) shows how to chain multiple optimizations together (e.g. Pruning + Distillation + Quantization).
+- [Model Hub](./model_hub/) provides an example to deploy and run quantized Llama 3.1 8B instruct model from Nvidia's Hugging Face model hub on both TensorRT-LLM and vLLM.
 
-## Support Matrix
+## Model Support Matrix
 
 - For LLM quantization, please refer to this [support matrix](./llm_ptq/README.md#model-support-list).
-- For Diffusion, the Model Optimizer supports [Stable Diffusion 1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5), [Stable Diffusion XL](https://huggingface.co/papers/2307.01952), and [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo).
+- For VLM quantization, please refer to this [support matrix](./vlm_ptq/README.md#model-support-list).
+- For Diffusion, Model Optimizer supports [FLUX](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Stable Diffusion XL](https://huggingface.co/papers/2307.01952), [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo), and [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1).
+- For speculative decoding, please refer to this [support matrix](./speculative_decoding/README.md#model-support-list).
 
 ## Benchmark
 
-Please find the benchmarks [here](./benchmark.md).
+Please find the benchmarks at [here](./benchmark.md).
 
 ## Quantized Checkpoints
 
@@ -142,3 +146,7 @@ Please see our [product roadmap](https://github.com/NVIDIA/TensorRT-Model-Optimi
 ## Release Notes
 
 Please see Model Optimizer Changelog [here](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/0_changelog.html).
+
+## Contributing
+
+At the moment, we are not accepting external contributions. However, this will soon change after we open source our library in early 2025 with a focus on extensibility. We welcome any feedback and feature requests. Please open an issue if you have any suggestions or questions.
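
The quantization workflow these 0.21.0 examples build on centers on ModelOpt's `mtq.quantize` API referenced in the README above. A minimal sketch of that call pattern, assuming the `nvidia-modelopt` package is installed; the toy model and random calibration batches are illustrative placeholders, not code from this commit:

```python
# Minimal PTQ sketch of the mtq.quantize call pattern. The toy model and
# random calibration data below are illustrative placeholders only.
import torch
import torch.nn as nn
import modelopt.torch.quantization as mtq

model = nn.Sequential(nn.Linear(64, 64), nn.ReLU(), nn.Linear(64, 8))

# ModelOpt ships preset configs (e.g. mtq.INT8_DEFAULT_CFG, mtq.FP8_DEFAULT_CFG,
# mtq.INT4_AWQ_CFG); choose one matching the target deployment format.
config = mtq.INT8_DEFAULT_CFG

def forward_loop(m):
    # Feed a few batches so ModelOpt can calibrate activation ranges.
    for _ in range(8):
        m(torch.randn(4, 64))

# Inserts quantizers into supported layers and calibrates them in place.
model = mtq.quantize(model, config, forward_loop)
```

The same pattern scales up: swap in a real model and calibration dataloader, then export the calibrated model for deployment, as the `llm_ptq` examples in this release walk through end to end.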

benchmark.md

Lines changed: 1 addition & 1 deletion
@@ -6,7 +6,7 @@ performance** that can be delivered by Model Optimizer. All performance numbers
 
 ### 1. Post-training quantization (PTQ) for LLMs
 
-#### 1.1 Performanace
+#### 1.1 Performance
 
 Config: H100, nvidia-modelopt v0.15.0, TensorRT-LLM v0.11, latency measured with full batch inference (no inflight batching).
 Memory saving and inference speedup are compared to the FP16 baseline. Speedup is normalized to the GPU count.
