Minimizing inference costs presents a significant challenge as generative AI models continue to grow in complexity and size.
The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or **ModelOpt**) is a library comprising state-of-the-art model optimization techniques including [quantization](#quantization), [distillation](#distillation), [pruning](#pruning), and [sparsity](#sparsity) to compress models.
It accepts a torch or [ONNX](https://github.com/onnx/onnx) model as input and provides Python APIs for users to easily stack different model optimization techniques to produce an optimized, quantized checkpoint.
Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization) or [TensorRT](https://github.com/NVIDIA/TensorRT).
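As a rough illustration of that hand-off, here is a minimal sketch only: the export helper's arguments, the `decoder_type` value, and the output directory are assumptions, and `quantized_model` stands for a model already quantized with the API shown under Quantization below.

```python
import torch
from modelopt.torch.export import export_tensorrt_llm_checkpoint

# `quantized_model` is assumed to be a torch LLM already calibrated and quantized
# with modelopt.torch.quantization (see the Quantization section below).
export_tensorrt_llm_checkpoint(
    quantized_model,
    decoder_type="llama",         # illustrative: the model's architecture family
    dtype=torch.float16,          # precision used for the non-quantized weights
    export_dir="exported_ckpt",   # hypothetical output dir, consumed by TensorRT-LLM
    inference_tensor_parallel=1,  # illustrative parallelism setting
)
```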
ModelOpt is integrated with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo) and [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) for training-in-the-loop optimization techniques.
```bash
cd TensorRT-Model-Optimizer

# Build the docker (will be tagged `docker.io/library/modelopt_examples:latest`)
# You may customize `docker/Dockerfile` to include or exclude dependencies as needed.
./docker/build.sh

# Run the docker image
docker run --gpus all -it --shm-size 20g --rm docker.io/library/modelopt_examples:latest bash
```
### Quantization
Quantization is an effective model optimization technique for large models. Quantization with Model Optimizer can compress model size by 2x-4x, speeding up inference while preserving model quality. Model Optimizer enables highly performant quantization formats including FP8, INT8, and INT4, and supports advanced algorithms such as SmoothQuant, AWQ, and Double Quantization with easy-to-use Python APIs. Both post-training quantization (PTQ) and quantization-aware training (QAT) are supported.
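For illustration, a minimal PTQ sketch with the `modelopt.torch.quantization` API; the toy model, random calibration data, and choice of config are placeholders:

```python
import torch
import modelopt.torch.quantization as mtq

# Toy model and calibration data, just to keep the sketch self-contained.
model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 128)
)
calib_data = [torch.randn(8, 128) for _ in range(16)]

# Pick one of the built-in quantization configs (INT8 shown here; FP8, INT4 AWQ,
# etc. are selected the same way).
config = mtq.INT8_DEFAULT_CFG

def forward_loop(model):
    # Run calibration batches through the model so the inserted quantizers can
    # collect activation statistics.
    for batch in calib_data:
        model(batch)

# PTQ: insert quantizers, calibrate, and return the quantized model. For QAT,
# continue fine-tuning the returned model with your usual training loop.
model = mtq.quantize(model, config, forward_loop)
```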
### Distillation
Knowledge Distillation allows for increasing the accuracy and/or convergence speed of a desired model architecture by using a more powerful model's learned features to guide a student model's objective function toward imitating them.
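To make the idea concrete, here is a minimal, generic PyTorch sketch of a distillation loss (illustrative only; it is not the Model Optimizer distillation API):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, temperature=2.0, alpha=0.5):
    # Soft targets: push the student toward the teacher's temperature-softened distribution.
    soft = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=-1),
        F.softmax(teacher_logits / temperature, dim=-1),
        reduction="batchmean",
    ) * (temperature ** 2)
    # Hard targets: the usual cross-entropy against ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard

# Toy teacher/student pair to keep the sketch self-contained.
teacher = torch.nn.Linear(32, 10).eval()
student = torch.nn.Linear(32, 10)
optimizer = torch.optim.AdamW(student.parameters(), lr=1e-3)

x = torch.randn(16, 32)
labels = torch.randint(0, 10, (16,))
with torch.no_grad():
    teacher_logits = teacher(x)

optimizer.zero_grad()
loss = distillation_loss(student(x), teacher_logits, labels)
loss.backward()
optimizer.step()
```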
### Pruning
Pruning is a technique to reduce model size and accelerate inference by removing unnecessary weights. Model Optimizer provides Python APIs to prune Linear and Conv layers, as well as Transformer attention heads, MLP, embedding hidden size, and number of layers (depth).
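As a concrete illustration of structured pruning (using PyTorch's built-in `torch.nn.utils.prune` utilities rather than the Model Optimizer pruning API):

```python
import torch
import torch.nn.utils.prune as prune

# Toy network: zero out 50% of the Conv output channels and 30% of the Linear
# rows (neurons) by L1 norm, i.e. whole structures rather than individual weights.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, kernel_size=3, padding=1),
    torch.nn.ReLU(),
    torch.nn.AdaptiveAvgPool2d(1),
    torch.nn.Flatten(),
    torch.nn.Linear(16, 10),
)

prune.ln_structured(model[0], name="weight", amount=0.5, n=1, dim=0)  # output channels
prune.ln_structured(model[4], name="weight", amount=0.3, n=1, dim=0)  # output neurons

# Fold the pruning masks into the weights to make the sparsity permanent.
prune.remove(model[0], "weight")
prune.remove(model[4], "weight")

pruned_channels = (model[0].weight.abs().sum(dim=(1, 2, 3)) == 0).sum().item()
print(f"{pruned_channels} of 16 conv channels zeroed out")
```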
### Sparsity
Sparsity is a technique to further reduce the memory footprint of deep learning models and accelerate inference. Model Optimizer provides Python APIs to apply weight sparsity to a given model. It supports the [NVIDIA 2:4 sparsity pattern](https://arxiv.org/pdf/2104.08378) and various sparsification methods, such as [NVIDIA ASP](https://github.com/NVIDIA/apex/tree/master/apex/contrib/sparsity) and [SparseGPT](https://arxiv.org/abs/2301.00774).
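For intuition, a minimal sketch of magnitude-based 2:4 sparsity (keep the two largest-magnitude weights in every group of four); this illustrates the pattern only and is not the Model Optimizer sparsity API:

```python
import torch

def apply_2_4_sparsity(weight: torch.Tensor) -> torch.Tensor:
    """Zero the 2 smallest-magnitude values in every contiguous group of 4 along the last dim."""
    out_features, in_features = weight.shape
    assert in_features % 4 == 0, "2:4 sparsity needs the inner dimension to be a multiple of 4"
    groups = weight.reshape(out_features, in_features // 4, 4)
    # Indices of the 2 largest-magnitude entries per group of 4.
    keep = groups.abs().topk(2, dim=-1).indices
    mask = torch.zeros_like(groups).scatter_(-1, keep, 1.0)
    return (groups * mask).reshape(out_features, in_features)

linear = torch.nn.Linear(64, 32)
with torch.no_grad():
    linear.weight.copy_(apply_2_4_sparsity(linear.weight))

# Every weight row is now 50% zero in the hardware-friendly 2:4 pattern.
print(f"zero fraction: {(linear.weight == 0).float().mean().item():.2f}")
```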
## Examples
- [ONNX PTQ](./onnx_ptq/README.md) shows how to quantize ONNX models in INT4 or INT8 mode. The examples also include the deployment of quantized ONNX models using TensorRT.
- [Distillation for LLMs](./llm_distill/README.md) demonstrates how to use Knowledge Distillation, which can increase accuracy and/or convergence speed for fine-tuning / QAT.
- [Chained Optimizations](./chained_optimizations/README.md) shows how to chain multiple optimizations together (e.g., Pruning + Distillation + Quantization).
- [Model Hub](./model_hub/) provides an example of deploying and running the quantized Llama 3.1 8B Instruct model from NVIDIA's Hugging Face model hub on both TensorRT-LLM and vLLM.
## Model Support Matrix
- For LLM quantization, please refer to this [support matrix](./llm_ptq/README.md#model-support-list).
- For VLM quantization, please refer to this [support matrix](./vlm_ptq/README.md#model-support-list).
- For diffusion models, Model Optimizer supports [FLUX](https://huggingface.co/black-forest-labs/FLUX.1-dev), [Stable Diffusion 3](https://huggingface.co/stabilityai/stable-diffusion-3-medium), [Stable Diffusion XL](https://huggingface.co/papers/2307.01952), [SDXL-Turbo](https://huggingface.co/stabilityai/sdxl-turbo), and [Stable Diffusion 2.1](https://huggingface.co/stabilityai/stable-diffusion-2-1).
- For speculative decoding, please refer to this [support matrix](./speculative_decoding/README.md#model-support-list).
## Benchmark
Please find the benchmarks [here](./benchmark.md).
## Quantized Checkpoints
## Release Notes
Please see the Model Optimizer changelog [here](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/0_changelog.html).
## Contributing
At the moment, we are not accepting external contributions. However, this will soon change after we open source our library in early 2025 with a focus on extensibility. We welcome any feedback and feature requests. Please open an issue if you have any suggestions or questions.