
Commit 9affa87

Update examples for 0.19.0-windows release
1 parent f713839 commit 9affa87

File tree: 12 files changed, +2485 −1 lines changed

README.md

Lines changed: 2 additions & 1 deletion
@@ -10,7 +10,8 @@
 [Examples](#examples) |
 [Benchmark Results](./benchmark.md) |
-[Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer)
+[Documentation](https://nvidia.github.io/TensorRT-Model-Optimizer) |
+[ModelOpt-Windows](./windows/README.md)

 </div>

windows/Benchmark.md

Lines changed: 41 additions & 0 deletions
@@ -0,0 +1,41 @@
# TensorRT Model Optimizer - Windows: Benchmark Reference

This document provides a summary of the performance and accuracy measurements of [TensorRT Model Optimizer - Windows](https://github.com/NVIDIA/TensorRT-Model-Optimizer) for several popular models. The benchmark results in the following tables serve as reference points and **should not be viewed as the maximum performance** achievable by Model Optimizer - Windows.

### 1 Performance and Accuracy Comparison: ONNX INT4 vs ONNX FP16 Models

#### 1.1 Performance Comparison

All performance metrics are measured using the [onnxruntime-genai perf benchmark](https://github.com/microsoft/onnxruntime-genai/tree/main/benchmark/python) with the DirectML backend.

- **Configuration**: Windows OS, GPU RTX 4090, NVIDIA Model Optimizer v0.19.0
- **Batch Size**: 1

Memory savings and inference speedup are relative to the ONNX FP16 baseline; an illustrative timing sketch follows the table below.

| **Model** | **Input Prompt Length** | **Output Tokens Length** | **GPU Memory Saving** | **Generation Phase Inference Speedup** |
|:------------------------|:------------------------|:-------------------------|:----------------------|:-----------------------------------------|
| [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | 128 | 256 | 2.44x | 2.68x |
| [Phi3.5-mini-Instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) | 128 | 256 | 2.53x | 2.51x |
| [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 128 | 256 | 2.88x | 3.41x |
| [Llama3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 128 | 256 | 1.96x | 2.19x |
| [Gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | 128 | 256 | 1.64x | 1.94x |
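
The official measurements above come from the onnxruntime-genai perf benchmark linked earlier. As a rough illustration of what the generation-phase metric captures, the sketch below times token generation for a single prompt with onnxruntime-genai on DirectML; it assumes the 0.4.x Python API and a placeholder model folder, and is not the benchmark script itself.

```python
# Illustrative timing of the generation phase for one prompt with
# onnxruntime-genai (DirectML). Not the official perf benchmark; API shape
# assumed from onnxruntime-genai 0.4.x samples, model folder is a placeholder.
import time

import onnxruntime_genai as og

model_dir = r".\llama3.1-8b-instruct-int4-onnx-directml"  # placeholder
prompt = "Summarize the benefits of INT4 quantization in one paragraph."

model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)
input_tokens = tokenizer.encode(prompt)

params = og.GeneratorParams(model)
params.set_search_options(max_length=len(input_tokens) + 256, do_sample=False)
params.input_ids = input_tokens  # 0.4.x-style input binding

generator = og.Generator(model, params)

# The first step processes the prompt (prefill); time the remaining steps as a
# rough proxy for the generation phase.
generator.compute_logits()
generator.generate_next_token()

start = time.perf_counter()
generated = 0
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    generated += 1
elapsed = time.perf_counter() - start

print(f"{generated} tokens in {elapsed:.2f} s ({generated / elapsed:.1f} tokens/s, generation phase)")
```
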
#### 1.2 Accuracy Comparison

For accuracy evaluation, the [Massive Multitask Language Understanding (MMLU)](https://arxiv.org/abs/2009.03300) benchmark has been used. Please refer to the [detailed instructions](./accuracy_benchmark/README.md) for running the MMLU accuracy benchmark.

The table below shows the MMLU 5-shot score for some models.

- **FP16 ONNX model**: Generated using the [GenAI Model Builder](https://github.com/microsoft/onnxruntime-genai/blob/main/src/python/py/models/README.md)
- **INT4 AWQ model**: Generated by quantizing the FP16 ONNX model using ModelOpt-Windows
- **Configuration**: Windows OS, GPU RTX 4090, nvidia-modelopt v0.19.0

| **Model** | **ONNX FP16** | **ONNX INT4** |
|:------------------------------|:---------------|:--------------|
| [Llama3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) | 68.45 | 66.1 |
| [Phi3.5-mini-Instruct](https://huggingface.co/microsoft/Phi-3.5-mini-instruct) | 68.9 | 65.7 |
| [Mistral-7B-Instruct-v0.3](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.3) | 61.76 | 60.73 |
| [Llama3.2-3B-Instruct](https://huggingface.co/meta-llama/Llama-3.2-3B-Instruct) | 60.8 | 57.71 |
| [Gemma-2b-it](https://huggingface.co/google/gemma-2b-it) | 37.01 | 37.2 |

windows/README.md

Lines changed: 93 additions & 0 deletions
@@ -0,0 +1,93 @@
<div align="center">

# NVIDIA TensorRT Model Optimizer - Windows

#### A Library to Quantize and Compress Deep Learning Models for Optimized Inference on Native Windows RTX GPUs

[![Documentation](https://img.shields.io/badge/Documentation-latest-brightgreen.svg?style=flat)](https://nvidia.github.io/TensorRT-Model-Optimizer/)
[![version](https://img.shields.io/pypi/v/nvidia-modelopt?label=Release)](https://pypi.org/project/nvidia-modelopt/)
[![license](https://img.shields.io/badge/License-MIT-blue)](./LICENSE)

[Examples](#examples) |
[Benchmark Results](#benchmark-results)

</div>

## Latest News

- \[2024/11/18\] [Quantized INT4 ONNX models available on Hugging Face for download](https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus-67373fe7c006ebc1df310613)

## Table of Contents

- [Overview](#overview)
- [Installation](#installation)
- [Techniques](#techniques)
  - [Quantization](#quantization)
- [Examples](#examples)
- [Support Matrix](#support-matrix)
- [Benchmark Results](#benchmark-results)
- [Collection of Optimized ONNX Models](#collection-of-optimized-onnx-models)
- [Release Notes](#release-notes)

## Overview

The **TensorRT Model Optimizer - Windows** (**ModelOpt-Windows**) is engineered to deliver advanced model compression techniques, including quantization, to Windows RTX PC systems. Specifically tailored to meet the needs of Windows users, ModelOpt-Windows is optimized for rapid and efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times.

The primary objective of ModelOpt-Windows is to generate optimized, standards-compliant ONNX-format models for DirectML backends. This makes it an ideal solution for seamless integration with ONNX Runtime (ORT) and DirectML (DML) frameworks, ensuring broad compatibility with any inference framework supporting the ONNX standard. Furthermore, ModelOpt-Windows integrates smoothly with the Windows ecosystem, with full support for tools and SDKs such as Olive and ONNX Runtime, enabling deployment of quantized models across various independent hardware vendors (IHVs) through the DML and TensorRT paths.

Model Optimizer is available free of charge to all developers on [NVIDIA PyPI](https://pypi.org/project/nvidia-modelopt/). This repository is for sharing examples and GPU-optimized recipes as well as collecting feedback from the community.

## Installation

ModelOpt-Windows can be installed either as a standalone toolkit or through Microsoft's Olive.

### Standalone Toolkit Installation (with CUDA 12.x)

To install ModelOpt-Windows as a standalone toolkit with CUDA 12.x support, run the following commands:

```bash
pip install "nvidia-modelopt[onnx]~=0.19.0" --extra-index-url https://pypi.nvidia.com
pip install cupy-cuda12x
```
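
To verify the installation, a quick import check such as the optional sketch below confirms that the ModelOpt package and the CuPy CUDA runtime are visible to Python (assuming a CUDA-capable GPU is present):

```python
# Optional post-install sanity check: confirm that ModelOpt and CuPy import
# correctly and that a CUDA device is visible for local GPU calibration.
import modelopt
import cupy

print("nvidia-modelopt version:", modelopt.__version__)
print("CUDA devices visible to CuPy:", cupy.cuda.runtime.getDeviceCount())
```
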
### Installation with Olive

To install ModelOpt-Windows through Microsoft's Olive, use the following commands:

```bash
pip install olive-ai[nvmo]
pip install "onnxruntime-genai-directml>=0.4.0"
pip install onnxruntime-directml==1.20.0
```

For more details, please refer to the [detailed installation instructions](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/2_installation.html).

## Techniques

### Quantization

Quantization is an effective model optimization technique for large models. Quantization with ModelOpt-Windows can compress model size by 2x-4x, speeding up inference while preserving model quality. ModelOpt-Windows enables highly performant quantization formats, including INT4, FP8\*, and INT8\*, and supports advanced algorithms such as AWQ and SmoothQuant\*, focusing on post-training quantization (PTQ) for ONNX and PyTorch\* models with DirectML and TensorRT\* inference backends.

For more details, please refer to the [detailed quantization guide](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/windows_guides/_ONNX_PTQ_guide.html).
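
As a rough illustration of the ONNX PTQ flow, the sketch below applies INT4 AWQ quantization to an FP16 ONNX model. The function and argument names are assumptions based on the quantization guide linked above, and the paths are placeholders, so consult the guide for the exact API before use.

```python
# Sketch of ONNX INT4 AWQ post-training quantization with ModelOpt-Windows.
# Function and argument names are assumptions drawn from the quantization
# guide; paths are placeholders.
import onnx
from modelopt.onnx.quantization.int4 import quantize as quantize_int4

fp16_onnx_path = r".\models\llama3.1-8b-instruct-fp16\model.onnx"  # placeholder

quantized_model = quantize_int4(
    fp16_onnx_path,
    calibration_method="awq_lite",     # AWQ-based weight-only INT4 quantization
    calibration_data_reader=None,      # or a data reader built from tokenized calibration prompts
)

onnx.save_model(
    quantized_model,
    r".\models\llama3.1-8b-instruct-int4\model.onnx",
    save_as_external_data=True,        # LLM weights typically exceed the 2 GB protobuf limit
)
```
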
## Examples

- [PTQ for LLMs](./onnx_ptq/README.md) covers how to apply post-training quantization (PTQ) and deploy the quantized models with DirectML.
- [MMLU Benchmark](./accuracy_benchmark/README.md) provides an example MMLU benchmark script and demonstrates how to run it with popular backends such as DirectML and TensorRT-LLM\*, and with model formats such as ONNX and PyTorch\*.

## Support Matrix

Please refer to the [feature support matrix](https://nvidia.github.io/TensorRT-Model-Optimizer/getting_started/windows/_feature_support_matrix.html) for a full list of supported features.

## Benchmark Results

Please refer to the [benchmark results](./Benchmark.md) for performance and accuracy comparisons of popular Large Language Models (LLMs).

## Collection of Optimized ONNX Models

Ready-to-deploy optimized ONNX models from ModelOpt-Windows are available in the [Hugging Face NVIDIA collections](https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus-67373fe7c006ebc1df310613). These models can be deployed using the DirectML backend. Follow the instructions provided along with the published models for deployment.
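
As a minimal illustration, the sketch below runs one of the downloaded INT4 ONNX models with the onnxruntime-genai DirectML package. The API shape is assumed from onnxruntime-genai 0.4.x samples and the local folder name is a placeholder, so adapt it to the model you download.

```python
# Minimal generation loop for a downloaded INT4 ONNX model using
# onnxruntime-genai (DirectML build). API shape assumed from the 0.4.x
# samples; the model folder is a placeholder for a Hugging Face download.
import onnxruntime_genai as og

model_dir = r".\llama3.1-8b-instruct-int4-onnx-directml"  # placeholder folder
model = og.Model(model_dir)
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

prompt = "What is post-training quantization?"
params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode(prompt)  # 0.4.x-style input binding

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```
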
## Release Notes

Please refer to the [changelog](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/0_changelog.html) for release notes.

\* *Experimental support*

windows/accuracy_benchmark/README.md

Lines changed: 187 additions & 0 deletions

@@ -0,0 +1,187 @@
## Table of Contents

- [Overview](#overview)
- [Prerequisites](#prerequisites)
- [Accuracy Benchmarks](#accuracy-benchmarks)
  - [MMLU (Massive Multitask Language Understanding)](#mmlu-massive-multitask-language-understanding)
    - [MMLU Setup](#mmlu-setup)
    - [Evaluation Methods](#evaluation-methods)
      - [1. Evaluate with ORT-DML using GenAI APIs](#1-evaluate-with-ort-dml-using-genai-apis)
      - [2. Evaluate with ORT-DML, CUDA, and CPU using Native ORT Path](#2-evaluate-with-ort-dml-cuda-and-cpu-using-native-ort-path)
      - [3. Evaluate the PyTorch Model of HF Weights](#3-evaluate-the-pytorch-model-of-hf-weights)
      - [4. Evaluate the TensorRT-LLM](#4-evaluate-the-tensorrt-llm)
      - [5. Evaluate the PyTorch Model Quantized with AutoAWQ](#5-evaluate-the-pytorch-model-quantized-with-autoawq)

## Overview

This repository provides scripts, popular third-party benchmarks, and instructions for evaluating the accuracy of Large Language Models (LLMs). It demonstrates how to use a ModelOpt-quantized LLM with various established benchmarks, including deployment options using DirectML and TensorRT-LLM in a Windows environment.

## Prerequisites

| **Category** | **Details** |
|:--------------------------|:--------------------------------------------------------------------------------------------------------------|
| **Operating System** | Windows 10 or later |
| **Python** | - For ORT-DML GenAI, use Python 3.11. <br> - For TensorRT-LLM, use Python 3.10. <br> - All other backends are compatible with both Python 3.10 and 3.11. |
| **Package Manager** | pip |
| **Compatible Hardware and Drivers** | Ensure the necessary hardware (e.g., a CUDA-compatible GPU) and drivers are installed for the chosen evaluation method: <br> - DirectML for DirectML-based evaluation <br> - CUDA for TensorRT-LLM |
| **Additional Tools** | - **PowerShell**: Recommended for running the provided commands (the examples use PowerShell syntax). <br> - **Tar utility**: Included in Windows 10 and later. <br> - **curl**: Included in Windows 10 and later. |

# Accuracy Benchmarks

## MMLU (Massive Multitask Language Understanding)

The MMLU benchmark assesses LLM performance across a wide range of tasks, producing a score between 0 and 1, where a higher score indicates better accuracy. Please refer to the [MMLU paper](https://arxiv.org/abs/2009.03300) for more details.

### MMLU Setup

The table below lists the setup steps to prepare your environment for evaluating LLMs using the MMLU benchmark; a short data-check sketch follows the table.

| **Step** | **Command or Description** |
|----------------------------------------|-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|
| **Open PowerShell as Administrator** | - |
| **Create and Activate a Virtual Environment** <br> _(Optional but Recommended)_ | `python -m venv llm_env` <br> `.\llm_env\Scripts\Activate.ps1` |
| **Install PyTorch and Related Packages** | `pip install torch==2.4.0 torchvision==0.19.0 torchaudio==2.4.0 --index-url https://download.pytorch.org/whl/cu124` |
| **Install ONNX Runtime Packages** | `pip install onnxruntime-directml==1.20.0` <br> `pip install onnxruntime-genai-directml==0.4.0` |
| **Install Benchmark Requirements** | `pip install -r requirements.txt` |
| **Download MMLU Data** | `mkdir data` <br> `curl.exe -o .\data\mmlu.tar https://people.eecs.berkeley.edu/~hendrycks/data.tar` <br> `tar -xf .\data\mmlu.tar -C .\data` <br> `Move-Item .\data\data .\data\mmlu` |
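
To confirm the data landed where the benchmark expects it, the short sketch below, which is illustrative only and not part of `mmlu_benchmark.py`, loads one subject's dev and test CSVs and assembles a 5-shot prompt in the usual MMLU style:

```python
# Illustrative check that the MMLU data is in place, plus a sketch of how a
# 5-shot prompt is typically assembled. This is not part of mmlu_benchmark.py.
import pandas as pd

subject = "abstract_algebra"
dev = pd.read_csv(f"data/mmlu/dev/{subject}_dev.csv", header=None)
test = pd.read_csv(f"data/mmlu/test/{subject}_test.csv", header=None)
choices = ["A", "B", "C", "D"]

def format_example(row, include_answer=True):
    # Columns: question, four answer options, correct letter.
    text = row[0]
    for i, letter in enumerate(choices):
        text += f"\n{letter}. {row[i + 1]}"
    text += "\nAnswer:"
    if include_answer:
        text += f" {row[5]}\n\n"
    return text

# 5-shot prompt: five worked dev examples followed by one unanswered test question.
prompt = f"The following are multiple choice questions (with answers) about {subject.replace('_', ' ')}.\n\n"
prompt += "".join(format_example(dev.iloc[i]) for i in range(5))
prompt += format_example(test.iloc[0], include_answer=False)
print(prompt)
```
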
### Evaluation Methods

Once the MMLU benchmark is set up, you can use the `mmlu_benchmark.py` script to evaluate LLMs deployed with various backends. Please refer to the examples below.

<details>
<summary>MMLU Benchmark with GenAI APIs for ORT-DML Deployment</summary>
<br>

To run the model with ORT-DML using GenAI, use the `--ep genai_dml` argument.

- **Test Suite**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep genai_dml `
      --output_file <output_log_file.json> `
      --ntrain 5
  ```

- **Specific Subjects**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep genai_dml `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5
  ```

</details>

<details>
<summary>MMLU Benchmark with ONNX Runtime APIs for DML, CUDA, or CPU Deployment</summary>
<br>

To run the model with the ORT-DML, ORT-CUDA, or ORT-CPU execution providers, use `--ep ort_dml`, `--ep ort_cuda`, or `--ep ort_cpu`, respectively.

- **Test Suite**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep ort_dml `
      --output_file <output_log_file.json> `
      --ntrain 5
  ```

- **Specific Subjects**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <ONNX_model_folder> `
      --ep ort_dml `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5
  ```

</details>

<details>
<summary>MMLU Benchmark with Transformer APIs for PyTorch Hugging Face Models</summary>
<br>

To evaluate a PyTorch Hugging Face (HF) model, use the `--ep pt` argument and point `--model_path` at the HF checkpoint folder; a note on choosing `--dtype` follows this section.

- **Test Suite**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <HF_model_folder> `
      --ep pt `
      --output_file <output_log_file.json> `
      --ntrain 5 `
      --dtype <torch_dtype in model's config.json {float16|bfloat16}>
  ```

- **Specific Subjects**

  ```powershell
  python mmlu_benchmark.py `
      --model_name causal `
      --model_path <HF_model_folder> `
      --ep pt `
      --output_file <output_log_file.json> `
      --subject abstract_algebra,anatomy,college_mathematics `
      --ntrain 5 `
      --dtype <torch_dtype in model's config.json {float16|bfloat16}>
  ```

</details>
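
The `--dtype` value should match the `torch_dtype` recorded in the checkpoint's `config.json`. As a brief, illustrative aside (independent of `mmlu_benchmark.py`), the snippet below reads that value with the Transformers API and loads the model with it; the local path is a placeholder.

```python
# Illustrative only: read torch_dtype from a Hugging Face checkpoint's
# config.json and load the model with the same precision, mirroring what the
# --dtype flag expects. The checkpoint path is a placeholder.
from transformers import AutoConfig, AutoModelForCausalLM, AutoTokenizer

hf_model_dir = r".\models\Llama-3.1-8B-Instruct"  # placeholder local checkpoint
config = AutoConfig.from_pretrained(hf_model_dir)
print("torch_dtype in config.json:", config.torch_dtype)  # e.g. torch.bfloat16

model = AutoModelForCausalLM.from_pretrained(
    hf_model_dir,
    torch_dtype=config.torch_dtype,  # float16 or bfloat16, matching --dtype
).to("cuda")
tokenizer = AutoTokenizer.from_pretrained(hf_model_dir)
```
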
<details>
<summary>MMLU Benchmark with TensorRT-LLM APIs for TensorRT-LLM Deployment</summary>
<br>

1. **Install TensorRT-LLM and Compatible PyTorch**

   ```powershell
   pip install torch==2.4.0+cu121 --index-url https://download.pytorch.org/whl
   pip install tensorrt_llm==0.12.0 `
       --extra-index-url https://pypi.nvidia.com `
       --extra-index-url https://download.pytorch.org/whl/cu121/torch/
   ```

1. **Run the Benchmark**

   - **Test Suite**

     ```powershell
     python mmlu_benchmark.py `
         --model_name causal `
         --hf_model_dir <hf_model_path> `
         --engine_dir <engine_path> `
         --ep trt-llm `
         --ntrain 5 `
         --output_file result.json
     ```

   - **Specific Subjects**

     ```powershell
     python mmlu_benchmark.py `
         --model_name causal `
         --hf_model_dir <hf_model_path> `
         --engine_dir <engine_path> `
         --ep trt-llm `
         --ntrain 5 `
         --output_file result.json `
         --subject abstract_algebra,anatomy,college_mathematics
     ```

</details>
