.github/ISSUE_TEMPLATE/1_bug_report.md (15 additions, 88 deletions)
@@ -6,17 +6,32 @@ labels: bug
 assignees: ''
 ---
 
+**Before submitting an issue, please make sure it hasn't been already addressed by searching through the [existing and past issues](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues?q=is%3Aissue).**
+
 ## Describe the bug
 <!-- Description of what the bug is, its impact (blocker, should have, nice to have) and any stack traces or error messages. -->
 
+- ?
+
 ### Steps/Code to reproduce bug
 <!-- Please list *minimal* steps or code snippet for us to be able to reproduce the bug. -->
 <!-- A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports. -->
 
+- ?
+
 ### Expected behavior
 
+### Who can help?
+
+<!-- To expedite the response to your issue, it would be helpful if you could identify the appropriate person(s) to tag using the @ symbol.
+If you are unsure about whom to tag, you can leave it blank, and we will make sure to involve the appropriate person. -->
+
+- ?
+
 ## System information
 
+<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->
+
 - Container used (if applicable): ?
 - OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
 - CPU architecture (x86_64, aarch64): ?
@@ -33,91 +48,3 @@ assignees: ''
 - ONNXRuntime: ?
 - TensorRT: ?
 - Any other details that may help: ?
-
-<details>
-<summary><b>Click to expand: Python script to automatically collect system information</b></summary>
+about: Raise an issue here if you don't know how to use ModelOpt
+title: ''
+labels: question
+assignees: ''
+---
+
+Make sure you already checked the [examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples) and [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) before submitting an issue.
+
+## How would you like to use ModelOpt
+
+<!-- Description of what you would like to do with ModelOpt. -->
+
+- ?
+
+### Who can help?
+
+<!-- To expedite the response to your issue, it would be helpful if you could identify the appropriate person(s) to tag using the @ symbol.
+If you are unsure about whom to tag, you can leave it blank, and we will make sure to involve the appropriate person. -->
+
+- ?
+
+## System information
+
+<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->
+
+- Container used (if applicable): ?
+- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
 - **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/README.md>`_ for accuracy evaluation of ONNX models on DirectML (DML).
-- **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus-67373fe7c006ebc1df310613>`_.
+- **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_.
 
 
 \* *This version includes experimental features such as TensorRT deployment of ONNX INT4 models, PyTorch quantization and sparsity. These are currently unverified on Windows.*
CHANGELOG.rst (1 addition, 0 deletions)
@@ -14,6 +14,7 @@ Model Optimizer Changelog (Linux)
 - Add support for MCore MoE PTQ/QAT/QAD.
 - Add support for multi-node PTQ and export with FSDP2 in ``examples/llm_ptq/multinode_ptq.py``. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq#multi-node-post-training-quantization-with-fsdp2>`_ for more details.
 - Add support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
+- Add flags ``nodes_to_include`` and ``op_types_to_include`` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
README.md (1 addition, 1 deletion)
@@ -98,7 +98,7 @@ more fine-grained control on installed dependencies or for alternative docker im
 
 ## Pre-Quantized Checkpoints
 
-- Ready-to-deploy checkpoints \[[🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)\]
+- Ready-to-deploy checkpoints \[[🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer)\]
 - Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang)
docs/source/deployment/2_directml.rst (1 addition, 1 deletion)
@@ -42,4 +42,4 @@ For further details and examples, please refer to the `ONNX Runtime documentatio
 Collection of optimized ONNX models
 ===================================
 
-The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus-67373fe7c006ebc1df310613>`_. These models can be deployed using DirectML backend. Follow the instructions provided along with the published models for deployment.
+The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_. These models can be deployed using the DirectML backend. Follow the instructions provided along with the published models for deployment.
docs/source/guides/8_autocast.rst (16 additions, 1 deletion)
@@ -2,7 +2,7 @@ AutoCast (ONNX)
 ###############
 
 AutoCast is a tool for converting FP32 ONNX models to mixed precision FP32-FP16 or FP32-BF16 models.
-While casting FP32 to FP16/BF16, some nodes might be more sensitive to effecting accuracy.
+While casting FP32 to FP16/BF16, some nodes are more sensitive to reduced precision and can degrade accuracy.
 AutoCast intelligently selects nodes to keep in FP32 precision to maintain model accuracy while benefiting from
 reduced precision on the rest of the nodes. AutoCast automatically injects cast operations around the selected
 nodes.
@@ -31,6 +31,8 @@ AutoCast can also be used programmatically through its Python API:
     low_precision_type="fp16",  # or "bf16"
     nodes_to_exclude=None,  # optional list of node name patterns to keep in FP32
     op_types_to_exclude=None,  # optional list of op types to keep in FP32
+    nodes_to_include=None,  # optional list of node name patterns to force-include in low precision
+    op_types_to_include=None,  # optional list of op types to force-include in low precision
     data_max=512,  # threshold for node outputs
     init_max=65504,  # threshold for initializers
     keep_io_types=False,  # whether to preserve input/output types
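
For context, here is a minimal end-to-end sketch of calling the Python API with the new include flags. It assumes the ``convert_to_mixed_precision`` entry point from ``modelopt.onnx.autocast`` shown in this guide and that it returns an ``onnx.ModelProto``; the model path and the patterns below are illustrative placeholders, not part of this change.

```python
# A sketch only: assumes the convert_to_mixed_precision API from
# modelopt.onnx.autocast as documented in this guide, and that it returns an
# onnx.ModelProto. "model.onnx" and the patterns below are placeholders.
import onnx

from modelopt.onnx.autocast import convert_to_mixed_precision

converted = convert_to_mixed_precision(
    onnx_path="model.onnx",              # FP32 input model
    low_precision_type="fp16",           # target low precision
    op_types_to_exclude=["Resize"],      # keep all Resize nodes in FP32
    nodes_to_include=[r"/decoder/.*"],   # force these nodes into FP16 even if other rules would exclude them
    op_types_to_include=["LayerNormalization"],
    data_max=512,
    init_max=65504,
    keep_io_types=False,
)

onnx.save(converted, "model_fp16.onnx")
```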
@@ -60,6 +62,19 @@ AutoCast follows these steps to convert a model:
 - Analyzes each node in the graph
 - Determines which nodes should remain in FP32 based on input and output tensor magnitudes, operation types and node name patterns
 - If a calibration dataset is provided, it will be used to generate intermediate tensor magnitudes for more accurate node classification, otherwise random data will be used.
+- Use ``nodes_to_include`` and ``op_types_to_include`` to force-include nodes in low precision, even if they would otherwise be excluded.
+
+
+- Default classification rules. Nodes that meet any of these rules will be kept in high precision:
+  - Node I/O magnitudes are higher than ``data_max`` (default: 512). Due to precision limitations, compute on high-magnitude tensors in low precision might not be accurate. The unit in last place (ULP) for 512 is 0.5, for 1024 it is 1.0, etc.
+  - Initializer magnitudes are higher than ``init_max`` (default: 65504). Initializers are often used for non-compute-intensive operations and are more likely to be controlled by the user. However, values above ``init_max`` will cause overflow, therefore they are kept in high precision.
+
+  Additional classification rules (disabled by default):
+  - ``max_depth_of_reduction``: Require nodes with a high depth of reduction (e.g., large matrix multiplications, convolutions with large kernels) to be kept in high precision.
+  - ``nodes_to_exclude``: List of regex patterns for node names to keep in high precision.
+  - ``op_types_to_exclude``: List of operation types to keep in high precision.
+  - ``nodes_to_include``: List of regex patterns for node names to force-include in low precision.
+  - ``op_types_to_include``: List of operation types to force-include in low precision.
+  - ``custom_rule``: Optional custom rule for node classification (inherits from NodeRuleBase).
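
To illustrate the ``custom_rule`` hook, a hypothetical sketch follows. The ``NodeRuleBase`` interface is not shown in this diff, so the import path ``modelopt.onnx.autocast.nodeclassifier``, the ``_check_inner`` method name, and passing the rule via ``custom_rule`` are assumptions for illustration; check the AutoCast source for the actual interface.

```python
# Hypothetical sketch: assumes NodeRuleBase is importable from
# modelopt.onnx.autocast.nodeclassifier and that subclasses implement
# _check_inner(node), returning True when the node must stay in high precision.
from modelopt.onnx.autocast import convert_to_mixed_precision
from modelopt.onnx.autocast.nodeclassifier import NodeRuleBase


class KeepAttentionMatMulsInFP32(NodeRuleBase):
    """Keep MatMul nodes inside attention blocks in high precision."""

    def _check_inner(self, node):
        # node is an onnx.NodeProto; combine a name pattern with an op type,
        # a condition the built-in exclude lists cannot express on their own.
        return node.op_type == "MatMul" and "attention" in node.name


converted = convert_to_mixed_precision(
    onnx_path="model.onnx",  # placeholder path
    low_precision_type="fp16",
    custom_rule=KeepAttentionMatMulsInFP32(),
)
```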