.github/ISSUE_TEMPLATE/1_bug_report.md (15 additions, 88 deletions)
@@ -6,17 +6,32 @@ labels: bug
 assignees: ''
 ---
 
+**Before submitting an issue, please make sure it hasn't been already addressed by searching through the [existing and past issues](https://github.com/NVIDIA/TensorRT-Model-Optimizer/issues?q=is%3Aissue).**
+
 ## Describe the bug
 <!-- Description of what the bug is, its impact (blocker, should have, nice to have) and any stack traces or error messages. -->
 
+- ?
+
 ### Steps/Code to reproduce bug
 <!-- Please list *minimal* steps or code snippet for us to be able to reproduce the bug. -->
 <!-- A helpful guide on how to craft a minimal bug report: http://matthewrocklin.com/blog/work/2018/02/28/minimal-bug-reports. -->
 
+- ?
+
 ### Expected behavior
 
+### Who can help?
+
+<!-- To expedite the response to your issue, it would be helpful if you could identify the appropriate person(s) to tag using the @ symbol.
+If you are unsure about whom to tag, you can leave it blank, and we will make sure to involve the appropriate person. -->
+
+- ?
+
 ## System information
 
+<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->
+
 - Container used (if applicable): ?
 - OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
 - CPU architecture (x86_64, aarch64): ?
@@ -33,91 +48,3 @@ assignees: ''
 - ONNXRuntime: ?
 - TensorRT: ?
 - Any other details that may help: ?
-
-<details>
-<summary><b>Click to expand: Python script to automatically collect system information</b></summary>
+about: Raise an issue here if you don't know how to use ModelOpt
+title: ''
+labels: question
+assignees: ''
+---
+
+Make sure you already checked the [examples](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples) and [documentation](https://nvidia.github.io/TensorRT-Model-Optimizer/) before submitting an issue.
+
+## How would you like to use ModelOpt
+
+<!-- Description of what you would like to do with ModelOpt. -->
+
+- ?
+
+### Who can help?
+
+<!-- To expedite the response to your issue, it would be helpful if you could identify the appropriate person(s) to tag using the @ symbol.
+If you are unsure about whom to tag, you can leave it blank, and we will make sure to involve the appropriate person. -->
+
+- ?
+
+## System information
+
+<!-- Run this script to automatically collect system information: https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/.github/ISSUE_TEMPLATE/get_system_info.py -->
+
+- Container used (if applicable): ?
+- OS (e.g., Ubuntu 22.04, CentOS 7, Windows 10): ? <!-- If Windows, please add the `windows` label to the issue. -->
 - **MMLU Benchmark for Accuracy Evaluations:** Introduced `MMLU benchmarking <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/accuracy_benchmark/README.md>`_ for accuracy evaluation of ONNX models on DirectML (DML).
-- **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus-67373fe7c006ebc1df310613>`_.
+- **Published quantized ONNX models collection:** Published quantized ONNX models at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_.
 
 
 \* *This version includes experimental features such as TensorRT deployment of ONNX INT4 models, PyTorch quantization and sparsity. These are currently unverified on Windows.*
CHANGELOG.rst (1 addition, 0 deletions)
@@ -14,6 +14,7 @@ Model Optimizer Changelog (Linux)
 - Add support for MCore MoE PTQ/QAT/QAD.
 - Add support for multi-node PTQ and export with FSDP2 in ``examples/llm_ptq/multinode_ptq.py``. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq#multi-node-post-training-quantization-with-fsdp2>`_ for more details.
 - Add support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
+- Add flags ``nodes_to_include`` and ``op_types_to_include`` in AutoCast to force-include nodes in low precision, even if they would otherwise be excluded by other rules.
README.md (1 addition, 1 deletion)
@@ -98,7 +98,7 @@ more fine-grained control on installed dependencies or for alternative docker im
 
 ## Pre-Quantized Checkpoints
 
-- Ready-to-deploy checkpoints \[[🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection](https://huggingface.co/collections/nvidia/model-optimizer-66aa84f7966b3150262481a4)\]
+- Ready-to-deploy checkpoints \[[🤗 Hugging Face - Nvidia TensorRT Model Optimizer Collection](https://huggingface.co/collections/nvidia/inference-optimized-checkpoints-with-model-optimizer)\]
 - Deployable on [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM), [vLLM](https://github.com/vllm-project/vllm) and [SGLang](https://github.com/sgl-project/sglang)
docs/source/deployment/2_directml.rst (1 addition, 1 deletion)
@@ -42,4 +42,4 @@ For further details and examples, please refer to the `ONNX Runtime documentatio
 Collection of optimized ONNX models
 ===================================
 
-The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus-67373fe7c006ebc1df310613>`_. These models can be deployed using DirectML backend. Follow the instructions provided along with the published models for deployment.
+The ready-to-deploy optimized ONNX models from ModelOpt-Windows are available at HuggingFace `NVIDIA collections <https://huggingface.co/collections/nvidia/optimized-onnx-models-for-nvidia-rtx-gpus>`_. These models can be deployed using the DirectML backend. Follow the instructions provided along with the published models for deployment.
docs/source/guides/8_autocast.rst (16 additions, 1 deletion)
@@ -2,7 +2,7 @@ AutoCast (ONNX)
 ###############
 
 AutoCast is a tool for converting FP32 ONNX models to mixed precision FP32-FP16 or FP32-BF16 models.
-While casting FP32 to FP16/BF16, some nodes might be more sensitive to effecting accuracy.
+While casting FP32 to FP16/BF16, some nodes are more sensitive to reduced precision and can degrade accuracy.
 AutoCast intelligently selects nodes to keep in FP32 precision to maintain model accuracy while benefiting from
 reduced precision on the rest of the nodes. AutoCast automatically injects cast operations around the selected
 nodes.
@@ -31,6 +31,8 @@ AutoCast can also be used programmatically through its Python API:
     low_precision_type="fp16",  # or "bf16"
     nodes_to_exclude=None,  # optional list of node name patterns to keep in FP32
     op_types_to_exclude=None,  # optional list of op types to keep in FP32
+    nodes_to_include=None,  # optional list of node name patterns to force-include in low precision
+    op_types_to_include=None,  # optional list of op types to force-include in low precision
     data_max=512,  # threshold for node outputs
     init_max=65504,  # threshold for initializers
     keep_io_types=False,  # whether to preserve input/output types
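
For context, here is a minimal end-to-end sketch of calling the Python API with the new include flags. It assumes the ``convert_to_mixed_precision`` entry point from ``modelopt.onnx.autocast`` shown in this guide and that it returns an ``onnx.ModelProto``; the model path and the patterns below are illustrative placeholders, not part of this change.

```python
# A sketch only: assumes the convert_to_mixed_precision API from
# modelopt.onnx.autocast as documented in this guide, and that it returns an
# onnx.ModelProto. "model.onnx" and the patterns below are placeholders.
import onnx

from modelopt.onnx.autocast import convert_to_mixed_precision

converted = convert_to_mixed_precision(
    onnx_path="model.onnx",              # FP32 input model
    low_precision_type="fp16",           # target low precision
    op_types_to_exclude=["Resize"],      # keep all Resize nodes in FP32
    nodes_to_include=[r"/decoder/.*"],   # force these nodes into FP16 even if other rules would exclude them
    op_types_to_include=["LayerNormalization"],
    data_max=512,
    init_max=65504,
    keep_io_types=False,
)

onnx.save(converted, "model_fp16.onnx")
```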
@@ -60,6 +62,19 @@ AutoCast follows these steps to convert a model:
 - Analyzes each node in the graph
 - Determines which nodes should remain in FP32 based on input and output tensor magnitudes, operation types and node name patterns
 - If a calibration dataset is provided, it will be used to generate intermediate tensor magnitudes for more accurate node classification, otherwise random data will be used.
+- Use ``nodes_to_include`` and ``op_types_to_include`` to force-include nodes in low precision, even if they would otherwise be excluded.
+
+
+- Default classification rules. Nodes that meet any of these rules will be kept in high precision:
+  - Node I/O magnitudes are higher than ``data_max`` (default: 512). Due to precision limitations, compute on high-magnitude tensors in low precision might not be accurate. The unit in last place (ULP) for 512 is 0.5, for 1024 it is 1.0, etc.
+  - Initializer magnitudes are higher than ``init_max`` (default: 65504). Initializers are often used for non-compute-intensive operations and are more likely to be controlled by the user. However, values above ``init_max`` will cause overflow, therefore they are kept in high precision.
+
+  Additional classification rules (disabled by default):
+  - ``max_depth_of_reduction``: Require nodes with a high depth of reduction (e.g., large matrix multiplications, convolutions with large kernels) to be kept in high precision.
+  - ``nodes_to_exclude``: List of regex patterns for node names to keep in high precision.
+  - ``op_types_to_exclude``: List of operation types to keep in high precision.
+  - ``nodes_to_include``: List of regex patterns for node names to force-include in low precision.
+  - ``op_types_to_include``: List of operation types to force-include in low precision.
+  - ``custom_rule``: Optional custom rule for node classification (inherits from NodeRuleBase).
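
To illustrate the ``custom_rule`` hook, a hypothetical sketch follows. The ``NodeRuleBase`` interface is not shown in this diff, so the import path ``modelopt.onnx.autocast.nodeclassifier``, the ``_check_inner`` method name, and passing the rule via ``custom_rule`` are assumptions for illustration; check the AutoCast source for the actual interface.

```python
# Hypothetical sketch: assumes NodeRuleBase is importable from
# modelopt.onnx.autocast.nodeclassifier and that subclasses implement
# _check_inner(node), returning True when the node must stay in high precision.
from modelopt.onnx.autocast import convert_to_mixed_precision
from modelopt.onnx.autocast.nodeclassifier import NodeRuleBase


class KeepAttentionMatMulsInFP32(NodeRuleBase):
    """Keep MatMul nodes inside attention blocks in high precision."""

    def _check_inner(self, node):
        # node is an onnx.NodeProto; combine a name pattern with an op type,
        # a condition the built-in exclude lists cannot express on their own.
        return node.op_type == "MatMul" and "attention" in node.name


converted = convert_to_mixed_precision(
    onnx_path="model.onnx",  # placeholder path
    low_precision_type="fp16",
    custom_rule=KeepAttentionMatMulsInFP32(),
)
```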