
Commit 103b1bb

Push latest changes
Signed-off-by: Keval Morabia <[email protected]>
1 parent dba0b37 commit 103b1bb

46 files changed: +555, -1359 lines changed


CHANGELOG.rst

Lines changed: 11 additions & 1 deletion
@@ -1,12 +1,22 @@
 Model Optimizer Changelog (Linux)
 =================================

-0.35 (2025-08-xx)
+0.37 (2025-09-xx)
+^^^^^^^^^^^^^^^^^
+
+**Deprecations**
+
+**Bug Fixes**
+
+**New Features**
+
+0.35 (2025-09-04)
 ^^^^^^^^^^^^^^^^^

 **Deprecations**

 - Deprecate ``torch<2.6`` support.
+- Deprecate NeMo 1.0 model support.

 **Bug Fixes**
README.md

Lines changed: 1 addition & 1 deletion
@@ -20,7 +20,7 @@ The **NVIDIA TensorRT Model Optimizer** (referred to as **Model Optimizer**, or
 **[Input]** Model Optimizer currently supports inputs of a [Hugging Face](https://huggingface.co/), [PyTorch](https://github.com/pytorch/pytorch) or [ONNX](https://github.com/onnx/onnx) model.

 **[Optimize]** Model Optimizer provides Python APIs for users to easily compose the above model optimization techniques and export an optimized quantized checkpoint.
-Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA/NeMo), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.
+Model Optimizer is also integrated with [NVIDIA NeMo](https://github.com/NVIDIA-NeMo/NeMo), [Megatron-LM](https://github.com/NVIDIA/Megatron-LM) and [Hugging Face Accelerate](https://github.com/huggingface/accelerate) for training required inference optimization techniques.

 **[Export for deployment]** Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like [SGLang](https://github.com/sgl-project/sglang), [TensorRT-LLM](https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization), [TensorRT](https://github.com/NVIDIA/TensorRT), or [vLLM](https://github.com/vllm-project/vllm).
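The `[Optimize]` paragraph in the hunk above describes ModelOpt's Python PTQ API only at a high level. As a hedged sketch (not part of this commit), the typical quantize-then-calibrate flow looks roughly like the snippet below; `FP8_DEFAULT_CFG` and `calib_loader` are assumptions based on the project's published examples and should be checked against the installed version.

```python
# Hedged sketch of the ModelOpt PTQ flow described in the README.
# Assumptions: FP8_DEFAULT_CFG exists in your installed modelopt version and
# `calib_loader` yields batches the model accepts.
import modelopt.torch.quantization as mtq

def forward_loop(model):
    # Calibration: run a few representative batches through the model.
    for batch in calib_loader:
        model(**batch)

# Quantize in place using the chosen format and the calibration loop.
model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
```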

docs/source/getting_started/1_overview.rst

Lines changed: 1 addition & 1 deletion
@@ -9,7 +9,7 @@ Minimizing inference costs presents a significant challenge as generative AI mod
 The `NVIDIA TensorRT Model Optimizer <https://github.com/NVIDIA/TensorRT-Model-Optimizer>`_ (referred to as Model Optimizer, or ModelOpt)
 is a library comprising state-of-the-art model optimization techniques including quantization and sparsity to compress model.
 It accepts a torch or ONNX model as input and provides Python APIs for users to easily stack different model optimization
-techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.
+techniques to produce optimized & quantized checkpoints. Seamlessly integrated within the NVIDIA AI software ecosystem, the quantized checkpoint generated from Model Optimizer is ready for deployment in downstream inference frameworks like `TensorRT-LLM <https://github.com/NVIDIA/TensorRT-LLM/tree/main/examples/quantization>`_ or `TensorRT <https://github.com/NVIDIA/TensorRT>`_ (Linux). ModelOpt is integrated with `NVIDIA NeMo <https://github.com/NVIDIA-NeMo/NeMo>`_ and `Megatron-LM <https://github.com/NVIDIA/Megatron-LM>`_ for training-in-the-loop optimization techniques. For enterprise users, the 8-bit quantization with Stable Diffusion is also available on `NVIDIA NIM <https://developer.nvidia.com/blog/nvidia-nim-offers-optimized-inference-microservices-for-deploying-ai-models-at-scale/>`_.

 For Windows users, the `TensorRT Model Optimizer for Windows <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/windows/README.md>`_ (ModelOpt-Windows) delivers model compression techniques, including quantization, on Windows RTX PC systems. ModelOpt-Windows is optimized for efficient quantization, featuring local GPU calibration, reduced system and video memory consumption, and swift processing times. It integrates seamlessly with the Windows ecosystem, with optimized ONNX models as output for `Microsoft DirectML <https://github.com/microsoft/DirectML>`_ backends. Furthermore, ModelOpt-Windows supports SDKs such as `Microsoft Olive <https://github.com/microsoft/Olive>`_ and `ONNX Runtime <https://github.com/microsoft/onnxruntime>`_, enabling the deployment of quantized models across various independent hardware vendors through the DirectML path.

docs/source/guides/3_pruning.rst

Lines changed: 1 addition & 1 deletion
@@ -4,7 +4,7 @@ Pruning

 .. tip::

-    Checkout `Llama 3.1 NeMo Minitron Pruning <https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama/pruning-distillation>`_ and
+    Checkout `Llama 3.1 NeMo Minitron Pruning <https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation>`_ and
     `ResNet20 on CIFAR-10 Notebook <https://github.com/NVIDIA/TensorRT-Model-Optimizer/blob/main/examples/pruning/cifar_resnet.ipynb>`_
     for an end-to-end example of pruning.

docs/source/guides/7_nas.rst

Lines changed: 4 additions & 4 deletions
@@ -361,11 +361,11 @@ can be converted into searchable units:
     # search over the number of layers (depth) in the sequential layer.
     nn.Sequential

-    # We convert Megatron-core / NeMo GPT-style models (e.g. Llama3.1, NeMo Mistral, etc.)
+    # We convert Megatron-core / NeMo GPT or Mamba style models (e.g. Llama3.1, NeMo Mistral, NeMotron-H, etc.)
     # to automatically search over the MLP hidden size, number of attention heads, number of GQA groups,
-    # and depth of the model.
-    megatron.core.transformer.module.MegatronModule
-    nemo.collections.nlp.models.language_modeling.megatron_gpt_model.MegatronGPTModel
+    # number of mamba heads, mamba head dimension, and depth of the model.
+    megatron.core.models.gpt.GPTModel
+    megatron.core.models.mamba.MambaModel
     nemo.collections.llm.gpt.model.base.GPTModel

     # We convert Hugging Face Attention layers to automatically search over the number of heads

examples/deepseek/ptq.py

Lines changed: 5 additions & 5 deletions
@@ -276,7 +276,7 @@ def calibrate_loop(model):
     mtq_cfg["quant_cfg"]["*attn*weight_quantizer"] = {"num_bits": (4, 3), "axis": None}
     mtq_cfg["quant_cfg"]["*attn*input_quantizer"] = {"num_bits": (4, 3), "axis": None}

-    if args.enable_wo_quant and "FP4" in quant_cfg:
+    if not args.disable_wo_quant and "FP4" in quant_cfg:
         mtq_cfg["quant_cfg"]["*wo*weight_quantizer"] = mtq_cfg["quant_cfg"]["*input_quantizer"]
         mtq_cfg["quant_cfg"]["*wo*input_quantizer"] = mtq_cfg["quant_cfg"]["*weight_quantizer"]
     ## ptq
@@ -287,7 +287,7 @@ def calibrate_loop(model):
     return model


-def save_amax_and_quant_config(model, output_path: str, enable_fp8_kvcache: bool):
+def save_amax_and_quant_config(model, output_path: str, enable_fp8_kvcache: bool = True):
     """Saves the amax values of the model to the output path."""
     world_size = int(os.getenv("WORLD_SIZE", "1"))
     rank = int(os.getenv("RANK", "0"))
@@ -353,8 +353,8 @@ def state_dict_filter(state_dict):
     )
     parser.add_argument("--batch_size", type=int, default=8, help="batch size for quantization.")
     parser.add_argument("--calib_size", type=int, default=512, help="samples for calibration.")
-    parser.add_argument("--enable_fp8_kvcache", type=bool, default=True, help="enable fp8 kvcache.")
-    parser.add_argument("--enable_wo_quant", action="store_true", help="enable MLA wo quant.")
+    parser.add_argument("--disable_fp8_kvcache", action="store_true", help="disable fp8 kvcache.")
+    parser.add_argument("--disable_wo_quant", action="store_true", help="disable MLA wo quant.")
     parser.add_argument("--trust_remote_code", action="store_true", help="trust remote code.")

     args = parser.parse_args()
@@ -363,4 +363,4 @@ def state_dict_filter(state_dict):
         args.model_path, trust_remote_code=args.trust_remote_code
     )
     model = ptq(model, tokenizer, args.quant_cfg, args.batch_size, args.calib_size)
-    save_amax_and_quant_config(model, args.output_path, args.enable_fp8_kvcache)
+    save_amax_and_quant_config(model, args.output_path, not args.disable_fp8_kvcache)
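The flag changes above swap `type=bool` arguments for `store_true` flags. The short, self-contained sketch below (not from this repository) illustrates why: argparse passes the raw command-line string through `bool()`, and any non-empty string, including `"False"`, evaluates to `True`.

```python
# Minimal sketch of why the commit replaces `type=bool` flags with `store_true`:
# argparse applies bool() to the raw string, and bool() of any non-empty string
# -- including "False" -- is True, so the old flag could never be turned off.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--enable_fp8_kvcache", type=bool, default=True)   # old style: misleading
parser.add_argument("--disable_fp8_kvcache", action="store_true")      # new style: unambiguous

args = parser.parse_args(["--enable_fp8_kvcache", "False"])
print(args.enable_fp8_kvcache)        # True -- "False" is a non-empty string
print(not args.disable_fp8_kvcache)   # True unless --disable_fp8_kvcache is passed
```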

examples/llm_distill/README.md

Lines changed: 1 addition & 1 deletion
@@ -140,7 +140,7 @@ Loss balancers:

 Checkout the stand-alone distillation script in the [NVIDIA NeMo repository](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/distillation/distillation.html).

-You can also look at the tutorial notebooks [here](https://github.com/NVIDIA/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework.
+You can also look at the tutorial notebooks [here](https://github.com/NVIDIA-NeMo/NeMo/tree/main/tutorials/llm/llama/pruning-distillation) which showcase the usage of Minitron pruning followed by distillation for Llama 3.1 8B step-by-step in NeMo framework.

 ## Knowledge Distillation (KD) for HuggingFace Models

examples/llm_eval/gen_model_answer.py

Lines changed: 1 addition & 16 deletions
@@ -119,7 +119,6 @@ def run_eval(
     dtype,
     revision,
     engine_dir,
-    vocab_file,
     nim_model,
     args,
 ):
@@ -152,7 +151,6 @@ def run_eval(
             top_p=top_p,
             temperature=temperature,
             engine_dir=engine_dir,
-            vocab_file=vocab_file,
             nim_model=nim_model,
         )
         for i in range(0, len(questions), chunk_size)
@@ -177,18 +175,11 @@ def get_model_answers(
     top_p=None,
     temperature=None,
     engine_dir=None,
-    vocab_file=None,
     nim_model=None,
 ):
     # Model Optimizer modification
     if engine_dir:
-        if vocab_file:
-            from modelopt.deploy.llm.nemo_utils import get_nemo_tokenizer
-
-            tokenizer = get_nemo_tokenizer(vocab_file)
-        else:
-            model_ckpt_path = model_path
-            tokenizer = get_tokenizer(model_ckpt_path, trust_remote_code=args.trust_remote_code)
+        tokenizer = get_tokenizer(model_path, trust_remote_code=args.trust_remote_code)
     if engine_dir:
         # get model type
         last_part = os.path.basename(engine_dir)
@@ -440,11 +431,6 @@ def reorg_answer_file(answer_file):
         type=str,
         help="The path to the TensorRT LLM engine directory.",
     )
-    parser.add_argument(
-        "--vocab-file",
-        type=str,
-        help="The path to the vocabulary file.",
-    )
     parser.add_argument(
         "--nim-model",
         type=str,
@@ -517,7 +503,6 @@ def reorg_answer_file(answer_file):
         dtype=str_to_torch_dtype(args.dtype),
         revision=args.revision,
         engine_dir=args.engine_dir,
-        vocab_file=args.vocab_file,
         nim_model=args.nim_model,
         args=args,
     )

examples/llm_eval/mmlu.py

Lines changed: 2 additions & 9 deletions
@@ -250,15 +250,8 @@ def main(
     # Model Optimizer modification
     # Enable automatic save/load of modelopt state huggingface checkpointing
     mto.enable_huggingface_checkpointing()
-    if vocab_file := kwargs.get("vocab_file"):
-        from modelopt.deploy.llm.nemo_utils import get_nemo_tokenizer
-
-        tokenizer = get_nemo_tokenizer(vocab_file)
-    else:
-        model_ckpt_path = kwargs["model_path"]
-        tokenizer = get_tokenizer(
-            model_ckpt_path, trust_remote_code=kwargs.get("trust_remote_code", False)
-        )
+    model_path = kwargs["model_path"]
+    tokenizer = get_tokenizer(model_path, trust_remote_code=kwargs.get("trust_remote_code", False))
     if kwargs.get("engine_dir"):
         # get model type
         last_part = os.path.basename(kwargs["engine_dir"])
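Both evaluation scripts now resolve the tokenizer straight from the Hugging Face model path. As a hedged illustration only, a `get_tokenizer`-style helper usually reduces to the standard `AutoTokenizer` call below; the pad-token fallback is an assumption for the sketch, not necessarily what `examples/llm_eval` actually does.

```python
# Hedged sketch: what a get_tokenizer-style helper typically wraps.
# Assumption: the helper resolves to a standard AutoTokenizer call; check the
# actual implementation in examples/llm_eval before relying on details.
from transformers import AutoTokenizer

def load_tokenizer(model_path: str, trust_remote_code: bool = False):
    tokenizer = AutoTokenizer.from_pretrained(
        model_path, trust_remote_code=trust_remote_code
    )
    # Many causal-LM tokenizers ship without a pad token; fall back to EOS.
    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token
    return tokenizer
```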

examples/llm_ptq/README.md

Lines changed: 14 additions & 51 deletions
@@ -105,45 +105,26 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http

 ## Support Matrix

-### Supported Models
+### Hugging Face Supported Models

 | Model | fp8 | int8_sq | int4_awq | w4a8_awq<sup>1</sup> | nvfp4<sup>5</sup> |
 | :---: | :---: | :---: | :---: | :---: | :---: |
-| GPTJ ||||| - |
-| LLAMA 2 ||||| - |
-| LLAMA 3, 3.1, 3.3 |||| ✅<sup>3</sup> ||
+| LLAMA 3.x |||| ✅<sup>3</sup> ||
 | LLAMA 4 <sup>6</sup> ||||||
-| LLAMA 2 (Nemo) ||||| - |
-| CodeLlama ||||| - |
-| Mistral ||||||
-| Mixtral 8x7B, 8x22B ||| ✅<sup>2</sup> |||
-| Snowflake Arctic<sup>2</sup> ||||| - |
-| Falcon 40B, 180B ||||| - |
-| Falcon 7B ||||| - |
-| MPT 7B, 30B ||||| - |
-| Baichuan 1, 2 ||||| - |
-| ChatGLM2, 3 6B ||||| - |
-| Bloom ||||| - |
-| Phi-1,2,3,4 |||| ✅<sup>3</sup> | - |
+| Mixtral ||| ✅<sup>2</sup> |||
+| Phi-3,4 |||| ✅<sup>3</sup> | - |
 | Phi-3.5 MOE ||||| - |
 | Llama-Nemotron Super ||||||
 | Llama-Nemotron Ultra ||||||
-| Nemotron 8B ||||| - |
-| Gemma 2B, 7B ||||| - |
-| Gemma 3 1B | ✅<sup>2</sup> |||| - |
-| RecurrentGemma 2B ||||| - |
-| StarCoder 2 ||||| - |
+| Gemma 3 | ✅<sup>2</sup> | - || - | - |
 | QWen 2, 2.5 <sup>4</sup> ||||||
-| QWen MOE || - | - | - ||
 | QWen3 MOE <sup>6</sup> || - | - | - ||
 | QwQ || - | - | - ||
-| DBRX ||||| - |
-| InternLM2 |||| ✅<sup>3</sup> | - |
-| Exaone ||||| - |
-| Minitron |||| ✅<sup>2</sup> ||
 | T5 ||||| - |
 | Whisper ||||| - |

+> *This is a subset of the models supported. For the full list please check the [TensorRT-LLM support matrix](https://nvidia.github.io/TensorRT-LLM/reference/precision.html#support-matrix)*
+
 > *<sup>1.</sup>The w4a8_awq is an experimental quantization scheme that may result in a higher accuracy penalty.* \
 > *<sup>2.</sup>For some models, there is only support for exporting quantized checkpoints.* \
 > *<sup>3.</sup>W4A8_AWQ is only available on some models but not all* \
@@ -155,6 +136,10 @@ Please reference our [framework scripts](#framework-scripts) and our [docs](http

 > You can also create your own custom config using [this](https://nvidia.github.io/TensorRT-Model-Optimizer/guides/_pytorch_quantization.html#custom-calibration-algorithm) guide.

+### NeMo Supported Models
+
+Please refer to the [NeMo 2.0 PTQ documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/quantization/quantization.html#support-matrix) for supported models.
+
 ## AutoQuantize

 [AutoQuantize (`mtq.auto_quantize`)](https://nvidia.github.io/TensorRT-Model-Optimizer/reference/generated/modelopt.torch.quantization.model_quant.html#modelopt.torch.quantization.model_quant.auto_quantize) is a PTQ algorithm which quantizes a model by searching for the best quantization format per-layer while meeting performance constraints specified by the user. `AutoQuantize` streamlines the trade-off of model accuracy and performance.
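As a rough, hedged sketch of the API referenced in the hunk above (argument names follow the linked reference docs but may differ between ModelOpt releases, and `calib_loader` is a placeholder), an `auto_quantize` call looks approximately like this:

```python
# Hedged sketch of mtq.auto_quantize; verify argument names against the
# modelopt version you have installed before use.
import modelopt.torch.quantization as mtq

model, search_state = mtq.auto_quantize(
    model,
    constraints={"effective_bits": 4.8},            # target average bits per weight
    quantization_formats=["FP8_DEFAULT_CFG", "NVFP4_DEFAULT_CFG"],
    data_loader=calib_loader,                        # placeholder calibration loader
    forward_step=lambda model, batch: model(**batch),
    loss_func=lambda output, batch: output.loss,     # assumes batches include labels
)
```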
@@ -224,18 +209,6 @@ The example scripts above also have an additional flag `--tasks`, where the actu

 > *NOTE: AutoQuantize requires backpropagation of the model. Models without backpropagation support (e.g., Llama-4) will not work with AutoQuantize.*

-### AutoQuantize for NeMo models
-
-The usage is similar for NeMo models to perform `AutoQuantize`. Please refer to the [NeMo Example Script](#nemo-example-script) section for the full setup instructions.
-
-[Script](./scripts/nemo_example.sh)
-
-```bash
-# --auto_quantize_bits specifies the constraint for `AutoQuantize`
-# --quant specifies the formats to be searched for `AutoQuantize`. Multiple formats can be searched over by passing them as comma separated values
-scripts/nemo_example.sh --type gpt --model $GPT_MODEL_FILE --quant fp8,int4_awq --auto_quantize_bits 6.4 --tp [1|2|4|8]
-```
-
 ## Real Quant

 When working with large language models, memory constraints can be a significant challenge. ModelOpt provides a workflow for initializing HF models with compressed weights across multiple GPUs to dramatically reduce memory usage. Check `--low_memory_mode` option in hf_ptq.py for more details.
@@ -280,27 +253,17 @@ scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_

 > *You can now add `--low_memory_mode` to the command when setting `--export_fmt=hf` to lower the memory requirements of the PTQ process. With this mode, the script will compress model weights to low precision before calibration. This mode is only supported for FP8 and NVFP4 with max calibration.*

-#### Llama 4
-
-We support FP8 and NVFP4 quantized Llama 4 model Hugging Face checkpoint export using the following command:
-
-```bash
-python hf_ptq.py --pyt_ckpt_path=<llama4 model path> --export_path=<quantized hf checkpoint> --qformat=[fp8|nvfp4] --export_fmt=hf
-```
-
-The quantized checkpoint can be deployed following the TensorRT-LLM instructions. Note since we only quantize the language model in Llama 4, the exported config has `Llama4ForCausalLM`, but TensorRT-LLM expects `Llama4ForConditionalGeneration` which is from the original Llama 4. Therefore our script will copy over the original config files to the exported checkpoint folder.
-
 #### Deepseek R1

 [PTQ for DeepSeek](../deepseek/README.md) shows how to quantize the DeepSeek model with FP4 and export to TensorRT-LLM.

-### NeMo Example [Script](./scripts/nemo_example.sh)
+### NeMo Example Script

-Please refer to the [NeMo PTQ documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/quantization/quantization.html) for more details.
+NeMo 2.0 framework PTQ and TensorRT-LLM deployment examples are maintained in the NeMo GitHub repo. Please refer to the [NeMo PTQ documentation](https://docs.nvidia.com/nemo-framework/user-guide/latest/model-optimization/quantization/quantization.html) for more details.

 ### Megatron-LM Example Script

-Megatron-LM framework PTQ and TensorRT-LLM deployment examples are maintained in the Megatron-LM GitHub repo. Please refer to the examples [here](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/export).
+Megatron-LM framework PTQ and TensorRT-LLM deployment examples are maintained in the Megatron-LM GitHub repo. Please refer to the examples [here](https://github.com/NVIDIA/Megatron-LM/tree/main/examples/post_training/modelopt).

 ## Evaluate Accuracy
