Commit cb98b22

Merge branch 'main' into QAT-Walkthrough-Notebook
2 parents e587a06 + 9e68994

79 files changed (+3824 / -1300 lines)


.github/CODEOWNERS

Lines changed: 2 additions & 1 deletion
@@ -43,8 +43,9 @@ examples/llm_eval @NVIDIA/modelopt-examples-llm_ptq-codeowners
 examples/llm_ptq @NVIDIA/modelopt-examples-llm_ptq-codeowners
 examples/llm_qat @NVIDIA/modelopt-examples-llm_qat-codeowners
 examples/llm_sparsity @NVIDIA/modelopt-torch-sparsity-codeowners
+examples/megatron-lm @NVIDIA/modelopt-examples-megatron-codeowners
 examples/model_hub @NVIDIA/modelopt-examples-model_hub-codeowners
-examples/nemo_run @NVIDIA/modelopt-examples-nemo_run-codeowners
+examples/nemo_run @NVIDIA/modelopt-examples-megatron-codeowners
 examples/onnx_ptq @NVIDIA/modelopt-onnx-codeowners
 examples/pruning @NVIDIA/modelopt-torch-nas-prune-codeowners
 examples/speculative_decoding @NVIDIA/modelopt-torch-speculative-codeowners

CHANGELOG.rst

Lines changed: 1 addition & 0 deletions
@@ -26,6 +26,7 @@ Model Optimizer Changelog (Linux)
 - Add support for ``mamba_num_heads``, ``mamba_head_dim``, ``hidden_size`` and ``num_layers`` pruning for Megatron Core Mamba or Hybrid Transformer Mamba models in ``mcore_minitron`` (previously ``mcore_gpt_minitron``) mode.
 - Add example for QAT/QAD training with `LLaMA Factory <https://github.com/hiyouga/LLaMA-Factory/tree/main>`_. See ``examples/llm_qat/llama_factory`` for more details.
 - Upgrade TensorRT-LLM dependency to 1.0.0rc6.
+- Add unified HuggingFace model export support for quantized NVFP4 GPT-OSS models.

 0.33 (2025-07-14)
 ^^^^^^^^^^^^^^^^^
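
To see what the new changelog entry corresponds to in practice, here is a minimal sketch of quantizing and exporting a GPT-OSS checkpoint in NVFP4, assuming ModelOpt's `mtq.NVFP4_DEFAULT_CFG` config and the `export_hf_checkpoint` unified-export API; the model name and calibration loop are placeholders, not part of this commit.

```python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM

# Placeholder model; substitute the GPT-OSS checkpoint being quantized.
model = AutoModelForCausalLM.from_pretrained("openai/gpt-oss-20b", torch_dtype="bfloat16")

def forward_loop(m):
    # Placeholder calibration loop: run a few representative batches through `m`.
    pass

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)  # NVFP4 PTQ calibration
export_hf_checkpoint(model, export_dir="gpt-oss-20b-nvfp4")  # unified HuggingFace checkpoint
```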

README.md

Lines changed: 5 additions & 4 deletions
@@ -18,6 +18,7 @@

 ## Latest News

+- [2025/08/29] [Fine-Tuning gpt-oss for Accuracy and Performance with Quantization Aware Training](https://developer.nvidia.com/blog/fine-tuning-gpt-oss-for-accuracy-and-performance-with-quantization-aware-training/)
 - [2025/08/01] [Optimizing LLMs for Performance and Accuracy with Post-Training Quantization](https://developer.nvidia.com/blog/optimizing-llms-for-performance-and-accuracy-with-post-training-quantization/)
 - [2025/06/24] [Introducing NVFP4 for Efficient and Accurate Low-Precision Inference](https://developer.nvidia.com/blog/introducing-nvfp4-for-efficient-and-accurate-low-precision-inference/)
 - [2025/05/14] [NVIDIA TensorRT Unlocks FP4 Image Generation for NVIDIA Blackwell GeForce RTX 50 Series GPUs](https://developer.nvidia.com/blog/nvidia-tensorrt-unlocks-fp4-image-generation-for-nvidia-blackwell-geforce-rtx-50-series-gpus/)
@@ -29,14 +30,14 @@
 - [2025/01/28] Model Optimizer is now open source!
 - [2024/10/23] Model Optimizer quantized FP8 Llama-3.1 Instruct models available on Hugging Face for download: [8B](https://huggingface.co/nvidia/Llama-3.1-8B-Instruct-FP8), [70B](https://huggingface.co/nvidia/Llama-3.1-70B-Instruct-FP8), [405B](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP8).
 - [2024/09/10] [Post-Training Quantization of LLMs with NVIDIA NeMo and TensorRT Model Optimizer](https://developer.nvidia.com/blog/post-training-quantization-of-llms-with-nvidia-nemo-and-nvidia-tensorrt-model-optimizer/).
-- [2024/08/28] [Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
-- [2024/08/28] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
-- [2024/08/15] New features in recent releases: [Cache Diffusion](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/24.09/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
-- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)

 <details close>
 <summary>Previous News</summary>

+- [2024/08/28] [Boosting Llama 3.1 405B Performance up to 44% with TensorRT Model Optimizer on NVIDIA H200 GPUs](https://developer.nvidia.com/blog/boosting-llama-3-1-405b-performance-by-up-to-44-with-nvidia-tensorrt-model-optimizer-on-nvidia-h200-gpus/)
+- [2024/08/28] [Up to 1.9X Higher Llama 3.1 Performance with Medusa](https://developer.nvidia.com/blog/low-latency-inference-chapter-1-up-to-1-9x-higher-llama-3-1-performance-with-medusa-on-nvidia-hgx-h200-with-nvlink-switch/)
+- [2024/08/15] New features in recent releases: [Cache Diffusion](https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/diffusers/cache_diffusion), [QLoRA workflow with NVIDIA NeMo](https://docs.nvidia.com/nemo-framework/user-guide/24.09/sft_peft/qlora.html), and more. Check out [our blog](https://developer.nvidia.com/blog/nvidia-tensorrt-model-optimizer-v0-15-boosts-inference-performance-and-expands-model-support/) for details.
+- [2024/06/03] Model Optimizer now has an experimental feature to deploy to vLLM as part of our effort to support popular deployment frameworks. Check out the workflow [here](./examples/llm_ptq/README.md#deploy-fp8-quantized-model-using-vllm)
 - [2024/05/08] [Announcement: Model Optimizer Now Formally Available to Further Accelerate GenAI Inference Performance](https://developer.nvidia.com/blog/accelerate-generative-ai-inference-performance-with-nvidia-tensorrt-model-optimizer-now-publicly-available/)
 - [2024/03/27] [Model Optimizer supercharges TensorRT-LLM to set MLPerf LLM inference records](https://developer.nvidia.com/blog/nvidia-h200-tensor-core-gpus-and-nvidia-tensorrt-llm-set-mlperf-llm-inference-records/)
 - [2024/03/18] [GTC Session: Optimize Generative AI Inference with Quantization in TensorRT-LLM and TensorRT](https://www.nvidia.com/en-us/on-demand/session/gtc24-s63213/)

examples/chained_optimizations/bert_prune_distill_quantize.py

Lines changed: 1 addition & 0 deletions
@@ -1107,6 +1107,7 @@ def main(input_args: list[str] | None = None) -> None:
         format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
         datefmt="%m/%d/%Y %H:%M:%S",
         level=logging.INFO,
+        force=True,
     )
     logger.info(accelerator.state, main_process_only=False)
     if accelerator.is_local_main_process:
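
For context on the `force=True` addition above: `logging.basicConfig` is a no-op once the root logger already has handlers, which can happen when another imported library configures logging first. A minimal standalone sketch of the behavior (not part of this commit):

```python
import logging

# First configuration, e.g. done implicitly by an imported library.
logging.basicConfig(level=logging.WARNING)

# Without force=True this second call would be silently ignored because the root
# logger already has a handler; force=True removes the existing handlers and
# applies the new format and level.
logging.basicConfig(
    format="%(asctime)s [%(levelname)s] %(name)s - %(message)s",
    level=logging.INFO,
    force=True,
)
logging.getLogger(__name__).info("now visible at INFO level")
```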

examples/diffusers/quantization/diffusion_trt.py

Lines changed: 16 additions & 2 deletions
@@ -22,7 +22,7 @@
     remove_nesting,
     update_dynamic_axes,
 )
-from quantize import create_pipeline
+from quantize import ModelType, PipelineManager

 import modelopt.torch.opt as mto
 from modelopt.torch._deploy._runtime import RuntimeRegistry
@@ -31,6 +31,20 @@
 from modelopt.torch._deploy.device_model import DeviceModel
 from modelopt.torch._deploy.utils import get_onnx_bytes_and_metadata

+MODEL_ID = {
+    "sdxl-1.0": ModelType.SDXL_BASE,
+    "sdxl-turbo": ModelType.SDXL_TURBO,
+    "sd3-medium": ModelType.SD3_MEDIUM,
+    "flux-dev": ModelType.FLUX_DEV,
+    "flux-schnell": ModelType.FLUX_SCHNELL,
+}
+
+dtype_map = {
+    "Half": torch.float16,
+    "BFloat16": torch.bfloat16,
+    "Float": torch.float32,
+}
+

 def generate_image(pipe, prompt, image_name):
     seed = 42
@@ -91,7 +105,7 @@ def main():

     image_name = args.save_image_as if args.save_image_as else f"{args.model}.png"

-    pipe = create_pipeline(args.model, args.model_dtype, args.override_model_path)
+    pipe = PipelineManager.create_pipeline_from(MODEL_ID[args.model], dtype_map[args.model_dtype])

     # Save the backbone of the pipeline and move it to the GPU
     add_embedding = None
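
Taken together, the two new module-level dictionaries translate the script's CLI strings into the `ModelType` enum and `torch.dtype` that the new static constructor expects. A minimal usage sketch, with illustrative values standing in for the argparse results:

```python
import torch
from quantize import ModelType, PipelineManager  # examples/diffusers/quantization/quantize.py

# Illustrative stand-ins for the argparse values used in diffusion_trt.py.
args_model, args_model_dtype = "flux-dev", "BFloat16"

MODEL_ID = {"flux-dev": ModelType.FLUX_DEV}  # subset of the module-level mapping above
dtype_map = {"BFloat16": torch.bfloat16}

# Same call the updated main() makes: enum + torch dtype instead of raw strings.
pipe = PipelineManager.create_pipeline_from(MODEL_ID[args_model], dtype_map[args_model_dtype])
```

Note that, unlike the old `create_pipeline(...)` call, the new constructor is not passed `args.override_model_path`.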

examples/diffusers/quantization/quantize.py

Lines changed: 31 additions & 0 deletions
@@ -306,6 +306,37 @@ def __init__(self, config: ModelConfig, logger: logging.Logger):
         self.pipe: DiffusionPipeline | None = None
         self.pipe_upsample: LTXLatentUpsamplePipeline | None = None  # For LTX-Video upsampling

+    @staticmethod
+    def create_pipeline_from(
+        model_type: ModelType, torch_dtype: torch.dtype = torch.bfloat16
+    ) -> DiffusionPipeline:
+        """
+        Create and return an appropriate pipeline based on configuration.
+
+        Returns:
+            Configured diffusion pipeline
+
+        Raises:
+            ValueError: If model type is unsupported
+        """
+        try:
+            model_id = MODEL_REGISTRY[model_type]
+            if model_type == ModelType.SD3_MEDIUM:
+                pipe = StableDiffusion3Pipeline.from_pretrained(model_id, torch_dtype=torch_dtype)
+            elif model_type in [ModelType.FLUX_DEV, ModelType.FLUX_SCHNELL]:
+                pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch_dtype)
+            else:
+                # SDXL models
+                pipe = DiffusionPipeline.from_pretrained(
+                    model_id,
+                    torch_dtype=torch_dtype,
+                    use_safetensors=True,
+                )
+            pipe.set_progress_bar_config(disable=True)
+            return pipe
+        except Exception as e:
+            raise e
+
     def create_pipeline(self) -> DiffusionPipeline:
         """
         Create and return an appropriate pipeline based on configuration.

examples/gpt-oss/README.md

Lines changed: 2 additions & 0 deletions
@@ -49,6 +49,8 @@ model = mtq.quantize(model, config, forward_loop)
 train(model, train_loader, optimizer, scheduler, ...)
 ```

+For an end to end example showcasing the above workflow, checkout [qat-finetune-transformers.ipynb](/examples/gpt-oss/qat-finetune-transformers.ipynb).
+
 If you are training Huggingface models with trainer classes from Huggingface such as [SFTTrainer](https://huggingface.co/docs/trl/en/sft_trainer) performing QAT is even easier - simply replace the trainer with its equivalent, `QATSFTTrainer` from ModelOpt and specify additional quantization arguments to it. `QATSFTTrainer` will perform the necessary quantization steps in the backend and train the model just like the original `SFTTrainer`.

 A real end-to-end example for this is in `sft.py` in this folder. To perform QAT with full parameter SFT on GPT-OSS 20B model, run:

examples/gpt-oss/convert_oai_mxfp4_weight_only.py

Lines changed: 13 additions & 9 deletions
@@ -23,11 +23,8 @@
 from transformers import AutoModelForCausalLM, AutoTokenizer, Mxfp4Config
 from utils import get_original_huggingface_quant_method

-import modelopt.torch.opt as mto
 from modelopt.torch.quantization.qtensor import MXFP4QTensor

-mto.enable_huggingface_checkpointing()
-

 def _to_oai_mxfp4_weight_only(model, block_size=32):
     new_state_dict = {}
@@ -36,15 +33,20 @@ def _to_oai_mxfp4_weight_only(model, block_size=32):
         # Only convert experts weights, skip bias and other modules
         if "experts" in name and "bias" not in name:
             param = param.transpose(-1, -2).contiguous()
-            quantized, scales = MXFP4QTensor.quantize(param, block_size=block_size)
-
-            shape = quantized._quantized_data.shape
+            quantized_tensors = []
+            scales_tensors = []
+            for expert in param:
+                quantized, scales = MXFP4QTensor.quantize(expert, block_size=block_size)
+                quantized_tensors.append(quantized._quantized_data)
+                scales_tensors.append(scales)
+            quantized = torch.stack(quantized_tensors)
+            scales = torch.stack(scales_tensors)
+
+            shape = quantized.shape
             # Add converted weights and scales to state_dict
             new_state_dict.update(
                 {
-                    f"{name}_blocks": quantized._quantized_data.view(
-                        shape[0], shape[1], -1, block_size // 2
-                    ).cpu(),
+                    f"{name}_blocks": quantized.view(shape[0], shape[1], -1, block_size // 2).cpu(),
                     f"{name}_scales": scales.view(shape[0], shape[1], -1).cpu(),
                 }
             )
@@ -134,6 +136,8 @@ def create_parser():
     if args.lora_path:
         model = PeftModel.from_pretrained(model, args.lora_path)
        model = model.merge_and_unload()  # Merge LoRA-QAT adapter weights to base model
+        torch.cuda.empty_cache()
+        gc.collect()

     # Load tokenizer
     tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
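
For readers unfamiliar with the `block_size // 2` reshape in the hunk above: MXFP4 stores one 4-bit code per element, so two codes pack into each byte and a 32-element block occupies 16 bytes. A minimal sketch of that packing arithmetic (illustrative only; the real E2M1 encoding, scale computation, and nibble order live in `MXFP4QTensor.quantize`):

```python
import torch

block_size = 32
codes = torch.randint(0, 16, (block_size,), dtype=torch.uint8)  # one 4-bit code per element

# Two 4-bit codes share a byte, so a 32-element block packs into 16 bytes, which is
# why the expert weight blocks are viewed as (..., -1, block_size // 2) before saving.
packed = codes[0::2] | (codes[1::2] << 4)
assert packed.numel() == block_size // 2
```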
