Commit 9cb36cb

Update files for 0.27.1 release
1 parent 54f4e3c commit 9cb36cb

37 files changed, +571 −583 lines changed

CHANGELOG.rst

Lines changed: 3 additions & 1 deletion
@@ -10,14 +10,16 @@ Model Optimizer Changelog (Linux)
 
 **New Features**
 
-- New model support in the ``llm_ptq`` example: OpenAI Whisper.
+- New model support in the ``llm_ptq`` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MOE.
 - Blockwise FP8 quantization support in unified model export.
 - Add quantization support to the Transformer Engine Linear module.
 - Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
 - To support distributed checkpoint resume expert-parallel (EP), ``modelopt_state`` in Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored differently. The legacy ``modelopt_state`` in the distributed checkpoint generated by previous modelopt version can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
 - Add triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
 - Add a new API :meth:`mtq.compress <modelopt.torch.quantization.compress>` for model compression for weights after quantization.
 - Add option to simplify ONNX model before quantization is performed.
+- Add FP4 KV cache support for unified HF and TensorRT-LLM export.
+- Add speculative decoding support to Multi-Token Prediction (MTP) in Megatron Core models.
 - (Experimental) Improve support for ONNX models with custom TensorRT op:
   - Add support for ``--calibration_shapes`` flag.
   - Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
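
The ``mtq.compress`` API listed in the changelog is the same call ``hf_ptq.py`` gains further down in this commit (``mtq.compress(model)`` after quantization). Below is a minimal sketch of the intended quantize-then-compress flow; the toy model, the ``FP8_DEFAULT_CFG`` choice, and the calibration loop are illustrative assumptions, not part of this commit:

```python
# Sketch only: quantize with a predefined config, then compress the quantized weights.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
)

def forward_loop(m):
    # Calibration: push a few representative batches through the model.
    for _ in range(8):
        m(torch.randn(4, 64))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
mtq.compress(model)  # new 0.27 API: compress weights after quantization
```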

README.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
 
 ## Latest News
 
+- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
 - [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 - [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
 - [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).

examples/diffusers/quantization/calib/plugin_calib.py

Lines changed: 1 addition & 7 deletions
@@ -38,13 +38,7 @@ def collect(self, x):
             RuntimeError: If amax shape changes
         """
         # Swap axis to reduce.
-        axis = self._axis if isinstance(self._axis, (list, tuple)) else [self._axis]
-        # Handle negative axis.
-        axis = [x.dim() + i if isinstance(i, int) and i < 0 else i for i in axis]
-        reduce_axis = []
-        for i in range(x.dim()):
-            if i not in axis:
-                reduce_axis.append(i)
+        reduce_axis = quant_utils.convert_quantization_axis_to_reduce_axis(x, self._axis)
         local_amax = quant_utils.reduce_amax(x, axis=reduce_axis).detach()
         _cur_step = self.i % self.total_step
         if _cur_step not in self.data.keys():
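
The refactor above replaces the hand-rolled axis handling with ``quant_utils.convert_quantization_axis_to_reduce_axis``. A standalone sketch of what that helper is expected to do, reconstructed from the deleted lines (the function below is illustrative, not the library implementation):

```python
import torch

def convert_quantization_axis_to_reduce_axis(x: torch.Tensor, axis):
    """Illustrative re-implementation of the logic the removed lines performed."""
    # Normalize to a list and resolve negative axes against the tensor rank.
    axes = axis if isinstance(axis, (list, tuple)) else [axis]
    axes = [x.dim() + a if isinstance(a, int) and a < 0 else a for a in axes]
    # The reduce axes are every dimension that is NOT a quantization axis.
    return [d for d in range(x.dim()) if d not in axes]

x = torch.randn(2, 3, 4)
print(convert_quantization_axis_to_reduce_axis(x, -1))  # [0, 1]
```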

examples/llm_ptq/README.md

Lines changed: 13 additions & 0 deletions
@@ -56,6 +56,16 @@ scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_
 
 > *Calibration by default uses left padding_side for the Huggingface tokenizer as it usually leads to lower accuracy loss. The exported tokenizer files restores the default padding_side.*
 
+#### Llama 4
+
+We support FP8 and NVFP4 quantized Llama 4 model Hugging Face checkpoint export using the following command:
+
+```bash
+python hf_ptq.py --pyt_ckpt_path=<llama4 model path> --export_path=<quantized hf checkpoint> --qformat=[fp8|nvfp4] --export_fmt=hf
+```
+
+The quantized checkpoint can be deployed following the TensorRT-LLM instructions.
+
 #### For NeMo models like [nemotron](https://huggingface.co/nvidia/nemotron-3-8b-base-4k):
 
 NeMo PTQ requires the NeMo package installed. It's recommended to start from the NeMo containers like `nvcr.io/nvidia/nemo:24.07` or latest `nvcr.io/nvidia/nemo:dev` directly.
@@ -91,6 +101,7 @@ Model | fp8 | int8_sq | int4_awq | w4a8_awq<sup>1</sup> | nvfp4<sup>5</sup> |
 GPTJ | Yes | Yes | Yes | Yes | -
 LLAMA 2 | Yes | Yes | Yes | Yes | -
 LLAMA 3, 3.1, 3.3 | Yes | No | Yes | Yes<sup>3</sup> | Yes
+LLAMA 4 | Yes | No | No | No | Yes
 LLAMA 2 (Nemo) | Yes | Yes | Yes | Yes | -
 CodeLlama | Yes | Yes | Yes | No | -
 Mistral | Yes | Yes | Yes | No | Yes
@@ -110,6 +121,8 @@ Gemma 2 9B, 27B | Yes<sup>2</sup> | No | Yes | No | -
 RecurrentGemma 2B | Yes | Yes | Yes | No | -
 StarCoder 2 | Yes | Yes | Yes | No | -
 QWen 2, 2.5 <sup>4</sup> | Yes | Yes | Yes | Yes | Yes
+QWen MOE | Yes | - | - | - | Yes
+QwQ | Yes | - | - | - | Yes
 DBRX | Yes | No | No | No | -
 InternLM2 | Yes | No | Yes | Yes<sup>3</sup> | -
 Exaone | Yes | Yes | Yes | Yes | -
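
The Llama 4 section added above notes that the quantized checkpoint can be deployed following the TensorRT-LLM instructions. A hypothetical deployment sketch assuming the TensorRT-LLM Python ``LLM`` API; the checkpoint path is the ``--export_path`` placeholder from the command above, and exact class or argument names may differ across TensorRT-LLM releases:

```python
# Hypothetical sketch; assumes the TensorRT-LLM Python LLM API is available.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="<quantized hf checkpoint>", tensor_parallel_size=8)
params = SamplingParams(max_tokens=64, temperature=0.0)

for output in llm.generate(["What is the capital of France?"], params):
    print(output.outputs[0].text)
```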

examples/llm_ptq/example_utils.py

Lines changed: 24 additions & 1 deletion
@@ -21,6 +21,15 @@
 
 from modelopt.torch.utils.image_processor import MllamaImageProcessor
 
+SPECULATIVE_MODEL_LIST = ["Eagle", "Medusa"]
+
+
+def is_speculative(hf_config):
+    for name in SPECULATIVE_MODEL_LIST:
+        if name in hf_config.architectures[0]:
+            return True
+    return False
+
 
 def get_mode_type_from_engine_dir(engine_dir_str):
     # Split the path by '/' and get the last part
@@ -134,7 +143,14 @@ def get_model(ckpt_path, device="cuda", gpu_mem_percentage=0.8, trust_remote_cod
     else:
         hf_config = AutoConfig.from_pretrained(ckpt_path, trust_remote_code=trust_remote_code)
 
-        if hf_config.model_type == "llava":
+        if is_speculative(hf_config):
+            model = AutoModelForCausalLM.from_pretrained(
+                ckpt_path,
+                device_map=device_map,
+                **model_kwargs,
+                trust_remote_code=trust_remote_code,
+            )
+        elif hf_config.model_type == "llava":
             from transformers import LlavaForConditionalGeneration
 
             hf_llava = LlavaForConditionalGeneration.from_pretrained(
@@ -175,6 +191,13 @@ def get_model(ckpt_path, device="cuda", gpu_mem_percentage=0.8, trust_remote_cod
                 **model_kwargs,
                 trust_remote_code=trust_remote_code,
            )
+        elif hf_config.model_type == "llama4":
+            model = AutoModelForCausalLM.from_pretrained(
+                ckpt_path,
+                device_map=device_map,
+                **model_kwargs,
+                trust_remote_code=trust_remote_code,
+            )
 
         else:
             from accelerate import infer_auto_device_map, init_empty_weights
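
For orientation, a quick usage sketch of the new ``is_speculative`` helper added above; the checkpoint path is a placeholder and the architecture names in the comment are examples only:

```python
# Illustrative use of the helper added to example_utils.py in this commit.
from transformers import AutoConfig
from example_utils import is_speculative

hf_config = AutoConfig.from_pretrained("<checkpoint path>", trust_remote_code=True)
if is_speculative(hf_config):
    # e.g. architectures such as ["EagleForCausalLM"] or ["MedusaModel"] match the list
    print("Speculative-decoding checkpoint: load via AutoModelForCausalLM")
else:
    print(f"Regular checkpoint of type {hf_config.model_type}")
```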

examples/llm_ptq/hf_ptq.py

Lines changed: 60 additions & 59 deletions
@@ -21,6 +21,7 @@
 
 import numpy as np
 import torch
+from accelerate.hooks import remove_hook_from_module
 from example_utils import get_model, get_processor, get_tokenizer, is_enc_dec, is_model_on_gpu
 from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast, WhisperProcessor
 
@@ -97,8 +98,9 @@ def auto_quantize(
         verbose=True,
         disabled_layers=["*lm_head*"],
     )
+
     # We need to explicitly calibrate for kv cache quantization
-    enable_quant_kv_cache = args.kv_cache_qformat not in ["", "none"]
+    enable_quant_kv_cache = args.kv_cache_qformat != "none"
     print(f"{'Enable' if enable_quant_kv_cache else 'Disable'} KV cache quantization")
     if enable_quant_kv_cache:
         kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]
@@ -262,8 +264,7 @@ def main(args):
         )
         mts.export(model)
 
-    enable_quant_kv_cache = args.kv_cache_qformat not in ["", "none"]
-    if args.qformat or enable_quant_kv_cache:
+    if args.auto_quantize_bits or args.qformat in QUANT_CFG_CHOICES:
         # If any qformat provided is not fp8, assert model is on GPU
         if args.qformat not in ["fp8", "nvfp4"]:
             assert is_model_on_gpu(model), (
@@ -348,12 +349,11 @@
 
         quant_cfg = {}
         if not args.auto_quantize_bits:
-            assert args.qformat in QUANT_CFG_CHOICES or enable_quant_kv_cache, (
+            assert args.qformat in QUANT_CFG_CHOICES, (
                 f"Unsupported quantization format: {args.qformat} with {args.kv_cache_qformat} KV cache"
             )
 
-            if args.qformat in QUANT_CFG_CHOICES:
-                quant_cfg = getattr(mtq, QUANT_CFG_CHOICES[args.qformat])
+            quant_cfg = getattr(mtq, QUANT_CFG_CHOICES[args.qformat])
 
             if "awq" in args.qformat:
                 quant_cfg = copy.deepcopy(getattr(mtq, QUANT_CFG_CHOICES[args.qformat]))
@@ -368,6 +368,7 @@
             if "w4a8_awq" == args.qformat and model_type in ["gemma", "mpt"]:
                 quant_cfg["algorithm"] = {"method": "awq_lite", "alpha_step": 1}
 
+        enable_quant_kv_cache = args.kv_cache_qformat != "none"
         print(f"{'Enable' if enable_quant_kv_cache else 'Disable'} KV cache quantization")
 
         # Check if any bmm_quantizer is in the quant_cfg. If so, we need to enable the bmm_quantizer.
@@ -391,18 +392,24 @@
             input_ids = next(iter(calib_dataloader))[
                 "input_features" if model_type == "whisper" else "input_ids"
             ][0:1]
-            with torch.autocast("cuda"):
-                generated_ids_before_ptq = model.generate(input_ids, max_new_tokens=100)
+            generated_ids_before_ptq = model.generate(input_ids, max_new_tokens=100)
 
-            model = quantize_model(model, quant_cfg, args, calib_dataloader)
-            if args.compress:
-                mtq.compress(model)
+        model = quantize_model(model, quant_cfg, args, calib_dataloader)
+        if args.compress:
+            mtq.compress(model)
         # Lets print the quantization summary
-            if args.verbose:
-                mtq.print_quant_summary(model)
+        if args.verbose:
+            mtq.print_quant_summary(model)
 
-            # Run some samples
+        # Run some samples
+        torch.cuda.empty_cache()
+        generated_ids_after_ptq = None
+        if model_type != "llama4":
             generated_ids_after_ptq = model.generate(input_ids, max_new_tokens=100)
+        else:
+            warnings.warn(
+                "Llama4 Maverick generation after quantization has a bug. Skipping generation sample."
+            )
 
         def input_decode(input_ids):
             if processor is not None and isinstance(processor, MllamaImageProcessor):
@@ -429,20 +436,21 @@ def output_decode(generated_ids, input_shape):
         else:
             raise ValueError("The processor or tokenizer must be set")
 
-        print("--------")
-        print(f"example test input: {input_decode(input_ids)}")
-        print("--------")
-        print(
-            f"example outputs before ptq: {output_decode(generated_ids_before_ptq, input_ids.shape[1])}"
-        )
-        print("--------")
-        print(
-            f"example outputs after ptq: {output_decode(generated_ids_after_ptq, input_ids.shape[1])}"
-        )
+        if generated_ids_after_ptq is not None:
+            print("--------")
+            print(f"example test input: {input_decode(input_ids)}")
+            print("--------")
+            print(
+                f"example outputs before ptq: {output_decode(generated_ids_before_ptq, input_ids.shape[1])}"
+            )
+            print("--------")
+            print(
+                f"example outputs after ptq: {output_decode(generated_ids_after_ptq, input_ids.shape[1])}"
+            )
 
     else:
         assert model_type != "dbrx", f"Does not support export {model_type} without quantizaton"
-        print(f"No quantization applied, export {device} model")
+        print(f"qformat: {args.qformat}. No quantization applied, export {device} model")
 
     with torch.inference_mode():
         if model_type is None:
@@ -459,38 +467,31 @@
             setattr(model.config, "text_config", full_model_config.text_config)
             setattr(model.config, "architectures", full_model_config.architectures)
 
-        with torch.autocast("cuda"):
-            start_time = time.time()
-            if args.export_fmt == "tensorrt_llm":
-                # Move meta tensor back to device before exporting.
-                try:
-                    from accelerate.hooks import remove_hook_from_module
-
-                    remove_hook_from_module(model, recurse=True)
-                except ImportError:
-                    warnings.warn("accelerate is not installed, hooks will not be removed")
-                    pass
-
-                dtype = None
-                if "w4a8_awq" in args.qformat:
-                    # TensorRT-LLM w4a8 only support fp16 as the dtype.
-                    dtype = torch.float16
-
-                export_tensorrt_llm_checkpoint(
-                    model,
-                    model_type,
-                    dtype=dtype,
-                    export_dir=export_path,
-                    inference_tensor_parallel=args.inference_tensor_parallel,
-                    inference_pipeline_parallel=args.inference_pipeline_parallel,
-                )
-            elif args.export_fmt == "hf":
-                export_hf_checkpoint(
-                    model,
-                    export_dir=export_path,
-                )
-            else:
-                raise NotImplementedError(f"{args.export_fmt} not supported")
+        start_time = time.time()
+        if args.export_fmt == "tensorrt_llm":
+            # Move meta tensor back to device before exporting.
+            remove_hook_from_module(model, recurse=True)
+
+            dtype = None
+            if "w4a8_awq" in args.qformat:
+                # TensorRT-LLM w4a8 only support fp16 as the dtype.
+                dtype = torch.float16
+
+            export_tensorrt_llm_checkpoint(
+                model,
+                model_type,
+                dtype=dtype,
+                export_dir=export_path,
+                inference_tensor_parallel=args.inference_tensor_parallel,
+                inference_pipeline_parallel=args.inference_pipeline_parallel,
+            )
+        elif args.export_fmt == "hf":
+            export_hf_checkpoint(
+                model,
+                export_dir=export_path,
+            )
+        else:
+            raise NotImplementedError(f"{args.export_fmt} not supported")
 
         # Restore default padding and export the tokenizer as well.
         if tokenizer is not None:
@@ -552,8 +553,8 @@ def output_decode(generated_ids, input_shape):
         "--kv_cache_qformat",
         required=False,
         default="fp8",
-        choices=["fp8", "nvfp4", "", "none"],
-        help="Specify KV cache quantization format",
+        choices=["fp8", "nvfp4", "none"],
+        help="Specify KV cache quantization format, default to fp8 if not provided",
     )
     parser.add_argument(
         "--vlm",

examples/llm_ptq/scripts/huggingface_example.sh

Lines changed: 5 additions & 2 deletions
@@ -142,6 +142,10 @@ if [ -n "$AUTO_QUANTIZE_BITS" ]; then
     PTQ_ARGS+=" --auto_quantize_bits=$AUTO_QUANTIZE_BITS "
 fi
 
+if [ -n "$KV_CACHE_QUANT" ]; then
+    PTQ_ARGS+=" --kv_cache_qformat=$KV_CACHE_QUANT "
+fi
+
 if $TRUST_REMOTE_CODE; then
     PTQ_ARGS+=" --trust_remote_code "
 fi
@@ -163,7 +167,7 @@ fi
 
 if [[ $TASKS =~ "build" ]] || [[ ! -d "$ENGINE_DIR" ]] || [[ ! $(ls -A $ENGINE_DIR) ]]; then
 
-    if [ "$EXPORT_FORMAT" == "hf" ] && ([ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ] && ["$KV_CACHE_QUANT" == ""]); then
+    if [ "$EXPORT_FORMAT" == "hf" ] && ([ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]); then
         if [ -d "$MODEL_PATH" ]; then
             MODEL_CONFIG_EXIST=true
             MODEL_CONFIG=$MODEL_PATH/config.json
@@ -187,7 +191,6 @@ if [[ $TASKS =~ "build" ]] || [[ ! -d "$ENGINE_DIR" ]] || [[ ! $(ls -A $ENGINE_D
             --inference_tensor_parallel=$TP \
             --inference_pipeline_parallel=$PP \
             --export_fmt=$EXPORT_FORMAT \
-            --kv_cache_qformat=$KV_CACHE_QUANT \
             $PTQ_ARGS \
             $AWQ_ARGS
     else

examples/llm_ptq/scripts/parser.sh

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ parse_options() {
     MODEL_TYPE=""
     MODEL_PATH=""
     QFORMAT=""
-    KV_CACHE_QUANT="fp8"
+    KV_CACHE_QUANT=""
     TP=1
     CALIB_TP=
     PP=1

examples/speculative_decoding/launch.sh

Lines changed: 0 additions & 10 deletions
@@ -63,14 +63,6 @@ while [ $# -gt 0 ]; do
         if [[ "$1" != *=* ]]; then shift; fi
         EAGLE_NUM_LAYERS="${1#*=}"
         ;;
-      --redrafter_predict_n_tokens*)
-        if [[ "$1" != *=* ]]; then shift; fi
-        REDRAFTER_TOKENS="${1#*=}"
-        ;;
-      --redrafter_num_layers*)
-        if [[ "$1" != *=* ]]; then shift; fi
-        REDRAFTER_NUM_LAYERS="${1#*=}"
-        ;;
       --fsdp_transformer_layer_cls_to_wrap*)
         if [[ "$1" != *=* ]]; then shift; fi
         FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP="${1#*=}"
@@ -118,8 +110,6 @@ if [[ "$MODE" == "medusa" ]]; then
     SPECULATIVE_ARGS="--medusa_num_heads $MEDUSA_NUM_HEADS --medusa_num_layers $MEDUSA_NUM_LAYERS"
 elif [[ "$MODE" == "eagle" ]]; then
     SPECULATIVE_ARGS="--eagle_num_layers $EAGLE_NUM_LAYERS"
-elif [[ "$MODE" == "redrafter" ]]; then
-    SPECULATIVE_ARGS="--redrafter_predict_n_tokens $REDRAFTER_TOKENS --redrafter_num_layers $REDRAFTER_NUM_LAYERS"
 else
     echo "Only medusa and eagle supported for now!"
     exit 1
