Commit 9cb36cb

Update files for 0.27.1 release
1 parent 54f4e3c commit 9cb36cb

37 files changed, +571 −583 lines changed

CHANGELOG.rst

Lines changed: 3 additions & 1 deletion
@@ -10,14 +10,16 @@ Model Optimizer Changelog (Linux)
 
 **New Features**
 
-- New model support in the ``llm_ptq`` example: OpenAI Whisper.
+- New model support in the ``llm_ptq`` example: OpenAI Whisper. Experimental support: Llama4, QwQ, Qwen MOE.
 - Blockwise FP8 quantization support in unified model export.
 - Add quantization support to the Transformer Engine Linear module.
 - Add support for SVDQuant. Currently, only simulation is available; real deployment (for example, TensorRT deployment) support is coming soon.
 - To support distributed checkpoint resume expert-parallel (EP), ``modelopt_state`` in Megatron Core distributed checkpoint (used in NeMo and Megatron-LM) is stored differently. The legacy ``modelopt_state`` in the distributed checkpoint generated by previous modelopt version can still be loaded in 0.27 and 0.29 but will need to be stored in the new format.
 - Add triton-based NVFP4 quantization kernel that delivers approximately 40% performance improvement over the previous implementation.
 - Add a new API :meth:`mtq.compress <modelopt.torch.quantization.compress>` for model compression for weights after quantization.
 - Add option to simplify ONNX model before quantization is performed.
+- Add FP4 KV cache support for unified HF and TensorRT-LLM export.
+- Add speculative decoding support to Multi-Token Prediction (MTP) in Megatron Core models.
 - (Experimental) Improve support for ONNX models with custom TensorRT op:
   - Add support for ``--calibration_shapes`` flag.
   - Add automatic type and shape tensor propagation for full ORT support with TensorRT EP.
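
The ``mtq.compress`` API listed in the changelog is the same call ``hf_ptq.py`` gains further down in this commit (``mtq.compress(model)`` after quantization). Below is a minimal sketch of the intended quantize-then-compress flow; the toy model, the ``FP8_DEFAULT_CFG`` choice, and the calibration loop are illustrative assumptions, not part of this commit:

```python
# Sketch only: quantize with a predefined config, then compress the quantized weights.
import torch
import modelopt.torch.quantization as mtq

model = torch.nn.Sequential(
    torch.nn.Linear(64, 64), torch.nn.ReLU(), torch.nn.Linear(64, 8)
)

def forward_loop(m):
    # Calibration: push a few representative batches through the model.
    for _ in range(8):
        m(torch.randn(4, 64))

model = mtq.quantize(model, mtq.FP8_DEFAULT_CFG, forward_loop)
mtq.compress(model)  # new 0.27 API: compress weights after quantization
```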

README.md

Lines changed: 1 addition & 0 deletions
@@ -18,6 +18,7 @@
 
 ## Latest News
 
+- [2025/04/05] [NVIDIA Accelerates Inference on Meta Llama 4 Scout and Maverick](https://developer.nvidia.com/blog/nvidia-accelerates-inference-on-meta-llama-4-scout-and-maverick/). Check out how to quantize Llama4 for deployment acceleration [here](./examples/llm_ptq/README.md#llama-4)
 - [2025/03/18] [World's Fastest DeepSeek-R1 Inference with Blackwell FP4 & Increasing Image Generation Efficiency on Blackwell](https://developer.nvidia.com/blog/nvidia-blackwell-delivers-world-record-deepseek-r1-inference-performance/)
 - [2025/02/25] Model Optimizer quantized NVFP4 models available on Hugging Face for download: [DeepSeek-R1-FP4](https://huggingface.co/nvidia/DeepSeek-R1-FP4), [Llama-3.3-70B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.3-70B-Instruct-FP4), [Llama-3.1-405B-Instruct-FP4](https://huggingface.co/nvidia/Llama-3.1-405B-Instruct-FP4)
 - [2025/01/28] Model Optimizer has added support for NVFP4. Check out an example of NVFP4 PTQ [here](./examples/llm_ptq/README.md#model-quantization-and-trt-llm-conversion).

examples/diffusers/quantization/calib/plugin_calib.py

Lines changed: 1 addition & 7 deletions
@@ -38,13 +38,7 @@ def collect(self, x):
             RuntimeError: If amax shape changes
         """
         # Swap axis to reduce.
-        axis = self._axis if isinstance(self._axis, (list, tuple)) else [self._axis]
-        # Handle negative axis.
-        axis = [x.dim() + i if isinstance(i, int) and i < 0 else i for i in axis]
-        reduce_axis = []
-        for i in range(x.dim()):
-            if i not in axis:
-                reduce_axis.append(i)
+        reduce_axis = quant_utils.convert_quantization_axis_to_reduce_axis(x, self._axis)
         local_amax = quant_utils.reduce_amax(x, axis=reduce_axis).detach()
         _cur_step = self.i % self.total_step
         if _cur_step not in self.data.keys():
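
The refactor above replaces the hand-rolled axis handling with ``quant_utils.convert_quantization_axis_to_reduce_axis``. A standalone sketch of what that helper is expected to do, reconstructed from the deleted lines (the function below is illustrative, not the library implementation):

```python
import torch

def convert_quantization_axis_to_reduce_axis(x: torch.Tensor, axis):
    """Illustrative re-implementation of the logic the removed lines performed."""
    # Normalize to a list and resolve negative axes against the tensor rank.
    axes = axis if isinstance(axis, (list, tuple)) else [axis]
    axes = [x.dim() + a if isinstance(a, int) and a < 0 else a for a in axes]
    # The reduce axes are every dimension that is NOT a quantization axis.
    return [d for d in range(x.dim()) if d not in axes]

x = torch.randn(2, 3, 4)
print(convert_quantization_axis_to_reduce_axis(x, -1))  # [0, 1]
```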

examples/llm_ptq/README.md

Lines changed: 13 additions & 0 deletions
@@ -56,6 +56,16 @@ scripts/huggingface_example.sh --model $HF_PATH --quant [fp8|nvfp4|int8_sq|int4_
 
 > *Calibration by default uses left padding_side for the Huggingface tokenizer as it usually leads to lower accuracy loss. The exported tokenizer files restores the default padding_side.*
 
+#### Llama 4
+
+We support FP8 and NVFP4 quantized Llama 4 model Hugging Face checkpoint export using the following command:
+
+```bash
+python hf_ptq.py --pyt_ckpt_path=<llama4 model path> --export_path=<quantized hf checkpoint> --qformat=[fp8|nvfp4] --export_fmt=hf
+```
+
+The quantized checkpoint can be deployed following the TensorRT-LLM instructions.
+
 #### For NeMo models like [nemotron](https://huggingface.co/nvidia/nemotron-3-8b-base-4k):
 
 NeMo PTQ requires the NeMo package installed. It's recommended to start from the NeMo containers like `nvcr.io/nvidia/nemo:24.07` or latest `nvcr.io/nvidia/nemo:dev` directly.
@@ -91,6 +101,7 @@ Model | fp8 | int8_sq | int4_awq | w4a8_awq<sup>1</sup> | nvfp4<sup>5</sup> |
 GPTJ | Yes | Yes | Yes | Yes | -
 LLAMA 2 | Yes | Yes | Yes | Yes | -
 LLAMA 3, 3.1, 3.3 | Yes | No | Yes | Yes<sup>3</sup> | Yes
+LLAMA 4 | Yes | No | No | No | Yes
 LLAMA 2 (Nemo) | Yes | Yes | Yes | Yes | -
 CodeLlama | Yes | Yes | Yes | No | -
 Mistral | Yes | Yes | Yes | No | Yes
@@ -110,6 +121,8 @@ Gemma 2 9B, 27B | Yes<sup>2</sup> | No | Yes | No | -
 RecurrentGemma 2B | Yes | Yes | Yes | No | -
 StarCoder 2 | Yes | Yes | Yes | No | -
 QWen 2, 2.5 <sup>4</sup> | Yes | Yes | Yes | Yes | Yes
+QWen MOE | Yes | - | - | - | Yes
+QwQ | Yes | - | - | - | Yes
 DBRX | Yes | No | No | No | -
 InternLM2 | Yes | No | Yes | Yes<sup>3</sup> | -
 Exaone | Yes | Yes | Yes | Yes | -
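
The Llama 4 section added above notes that the quantized checkpoint can be deployed following the TensorRT-LLM instructions. A hypothetical deployment sketch assuming the TensorRT-LLM Python ``LLM`` API; the checkpoint path is the ``--export_path`` placeholder from the command above, and exact class or argument names may differ across TensorRT-LLM releases:

```python
# Hypothetical sketch; assumes the TensorRT-LLM Python LLM API is available.
from tensorrt_llm import LLM, SamplingParams

llm = LLM(model="<quantized hf checkpoint>", tensor_parallel_size=8)
params = SamplingParams(max_tokens=64, temperature=0.0)

for output in llm.generate(["What is the capital of France?"], params):
    print(output.outputs[0].text)
```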

examples/llm_ptq/example_utils.py

Lines changed: 24 additions & 1 deletion
@@ -21,6 +21,15 @@
 
 from modelopt.torch.utils.image_processor import MllamaImageProcessor
 
+SPECULATIVE_MODEL_LIST = ["Eagle", "Medusa"]
+
+
+def is_speculative(hf_config):
+    for name in SPECULATIVE_MODEL_LIST:
+        if name in hf_config.architectures[0]:
+            return True
+    return False
+
 
 def get_mode_type_from_engine_dir(engine_dir_str):
     # Split the path by '/' and get the last part
@@ -134,7 +143,14 @@ def get_model(ckpt_path, device="cuda", gpu_mem_percentage=0.8, trust_remote_cod
     else:
         hf_config = AutoConfig.from_pretrained(ckpt_path, trust_remote_code=trust_remote_code)
 
-        if hf_config.model_type == "llava":
+        if is_speculative(hf_config):
+            model = AutoModelForCausalLM.from_pretrained(
+                ckpt_path,
+                device_map=device_map,
+                **model_kwargs,
+                trust_remote_code=trust_remote_code,
+            )
+        elif hf_config.model_type == "llava":
             from transformers import LlavaForConditionalGeneration
 
             hf_llava = LlavaForConditionalGeneration.from_pretrained(
@@ -175,6 +191,13 @@ def get_model(ckpt_path, device="cuda", gpu_mem_percentage=0.8, trust_remote_cod
                 **model_kwargs,
                 trust_remote_code=trust_remote_code,
            )
+        elif hf_config.model_type == "llama4":
+            model = AutoModelForCausalLM.from_pretrained(
+                ckpt_path,
+                device_map=device_map,
+                **model_kwargs,
+                trust_remote_code=trust_remote_code,
+            )
 
         else:
             from accelerate import infer_auto_device_map, init_empty_weights
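
For orientation, a quick usage sketch of the new ``is_speculative`` helper added above; the checkpoint path is a placeholder and the architecture names in the comment are examples only:

```python
# Illustrative use of the helper added to example_utils.py in this commit.
from transformers import AutoConfig
from example_utils import is_speculative

hf_config = AutoConfig.from_pretrained("<checkpoint path>", trust_remote_code=True)
if is_speculative(hf_config):
    # e.g. architectures such as ["EagleForCausalLM"] or ["MedusaModel"] match the list
    print("Speculative-decoding checkpoint: load via AutoModelForCausalLM")
else:
    print(f"Regular checkpoint of type {hf_config.model_type}")
```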

examples/llm_ptq/hf_ptq.py

Lines changed: 60 additions & 59 deletions
@@ -21,6 +21,7 @@
 
 import numpy as np
 import torch
+from accelerate.hooks import remove_hook_from_module
 from example_utils import get_model, get_processor, get_tokenizer, is_enc_dec, is_model_on_gpu
 from transformers import PreTrainedTokenizer, PreTrainedTokenizerFast, WhisperProcessor
 
@@ -97,8 +98,9 @@ def auto_quantize(
         verbose=True,
         disabled_layers=["*lm_head*"],
     )
+
     # We need to explicitly calibrate for kv cache quantization
-    enable_quant_kv_cache = args.kv_cache_qformat not in ["", "none"]
+    enable_quant_kv_cache = args.kv_cache_qformat != "none"
     print(f"{'Enable' if enable_quant_kv_cache else 'Disable'} KV cache quantization")
     if enable_quant_kv_cache:
         kv_cache_quant_cfg = getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"]
@@ -262,8 +264,7 @@ def main(args):
         )
         mts.export(model)
 
-    enable_quant_kv_cache = args.kv_cache_qformat not in ["", "none"]
-    if args.qformat or enable_quant_kv_cache:
+    if args.auto_quantize_bits or args.qformat in QUANT_CFG_CHOICES:
         # If any qformat provided is not fp8, assert model is on GPU
         if args.qformat not in ["fp8", "nvfp4"]:
             assert is_model_on_gpu(model), (
@@ -348,12 +349,11 @@
 
         quant_cfg = {}
         if not args.auto_quantize_bits:
-            assert args.qformat in QUANT_CFG_CHOICES or enable_quant_kv_cache, (
+            assert args.qformat in QUANT_CFG_CHOICES, (
                 f"Unsupported quantization format: {args.qformat} with {args.kv_cache_qformat} KV cache"
             )
 
-            if args.qformat in QUANT_CFG_CHOICES:
-                quant_cfg = getattr(mtq, QUANT_CFG_CHOICES[args.qformat])
+            quant_cfg = getattr(mtq, QUANT_CFG_CHOICES[args.qformat])
 
             if "awq" in args.qformat:
                 quant_cfg = copy.deepcopy(getattr(mtq, QUANT_CFG_CHOICES[args.qformat]))
@@ -368,6 +368,7 @@
             if "w4a8_awq" == args.qformat and model_type in ["gemma", "mpt"]:
                 quant_cfg["algorithm"] = {"method": "awq_lite", "alpha_step": 1}
 
+        enable_quant_kv_cache = args.kv_cache_qformat != "none"
         print(f"{'Enable' if enable_quant_kv_cache else 'Disable'} KV cache quantization")
 
         # Check if any bmm_quantizer is in the quant_cfg. If so, we need to enable the bmm_quantizer.
@@ -391,18 +392,24 @@
             input_ids = next(iter(calib_dataloader))[
                 "input_features" if model_type == "whisper" else "input_ids"
             ][0:1]
-            with torch.autocast("cuda"):
-                generated_ids_before_ptq = model.generate(input_ids, max_new_tokens=100)
+            generated_ids_before_ptq = model.generate(input_ids, max_new_tokens=100)
 
-            model = quantize_model(model, quant_cfg, args, calib_dataloader)
-            if args.compress:
-                mtq.compress(model)
+        model = quantize_model(model, quant_cfg, args, calib_dataloader)
+        if args.compress:
+            mtq.compress(model)
         # Lets print the quantization summary
-            if args.verbose:
-                mtq.print_quant_summary(model)
+        if args.verbose:
+            mtq.print_quant_summary(model)
 
-            # Run some samples
+        # Run some samples
+        torch.cuda.empty_cache()
+        generated_ids_after_ptq = None
+        if model_type != "llama4":
             generated_ids_after_ptq = model.generate(input_ids, max_new_tokens=100)
+        else:
+            warnings.warn(
+                "Llama4 Maverick generation after quantization has a bug. Skipping generation sample."
+            )
 
         def input_decode(input_ids):
             if processor is not None and isinstance(processor, MllamaImageProcessor):
@@ -429,20 +436,21 @@ def output_decode(generated_ids, input_shape):
         else:
             raise ValueError("The processor or tokenizer must be set")
 
-        print("--------")
-        print(f"example test input: {input_decode(input_ids)}")
-        print("--------")
-        print(
-            f"example outputs before ptq: {output_decode(generated_ids_before_ptq, input_ids.shape[1])}"
-        )
-        print("--------")
-        print(
-            f"example outputs after ptq: {output_decode(generated_ids_after_ptq, input_ids.shape[1])}"
-        )
+        if generated_ids_after_ptq is not None:
+            print("--------")
+            print(f"example test input: {input_decode(input_ids)}")
+            print("--------")
+            print(
+                f"example outputs before ptq: {output_decode(generated_ids_before_ptq, input_ids.shape[1])}"
+            )
+            print("--------")
+            print(
+                f"example outputs after ptq: {output_decode(generated_ids_after_ptq, input_ids.shape[1])}"
+            )
 
     else:
         assert model_type != "dbrx", f"Does not support export {model_type} without quantizaton"
-        print(f"No quantization applied, export {device} model")
+        print(f"qformat: {args.qformat}. No quantization applied, export {device} model")
 
     with torch.inference_mode():
         if model_type is None:
@@ -459,38 +467,31 @@
             setattr(model.config, "text_config", full_model_config.text_config)
             setattr(model.config, "architectures", full_model_config.architectures)
 
-        with torch.autocast("cuda"):
-            start_time = time.time()
-            if args.export_fmt == "tensorrt_llm":
-                # Move meta tensor back to device before exporting.
-                try:
-                    from accelerate.hooks import remove_hook_from_module
-
-                    remove_hook_from_module(model, recurse=True)
-                except ImportError:
-                    warnings.warn("accelerate is not installed, hooks will not be removed")
-                    pass
-
-                dtype = None
-                if "w4a8_awq" in args.qformat:
-                    # TensorRT-LLM w4a8 only support fp16 as the dtype.
-                    dtype = torch.float16
-
-                export_tensorrt_llm_checkpoint(
-                    model,
-                    model_type,
-                    dtype=dtype,
-                    export_dir=export_path,
-                    inference_tensor_parallel=args.inference_tensor_parallel,
-                    inference_pipeline_parallel=args.inference_pipeline_parallel,
-                )
-            elif args.export_fmt == "hf":
-                export_hf_checkpoint(
-                    model,
-                    export_dir=export_path,
-                )
-            else:
-                raise NotImplementedError(f"{args.export_fmt} not supported")
+        start_time = time.time()
+        if args.export_fmt == "tensorrt_llm":
+            # Move meta tensor back to device before exporting.
+            remove_hook_from_module(model, recurse=True)
+
+            dtype = None
+            if "w4a8_awq" in args.qformat:
+                # TensorRT-LLM w4a8 only support fp16 as the dtype.
+                dtype = torch.float16
+
+            export_tensorrt_llm_checkpoint(
+                model,
+                model_type,
+                dtype=dtype,
+                export_dir=export_path,
+                inference_tensor_parallel=args.inference_tensor_parallel,
+                inference_pipeline_parallel=args.inference_pipeline_parallel,
+            )
+        elif args.export_fmt == "hf":
+            export_hf_checkpoint(
+                model,
+                export_dir=export_path,
+            )
+        else:
+            raise NotImplementedError(f"{args.export_fmt} not supported")
 
         # Restore default padding and export the tokenizer as well.
         if tokenizer is not None:
@@ -552,8 +553,8 @@ def output_decode(generated_ids, input_shape):
         "--kv_cache_qformat",
         required=False,
         default="fp8",
-        choices=["fp8", "nvfp4", "", "none"],
-        help="Specify KV cache quantization format",
+        choices=["fp8", "nvfp4", "none"],
+        help="Specify KV cache quantization format, default to fp8 if not provided",
     )
     parser.add_argument(
         "--vlm",

examples/llm_ptq/scripts/huggingface_example.sh

Lines changed: 5 additions & 2 deletions
@@ -142,6 +142,10 @@ if [ -n "$AUTO_QUANTIZE_BITS" ]; then
     PTQ_ARGS+=" --auto_quantize_bits=$AUTO_QUANTIZE_BITS "
 fi
 
+if [ -n "$KV_CACHE_QUANT" ]; then
+    PTQ_ARGS+=" --kv_cache_qformat=$KV_CACHE_QUANT "
+fi
+
 if $TRUST_REMOTE_CODE; then
     PTQ_ARGS+=" --trust_remote_code "
 fi
@@ -163,7 +167,7 @@ fi
 
 if [[ $TASKS =~ "build" ]] || [[ ! -d "$ENGINE_DIR" ]] || [[ ! $(ls -A $ENGINE_DIR) ]]; then
 
-    if [ "$EXPORT_FORMAT" == "hf" ] && ([ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ] && ["$KV_CACHE_QUANT" == ""]); then
+    if [ "$EXPORT_FORMAT" == "hf" ] && ([ "$qformat" == "bf16" ] || [ "$qformat" == "fp16" ]); then
         if [ -d "$MODEL_PATH" ]; then
             MODEL_CONFIG_EXIST=true
             MODEL_CONFIG=$MODEL_PATH/config.json
@@ -187,7 +191,6 @@ if [[ $TASKS =~ "build" ]] || [[ ! -d "$ENGINE_DIR" ]] || [[ ! $(ls -A $ENGINE_D
             --inference_tensor_parallel=$TP \
             --inference_pipeline_parallel=$PP \
             --export_fmt=$EXPORT_FORMAT \
-            --kv_cache_qformat=$KV_CACHE_QUANT \
             $PTQ_ARGS \
             $AWQ_ARGS
     else

examples/llm_ptq/scripts/parser.sh

Lines changed: 1 addition & 1 deletion
@@ -21,7 +21,7 @@ parse_options() {
     MODEL_TYPE=""
     MODEL_PATH=""
     QFORMAT=""
-    KV_CACHE_QUANT="fp8"
+    KV_CACHE_QUANT=""
     TP=1
     CALIB_TP=
     PP=1

examples/speculative_decoding/launch.sh

Lines changed: 0 additions & 10 deletions
@@ -63,14 +63,6 @@ while [ $# -gt 0 ]; do
         if [[ "$1" != *=* ]]; then shift; fi
         EAGLE_NUM_LAYERS="${1#*=}"
         ;;
-      --redrafter_predict_n_tokens*)
-        if [[ "$1" != *=* ]]; then shift; fi
-        REDRAFTER_TOKENS="${1#*=}"
-        ;;
-      --redrafter_num_layers*)
-        if [[ "$1" != *=* ]]; then shift; fi
-        REDRAFTER_NUM_LAYERS="${1#*=}"
-        ;;
       --fsdp_transformer_layer_cls_to_wrap*)
         if [[ "$1" != *=* ]]; then shift; fi
         FSDP_TRANSFORMER_LAYER_CLS_TO_WRAP="${1#*=}"
@@ -118,8 +110,6 @@ if [[ "$MODE" == "medusa" ]]; then
     SPECULATIVE_ARGS="--medusa_num_heads $MEDUSA_NUM_HEADS --medusa_num_layers $MEDUSA_NUM_LAYERS"
 elif [[ "$MODE" == "eagle" ]]; then
     SPECULATIVE_ARGS="--eagle_num_layers $EAGLE_NUM_LAYERS"
-elif [[ "$MODE" == "redrafter" ]]; then
-    SPECULATIVE_ARGS="--redrafter_predict_n_tokens $REDRAFTER_TOKENS --redrafter_num_layers $REDRAFTER_NUM_LAYERS"
 else
     echo "Only medusa and eagle supported for now!"
     exit 1
