
Commit 3708535

Update files on Github (#258)

Signed-off-by: Keval Morabia <[email protected]>
1 parent c19ef08

95 files changed: +4,356 -1,127 lines


CHANGELOG-Windows.rst

Lines changed: 8 additions & 0 deletions

@@ -2,6 +2,14 @@
 Model Optimizer Changelog (Windows)
 ===================================
 
+0.33 (2025-07-21)
+^^^^^^^^^^^^^^^^^
+
+**New Features**
+
+- TensorRT Model Optimizer for Windows now supports `NvTensorRtRtx <https://onnxruntime.ai/docs/execution-providers/TensorRTRTX-ExecutionProvider.html>`_ execution-provider.
+
+
 0.27 (2025-04-30)
 ^^^^^^^^^^^^^^^^^
 
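The changelog entry above only names the new execution provider. As a rough illustration of how an ONNX Runtime execution provider is selected, here is a minimal sketch; the provider registration string and the model path are assumptions for illustration only (check the linked TensorRTRTX-ExecutionProvider page for the exact name), not details taken from this commit:

    # Hedged sketch: selecting a specific ONNX Runtime execution provider.
    # The provider name below is an assumption based on the linked docs page,
    # and "model.onnx" is a placeholder path.
    import onnxruntime as ort

    ASSUMED_PROVIDER = "NvTensorRTRTXExecutionProvider"  # assumption, verify against the EP docs

    providers = (
        [ASSUMED_PROVIDER]
        if ASSUMED_PROVIDER in ort.get_available_providers()
        else ["CPUExecutionProvider"]  # fall back to CPU if the EP is not installed
    )
    session = ort.InferenceSession("model.onnx", providers=providers)
    print("Active providers:", session.get_providers())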

CHANGELOG.rst

Lines changed: 4 additions & 1 deletion

@@ -8,13 +8,15 @@ Model Optimizer Changelog (Linux)
 
 **Deprecations**
 
-- Deprecate ``torch<2.5`` support.
+- Deprecate ``torch<2.6`` support.
 
 **New Features**
 
 - (Experimental) Add quantization support for custom TensorRT op in ONNX models.
 - Add support for Minifinetuning (MFT; https://arxiv.org/abs/2506.15702) self-corrective distillation, which enables training on small datasets with severely mitigated catastrophic forgetting.
 - Add tree decoding support for Megatron Eagle models.
+- For most VLMs, we now explicitly disable quant on the vision part so we add them to the excluded_modules during HF export.
+- Add support for ``hidden_size`` and ``num_layers`` pruning for Megatron Core Mamba models in ``mcore_gpt_minitron`` mode.
 
 0.33 (2025-07-14)
 ^^^^^^^^^^^^^^^^^
@@ -36,6 +38,7 @@ Model Optimizer Changelog (Linux)
 - ModelOpt now supports quantization of tensor-parallel sharded Huggingface transformer models. This requires ``transformers>=4.52.0``.
 - Support quantization of FSDP2 wrapped models and add FSDP2 support in the ``llm_qat`` example.
 - Add NeMo 2 Simplified Flow examples for quantization aware training/distillation (QAT/QAD), speculative decoding, pruning & distillation.
+- Fix a Qwen3 MOE model export issue.
 
 0.31 (2025-06-04)
 ^^^^^^^^^^^^^^^^^

docs/source/getting_started/_installation_for_Linux.rst

Lines changed: 1 addition & 1 deletion

@@ -16,7 +16,7 @@ Latest Model Optimizer (``nvidia-modelopt``) currently has the following system
 +-------------------------+-----------------------------+
 | CUDA                    | >=12.0                      |
 +-------------------------+-----------------------------+
-| PyTorch                 | >=2.4                       |
+| PyTorch                 | >=2.6                       |
 +-------------------------+-----------------------------+
 | TensorRT-LLM (Optional) | 0.20                        |
 +-------------------------+-----------------------------+

docs/source/guides/6_save_load.rst

Lines changed: 7 additions & 0 deletions

@@ -166,6 +166,13 @@ Here is an example of how to enable ModelOpt save/restore with the Huggingface A
     # Save the ModelOpt-modified model architecture and weights using Huggingface APIs
     model.save_pretrained(f"ModelOpt_{model_path}")
 
+By default, the modelopt state is saved in the same directory as the model weights.
+You can disable this by setting the ``save_modelopt_state`` to ``False`` in the ``save_pretrained`` API, as shown below:
+
+.. code-block:: python
+
+    model.save_pretrained(f"ModelOpt_{model_path}", save_modelopt_state=False)
+
 The model saved as above can be restored using the Huggingface ``from_pretrained`` API.
 Do not forget to call :meth:`mto.enable_huggingface_checkpointing() <modelopt.torch.opt.plugins.huggingface.enable_huggingface_checkpointing>`
 before loading the model. This needs to be done only once in the program.
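For context, a minimal sketch of the documented save/restore flow around the new ``save_modelopt_state`` option, assuming placeholder checkpoint paths and omitting the actual ModelOpt conversion step:

    # Minimal sketch of the documented flow (not code from this commit).
    # "my-model-checkpoint" and the output directories are placeholders.
    from transformers import AutoModelForCausalLM

    import modelopt.torch.opt as mto

    mto.enable_huggingface_checkpointing()  # call once per program, before loading

    model_path = "my-model-checkpoint"  # placeholder
    model = AutoModelForCausalLM.from_pretrained(model_path)

    # ... apply ModelOpt conversions (quantization, pruning, ...) here ...

    # Default behaviour: the modelopt state is written alongside the weights.
    model.save_pretrained("ModelOpt_with_state")

    # New option from this diff: skip saving the modelopt state, so the
    # checkpoint later loads as a plain Huggingface model.
    model.save_pretrained("ModelOpt_without_state", save_modelopt_state=False)

    # Restore with the plain HF API; enable_huggingface_checkpointing() must
    # have been called earlier for any saved modelopt state to be picked up.
    restored = AutoModelForCausalLM.from_pretrained("ModelOpt_with_state")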

docs/source/guides/8_autocast.rst

Lines changed: 1 addition & 1 deletion

@@ -2,7 +2,7 @@ AutoCast (ONNX)
 ###############
 
 AutoCast is a tool for converting FP32 ONNX models to mixed precision FP32-FP16 or FP32-BF16 models.
-While casting FP32 to FP6/BF16, some nodes might be more sensitive to effecting accuracy.
+While casting FP32 to FP16/BF16, some nodes might be more sensitive to effecting accuracy.
 AutoCast intelligently selects nodes to keep in FP32 precision to maintain model accuracy while benefiting from
 reduced precision on the rest of the nodes. AutoCast automatically injects cast operations around the selected
 nodes.
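To make "injects cast operations around the selected nodes" concrete, here is a conceptual sketch using the onnx helper API. It is an illustration of the idea only, not the ModelOpt AutoCast implementation, and the helper name ``wrap_node_in_fp32`` is invented for this example:

    # Conceptual illustration only -- not the ModelOpt AutoCast implementation.
    # Shows what keeping one node in FP32 inside an otherwise FP16 graph means:
    # cast its inputs up to FP32 and its outputs back down to FP16.
    import onnx
    from onnx import TensorProto, helper


    def wrap_node_in_fp32(graph: onnx.GraphProto, node: onnx.NodeProto) -> None:
        """Insert Cast nodes so `node` runs in FP32 while the rest stays FP16."""
        new_nodes = []
        for i, inp in enumerate(node.input):
            cast_out = f"{inp}_as_fp32"
            new_nodes.append(
                helper.make_node("Cast", [inp], [cast_out], to=TensorProto.FLOAT)
            )
            node.input[i] = cast_out
        for i, out in enumerate(node.output):
            fp32_out = f"{out}_fp32"
            new_nodes.append(
                helper.make_node("Cast", [fp32_out], [out], to=TensorProto.FLOAT16)
            )
            node.output[i] = fp32_out
        # Real tools also re-sort nodes topologically after insertion.
        graph.node.extend(new_nodes)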

examples/llm_ptq/example_utils.py

Lines changed: 14 additions & 56 deletions

@@ -18,15 +18,10 @@
 from typing import Any
 
 import torch
+import transformers
 from accelerate import infer_auto_device_map, init_empty_weights
 from accelerate.utils import get_max_memory
-from transformers import (
-    AutoConfig,
-    AutoModelForCausalLM,
-    AutoProcessor,
-    AutoTokenizer,
-    Llama4ForConditionalGeneration,
-)
+from transformers import AutoConfig, AutoModelForCausalLM, AutoProcessor, AutoTokenizer
 
 from modelopt.torch.utils.image_processor import MllamaImageProcessor
 
@@ -148,7 +143,7 @@ def get_model(
     if device == "cpu":
         device_map = "cpu"
 
-    config_kwargs = {"trust_remote_code": trust_remote_code}
+    config_kwargs = {"trust_remote_code": trust_remote_code} if trust_remote_code else {}
     if attn_implementation is not None:
         config_kwargs["attn_implementation"] = attn_implementation
 
@@ -182,61 +177,24 @@ def get_model(
         max_memory = {key: value * gpu_mem_percentage for key, value in max_memory.items()}
         model_kwargs["max_memory"] = max_memory
 
+    if hf_config.model_type == "bart":
+        # device_map "auto" and "cuda" triggers error regarding meta tensor from safetensors
+        device_map = None
+
     if is_speculative(hf_config):
         model = AutoModelForCausalLM.from_pretrained(
             ckpt_path,
             device_map=device_map,
             **model_kwargs,
         )
-    elif hf_config.model_type == "llava":
-        from transformers import LlavaForConditionalGeneration
-
-        hf_llava = LlavaForConditionalGeneration.from_pretrained(
-            ckpt_path, device_map=device_map, **model_kwargs
-        )
-        model = hf_llava.language_model
-    elif hf_config.model_type == "t5":
-        from transformers import AutoModelForSeq2SeqLM
-
-        model = AutoModelForSeq2SeqLM.from_pretrained(
-            ckpt_path, device_map=device_map, **model_kwargs
-        )
-    elif hf_config.model_type == "bart":
-        from transformers import AutoModelForSeq2SeqLM
-
-        # device_map "auto" and "cuda" triggers error regarding meta tensor from safetensors
-        model = AutoModelForSeq2SeqLM.from_pretrained(
-            ckpt_path, device_map=None, **model_kwargs
-        ).to(device)
-    elif hf_config.model_type == "whisper":
-        from transformers import WhisperForConditionalGeneration
-
-        model = WhisperForConditionalGeneration.from_pretrained(
-            ckpt_path, device_map=device_map, **model_kwargs
-        )
-    elif hf_config.model_type == "glm":
-        from transformers import AutoModelForSeq2SeqLM
+    else:
+        architecture = hf_config.architectures[0]
 
-        model = AutoModelForSeq2SeqLM.from_pretrained(
-            ckpt_path,
-            device_map="cuda",
-            **model_kwargs,
+        assert hasattr(transformers, architecture), (
+            f"Architecture {architecture} not found in transformers: {transformers.__version__}"
         )
-    elif hf_config.model_type == "mllama":
-        from transformers import MllamaForConditionalGeneration
+        auto_model_module = getattr(transformers, architecture)
 
-        model = MllamaForConditionalGeneration.from_pretrained(
-            ckpt_path,
-            device_map=device_map,
-            **model_kwargs,
-        )
-    elif hf_config.model_type == "llama4":
-        model = Llama4ForConditionalGeneration.from_pretrained(
-            ckpt_path,
-            device_map=device_map,
-            **model_kwargs,
-        )
-    else:
         with init_empty_weights():
             # When computing the device_map, assuming half precision by default,
             # unless specified by the hf_config.
@@ -246,7 +204,7 @@ def get_model(
             # DeciLMForCausalLM does not support max_memory argument
             if "architectures" in hf_config and "DeciLMForCausalLM" in hf_config.architectures:
                 model_kwargs2.pop("max_memory", None)
-            model = AutoModelForCausalLM.from_config(
+            model = auto_model_module._from_config(
                 hf_config,
                 **model_kwargs2,
             )
@@ -269,7 +227,7 @@ def get_model(
             )
             model_kwargs["max_memory"] = max_memory
 
-        model = AutoModelForCausalLM.from_pretrained(
+        model = auto_model_module.from_pretrained(
            ckpt_path,
            device_map=device_map,
            **model_kwargs,
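The per-model-type elif chain above is replaced by resolving the model class directly from ``hf_config.architectures``. A standalone sketch of that pattern, using ``gpt2`` purely as a placeholder checkpoint:

    # Standalone sketch of the architecture-resolution pattern used above.
    # "gpt2" is just a placeholder checkpoint for illustration.
    import transformers
    from transformers import AutoConfig

    ckpt_path = "gpt2"  # placeholder
    hf_config = AutoConfig.from_pretrained(ckpt_path)

    # e.g. "GPT2LMHeadModel"; transformers exposes the class under that name.
    architecture = hf_config.architectures[0]
    assert hasattr(transformers, architecture), (
        f"Architecture {architecture} not found in transformers {transformers.__version__}"
    )
    auto_model_cls = getattr(transformers, architecture)

    model = auto_model_cls.from_pretrained(ckpt_path)
    print(type(model).__name__)  # GPT2LMHeadModel

This keeps the loading path uniform across model families at the cost of requiring the installed transformers version to actually export the named architecture class, which is what the assert guards against.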

examples/llm_ptq/hf_ptq.py

Lines changed: 44 additions & 30 deletions

@@ -46,6 +46,7 @@
     create_forward_loop,
     get_dataset_dataloader,
     get_max_batch_size,
+    get_supported_datasets,
 )
 from modelopt.torch.utils.image_processor import MllamaImageProcessor
 from modelopt.torch.utils.memory_monitor import launch_memory_monitor
@@ -195,6 +196,9 @@ def main(args):
     # launch a memory monitor to read the currently used GPU memory.
     launch_memory_monitor()
 
+    # Force eager execution for all model types.
+    torch.compiler.set_stance("force_eager")
+
     # Check that only one quantization format is provided for non auto_quant case
     if not args.auto_quantize_bits:
         assert len(args.qformat.split(",")) == 1, (
@@ -267,14 +271,6 @@ def main(args):
         full_model = model
 
         if model_type == "mllama":
-            if args.dataset is None:
-                args.dataset = "scienceqa"
-                warnings.warn(
-                    "Currently only the scienceqa dataset is supported for the mllama model. "
-                    "Overriding dataset to scienceqa."
-                )
-            elif args.dataset != "scienceqa":
-                raise ValueError("Only the scienceqa dataset is supported for the mllama model.")
             processor = get_processor(
                 args.pyt_ckpt_path,
                 model_type,
@@ -283,20 +279,12 @@ def main(args):
                 attn_implementation=args.attn_implementation,
             )
         elif model_type == "whisper":
-            if args.dataset is None:
-                args.dataset = "peoples_speech"
-                warnings.warn(
-                    "Currently only the peoples_speech dataset is supported for the whisper model. "
-                    "Overriding dataset to peoples_speech."
-                )
-            elif args.dataset != "peoples_speech":
-                raise ValueError("Only the peoples_speech dataset is supported for the whisper model.")
             processor = get_processor(
                 args.pyt_ckpt_path, model_type, device, trust_remote_code=args.trust_remote_code
             )
         else:
             if args.dataset is None:
-                args.dataset = "cnn_dailymail"
+                args.dataset = ["cnn_dailymail"]
                 warnings.warn("No dataset specified. Defaulting to cnn_dailymail.")
             tokenizer = get_tokenizer(args.pyt_ckpt_path, trust_remote_code=args.trust_remote_code)
             default_padding_side = tokenizer.padding_side
@@ -305,16 +293,31 @@ def main(args):
 
         # We only quantize the language model for VLMs other than the type supported above.
         if hasattr(model, "language_model"):
-            assert model_type == "llama4", (
-                "Only llama4 should reach here. Please uncomment this check if you are modelopt developers."
-            )
+            parent_model = model  # llama4 case
+            if isinstance(type(model).__dict__.get("language_model"), property):
+                assert hasattr(model, "model") and hasattr(model.model, "language_model"), (
+                    "Expected language_model in model.model, but attribute not found. "
+                    "This may indicate an unsupported model structure."
+                )
+                parent_model = model.model  # gemma3, qwen2.5 VL case
+
+            disabled_quant_cfg = {
+                "quant_cfg": {"default": {"enable": False}},
+                "algorithm": "max",
+            }
+
+            for name, child in parent_model.named_children():
+                # Apply disabled quant to all children except language_model so we can exclude them during HF export.
+                if name != "language_model":
+                    mtq.quantize(child, disabled_quant_cfg, forward_loop=None)
+
             model = model.language_model
 
         if args.sparsity_fmt != "dense":
             if args.batch_size == 0:
                 # Sparse algorithm takes more GPU memory so we reduce the batch_size by 4.
                 args.batch_size = max(get_max_batch_size(model) // 4, 1)
-            args.batch_size = min(args.batch_size, args.calib_size)
+            args.batch_size = min(args.batch_size, sum(args.calib_size))
 
         print(f"Use calib batch_size {args.batch_size}")
 
@@ -373,7 +376,7 @@ def main(args):
             sample_input_single_batch=sample_input_single_batch,
             enable_grad=run_auto_quant,
         )
-        args.batch_size = min(args.batch_size, args.calib_size)
+        args.batch_size = min(args.batch_size, sum(args.calib_size))
 
         print(f"Use calib batch_size {args.batch_size}")
 
@@ -383,17 +386,17 @@ def main(args):
                 "The MllamaImageProcessor must be set."
            )
            calib_dataloader = get_vlm_dataset_dataloader(
-                dataset_name=args.dataset,
+                dataset_name=args.dataset[0] if args.dataset else "scienceqa",
                processor=processor,
                batch_size=args.batch_size,
-                num_samples=args.calib_size,
+                num_samples=args.calib_size[0],
            )
        elif model_type == "whisper":
            assert processor is not None and isinstance(processor, WhisperProcessor), (
                "The AutoProcessor must be set."
            )
            calib_dataloader, first_text = get_speech_dataset_dataloader(
-                dataset_name=args.dataset,
+                dataset_name=args.dataset[0] if args.dataset else "peoples_speech",
                processor=processor,
                batch_size=args.batch_size,
                num_samples=args.calib_size,
@@ -454,7 +457,7 @@ def main(args):
                 "input_features" if model_type == "whisper" else "input_ids"
             ][0:1]
             try:
-                generated_ids_before_ptq = model.generate(input_ids, max_new_tokens=100)
+                generated_ids_before_ptq = full_model.generate(input_ids, max_new_tokens=100)
             except Exception as e:
                 print(
                     "Error during model generation. Please check if your transformers version is "
@@ -472,7 +475,8 @@ def main(args):
             torch.cuda.empty_cache()
             generated_ids_after_ptq = None
             if model_type != "llama4":
-                generated_ids_after_ptq = model.generate(input_ids, max_new_tokens=100)
+                # Our fake quantizer may not be fully compatible with torch.compile.
+                generated_ids_after_ptq = full_model.generate(input_ids, max_new_tokens=100)
             else:
                 warnings.warn(
                     "Llama4 Maverick generation after quantization has a bug. Skipping generation sample."
@@ -600,15 +604,23 @@ def output_decode(generated_ids, input_shape):
         default=0,
     )
     parser.add_argument(
-        "--calib_size", help="Number of samples for calibration.", type=int, default=512
+        "--calib_size",
+        help=(
+            "Number of samples for calibration. If a comma separated list of values is provided, "
+            "each value will be used as the calibration size for the corresponding dataset."
+        ),
+        type=str,
+        default="512",
     )
     parser.add_argument("--export_path", default="exported_model")
     parser.add_argument(
         "--dataset",
-        help="name of dataset.",
+        help=(
+            f"name of a dataset, or a comma separated list of datasets. "
+            f"dataset choices are {get_supported_datasets()}"
+        ),
         type=str,
         default=None,
-        choices=["magpie", "cnn_dailymail", "pile", "pg19", "wikipedia"],
     )
     parser.add_argument("--inference_tensor_parallel", type=int, default=1)
     parser.add_argument("--inference_pipeline_parallel", type=int, default=1)
@@ -695,4 +707,6 @@ def output_decode(generated_ids, input_shape):
 
     args = parser.parse_args()
 
+    args.dataset = args.dataset.split(",") if args.dataset else None
+    args.calib_size = [int(num_sample) for num_sample in args.calib_size.split(",")]
     main(args)
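The central change in this file is disabling quantization on every VLM submodule except ``language_model`` so the HF exporter can list those siblings as excluded. A toy, self-contained sketch of that pattern; the TinyVLM/TinyLM classes are stand-ins invented for illustration, not a real model:

    # Toy sketch of the "disable quant outside language_model" pattern above.
    # TinyVLM/TinyLM are invented stand-ins for illustration, not a real model.
    import torch.nn as nn

    import modelopt.torch.quantization as mtq


    class TinyLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.proj = nn.Linear(16, 16)

        def forward(self, x):
            return self.proj(x)


    class TinyVLM(nn.Module):
        def __init__(self):
            super().__init__()
            self.vision_tower = nn.Sequential(nn.Linear(16, 16), nn.ReLU())
            self.language_model = TinyLM()

        def forward(self, x):
            return self.language_model(self.vision_tower(x))


    model = TinyVLM()

    disabled_quant_cfg = {
        "quant_cfg": {"default": {"enable": False}},
        "algorithm": "max",
    }

    # Attach disabled quantizers to every sibling of language_model so an
    # exporter can recognize those submodules as excluded from quantization.
    for name, child in model.named_children():
        if name != "language_model":
            mtq.quantize(child, disabled_quant_cfg, forward_loop=None)

    # The language model itself would then be quantized with a real config and
    # a calibration forward_loop (omitted in this sketch).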

examples/llm_ptq/run_tensorrt_llm.py

Lines changed: 1 addition & 1 deletion

@@ -66,7 +66,7 @@ def run(args):
 
     print("TensorRT-LLM example outputs:")
 
-    llm = LLM(args.engine_dir, tokenizer=tokenizer)
+    llm = LLM(args.engine_dir, tokenizer=tokenizer, max_batch_size=len(input_texts))
     torch.cuda.cudart().cudaProfilerStart()
     outputs = llm.generate_text(input_texts, args.max_output_len)
     torch.cuda.cudart().cudaProfilerStop()
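Capping ``max_batch_size`` at the number of prompts keeps the runtime from allocating a larger batch than the sample run will ever use. A hedged sketch of the resulting call pattern, assuming the example's ``LLM`` wrapper comes from ``modelopt.deploy.llm`` (an assumption; the import lies outside this hunk) and using placeholder prompts and paths:

    # Hedged sketch of the adjusted call; import path, prompts, engine
    # directory, and tokenizer name are assumptions/placeholders.
    from modelopt.deploy.llm import LLM  # assumed import, not shown in this diff
    from transformers import AutoTokenizer

    engine_dir = "/path/to/engine_dir"  # placeholder
    tokenizer = AutoTokenizer.from_pretrained("gpt2")  # placeholder tokenizer

    input_texts = ["The capital of France is", "Explain KV caching in one sentence:"]

    # Bound the engine's max batch size by the number of prompts actually sent.
    llm = LLM(engine_dir, tokenizer=tokenizer, max_batch_size=len(input_texts))
    outputs = llm.generate_text(input_texts, 100)  # 100 mirrors args.max_output_len
    print(outputs)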

examples/llm_qat/launch.sh

Lines changed: 4 additions & 3 deletions

@@ -166,13 +166,14 @@ if [[ "${DISTILL}" == "True" ]]; then
     FSDP_ARGS="$FSDP_ARGS --fsdp_cpu_ram_efficient_loading False"
 fi
 
-# real quantization does not work with FSDP
-if [[ "${COMPRESS,,}" == "true" ]]; then
-    echo "Compression is not supported with FSDP. Disabling FSDP."
+# real quantization does not work with FSDP, only works with FSDP2
+if [[ "${COMPRESS,,}" == "true" && "${USE_FSDP2,,}" != "true" ]]; then
+    echo "Compression is not supported with FSDP. Disabling FSDP and using DDP."
     FSDP_ARGS=""
     CONFIG_FILE="ddp.yaml"
 fi
 
+
 CMD="accelerate launch --config-file accelerate_config/$CONFIG_FILE $FSDP_ARGS \
     main.py \
     --model_name_or_path $MODEL \
