
Commit 0b25bcb

Merge branch 'main' into jennifchen/nmh-moe-export
2 parents 7d3245d + 37c4974

39 files changed (+2423 −418 lines)

.github/workflows/code_quality.yml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ name: Code Quality
 
 on:
   pull_request:
-    branches: [main, release/*]
+    branches: [main, release/*, feature/*]
   schedule:
     - cron: "0 0 * * *" # Nightly
   workflow_dispatch: # On-demand

.github/workflows/gpu_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -61,7 +61,7 @@ jobs:
     if: needs.check-file-changes.outputs.any_changed == 'true'
     # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md
     runs-on: linux-amd64-gpu-l4-latest-1
-    timeout-minutes: 90
+    timeout-minutes: 120
     container: &gpu_container
       image: nvcr.io/nvidia/pytorch:25.06-py3
       env:
@@ -80,7 +80,7 @@ jobs:
     if: ${{ !startsWith(github.ref, 'refs/heads/pull-request/') }}
     # Runner list at https://github.com/nv-gha-runners/enterprise-runner-configuration/blob/main/docs/runner-groups.md
     runs-on: linux-amd64-gpu-h100-latest-1
-    timeout-minutes: 90
+    timeout-minutes: 120
     container: *gpu_container
     steps: *gpu_steps
   gpu-pr-required-check:

.github/workflows/pages.yml

Lines changed: 1 addition & 1 deletion
@@ -2,7 +2,7 @@ name: Docs
 
 on:
   pull_request:
-    branches: [main, release/*]
+    branches: [main, release/*, feature/*]
   push:
     branches: [main]
   schedule:

.github/workflows/unit_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -3,9 +3,9 @@ name: Unit tests
 
 on:
   pull_request:
-    branches: [main, release/*]
+    branches: [main, release/*, feature/*]
   push:
-    branches: [main, release/*]
+    branches: [main, release/*, feature/*]
     paths:
       - ".github/workflows/unit_tests.yml"
       - "modelopt/**"

CHANGELOG.rst

Lines changed: 3 additions & 0 deletions
@@ -11,6 +11,9 @@ Model Optimizer Changelog (Linux)
 - Support PTQ and fakequant in vLLM for fast evaluation of arbitrary quantization formats. See ``examples/vllm_serve`` for more details.
 - Add support for ``nemotron-post-training-dataset-v2`` and ``nemotron-post-training-dataset-v1`` in ``examples/llm_ptq``. Default to a mix of ``cnn_dailymail`` and ``nemotron-post-training-dataset-v2`` (gated dataset accessed using ``HF_TOKEN`` environment variable) if no dataset is specified.
 - Allow specifying ``calib_seq`` in ``examples/llm_ptq`` to set the maximum sequence length for calibration.
+- Add support for MCore MoE PTQ/QAT/QAD.
+- Add support for multi-node PTQ and export with FSDP2 in ``examples/llm_ptq/multinode_ptq.py``. See `examples/llm_ptq/README.md <https://github.com/NVIDIA/TensorRT-Model-Optimizer/tree/main/examples/llm_ptq#multi-node-post-training-quantization-with-fsdp2>`_ for more details.
+- Add support for Nemotron Nano VL v1 & v2 models in FP8/NVFP4 PTQ workflow.
 
 **Documentation**
 

examples/diffusers/quantization/diffusion_trt.py

Lines changed: 85 additions & 6 deletions
@@ -23,6 +23,7 @@
     update_dynamic_axes,
 )
 from quantize import ModelType, PipelineManager
+from tqdm import tqdm
 
 import modelopt.torch.opt as mto
 from modelopt.torch._deploy._runtime import RuntimeRegistry
@@ -58,6 +59,59 @@ def generate_image(pipe, prompt, image_name):
     print(f"Image generated saved as {image_name}")
 
 
+def benchmark_model(
+    pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half"
+):
+    """Benchmark the backbone model inference time."""
+    backbone = pipe.transformer if hasattr(pipe, "transformer") else pipe.unet
+
+    backbone_times = []
+    start_event = torch.cuda.Event(enable_timing=True)
+    end_event = torch.cuda.Event(enable_timing=True)
+
+    def forward_pre_hook(_module, _input):
+        start_event.record()
+
+    def forward_hook(_module, _input, _output):
+        end_event.record()
+        torch.cuda.synchronize()
+        backbone_times.append(start_event.elapsed_time(end_event))
+
+    pre_handle = backbone.register_forward_pre_hook(forward_pre_hook)
+    post_handle = backbone.register_forward_hook(forward_hook)
+
+    try:
+        print(f"Starting warmup: {num_warmup} runs")
+        for _ in tqdm(range(num_warmup), desc="Warmup"):
+            with torch.amp.autocast("cuda", dtype=dtype_map[model_dtype]):
+                _ = pipe(
+                    prompt,
+                    output_type="pil",
+                    num_inference_steps=num_inference_steps,
+                    generator=torch.Generator("cuda").manual_seed(42),
+                )
+
+        backbone_times.clear()
+
+        print(f"Starting benchmark: {num_runs} runs")
+        for _ in tqdm(range(num_runs), desc="Benchmark"):
+            with torch.amp.autocast("cuda", dtype=dtype_map[model_dtype]):
+                _ = pipe(
+                    prompt,
+                    output_type="pil",
+                    num_inference_steps=num_inference_steps,
+                    generator=torch.Generator("cuda").manual_seed(42),
+                )
+    finally:
+        pre_handle.remove()
+        post_handle.remove()
+
+    total_backbone_time = sum(backbone_times)
+    avg_latency = total_backbone_time / (num_runs * num_inference_steps)
+    print(f"Inference latency of the torch backbone: {avg_latency:.2f} ms")
+    return avg_latency
+
+
 def main():
     parser = argparse.ArgumentParser()
     parser.add_argument(
@@ -92,15 +146,24 @@ def main():
         "--onnx-load-path", type=str, default="", help="Path to load the ONNX model"
     )
     parser.add_argument(
-        "--trt-engine-load-path", type=str, default=None, help="Path to load the TRT engine"
+        "--trt-engine-load-path", type=str, default=None, help="Path to load the TensorRT engine"
     )
     parser.add_argument(
         "--dq-only", action="store_true", help="Converts the ONNX model to a dq_only model"
     )
     parser.add_argument(
-        "--torch", action="store_true", help="Generate an image using the torch pipeline"
+        "--torch",
+        action="store_true",
+        help="Use the torch pipeline for image generation or benchmarking",
    )
     parser.add_argument("--save-image-as", type=str, default=None, help="Name of the image to save")
+    parser.add_argument(
+        "--benchmark", action="store_true", help="Benchmark the model backbone inference time"
+    )
+    parser.add_argument(
+        "--torch-compile", action="store_true", help="Use torch.compile() on the backbone model"
+    )
+    parser.add_argument("--skip-image", action="store_true", help="Skip image generation")
     args = parser.parse_args()
 
     image_name = args.save_image_as if args.save_image_as else f"{args.model}.png"
@@ -125,13 +188,25 @@ def main():
     if args.restore_from:
         mto.restore(backbone, args.restore_from)
 
+    if args.torch_compile:
+        assert args.model_dtype in ["BFloat16", "Float", "Half"], (
+            "torch.compile() only supports BFloat16 and Float"
+        )
+        print("Compiling backbone with torch.compile()...")
+        backbone = torch.compile(backbone, mode="max-autotune")
+
     if args.torch:
         if hasattr(pipe, "transformer"):
             pipe.transformer = backbone
         elif hasattr(pipe, "unet"):
             pipe.unet = backbone
         pipe.to("cuda")
-        generate_image(pipe, args.prompt, image_name)
+
+        if args.benchmark:
+            benchmark_model(pipe, args.prompt, model_dtype=args.model_dtype)
+
+        if not args.skip_image:
+            generate_image(pipe, args.prompt, image_name)
         return
 
     backbone.to("cuda")
@@ -211,10 +286,14 @@ def main():
         raise ValueError("Pipeline does not have a transformer or unet backbone")
     pipe.to("cuda")
 
-    generate_image(pipe, args.prompt, image_name)
-    print(f"Image generated using {args.model} model saved as {image_name}")
+    if not args.skip_image:
+        generate_image(pipe, args.prompt, image_name)
+        print(f"Image generated using {args.model} model saved as {image_name}")
 
-    print(f"Inference latency of the backbone of the pipeline is {device_model.get_latency()} ms")
+    if args.benchmark:
+        print(
+            f"Inference latency of the TensorRT optimized backbone: {device_model.get_latency()} ms"
+        )
 
 
 if __name__ == "__main__":
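
The new `benchmark_model` helper isolates backbone (transformer/UNet) time by pairing a forward pre-hook and a forward hook with CUDA events, so text-encoder, scheduler, and VAE time are excluded from the reported latency; in the script this path is reached with `--torch --benchmark` (optionally `--skip-image` or `--torch-compile`). Below is a minimal, self-contained sketch of the same timing pattern on a toy module (hypothetical names, requires a CUDA GPU), not the script itself:

```python
import torch
import torch.nn as nn

# Toy stand-in for the pipeline backbone; the real script hooks pipe.transformer or pipe.unet.
backbone = nn.Linear(1024, 1024).cuda().half()

times_ms = []
start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)

def pre_hook(_module, _input):
    start.record()  # timestamp just before the backbone forward

def post_hook(_module, _input, _output):
    end.record()
    torch.cuda.synchronize()  # wait for the kernels to finish before reading the timer
    times_ms.append(start.elapsed_time(end))

h_pre = backbone.register_forward_pre_hook(pre_hook)
h_post = backbone.register_forward_hook(post_hook)
try:
    x = torch.randn(8, 1024, device="cuda", dtype=torch.half)
    for _ in range(5):    # warmup runs, then discard their timings
        backbone(x)
    times_ms.clear()
    for _ in range(20):   # timed runs
        backbone(x)
finally:
    h_pre.remove()
    h_post.remove()

print(f"avg backbone forward: {sum(times_ms) / len(times_ms):.3f} ms")
```

Averaging per forward call rather than per image matches how `benchmark_model` divides the total by `num_runs * num_inference_steps`, since the backbone runs once per denoising step.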

examples/diffusers/quantization/onnx_utils/export.py

Lines changed: 16 additions & 2 deletions
@@ -73,6 +73,13 @@
         "pooled_projections": {0: "batch_size"},
         "sample": {0: "batch_size", 1: "num_channels", 2: "height", 3: "width"},
     },
+    "sd3.5-medium": {
+        "hidden_states": {0: "batch_size", 1: "num_channels", 2: "height", 3: "width"},
+        "timestep": {0: "steps"},
+        "encoder_hidden_states": {0: "batch_size", 1: "sequence_length"},
+        "pooled_projections": {0: "batch_size"},
+        "out_hidden_states": {0: "batch_size", 1: "num_channels", 2: "height", 3: "width"},
+    },
     "flux-dev": {
         "hidden_states": {0: "batch_size", 1: "latent_dim"},
         "encoder_hidden_states": {0: "batch_size"},
@@ -290,6 +297,8 @@ def update_dynamic_axes(model_id, dynamic_axes):
         dynamic_axes["out.0"] = dynamic_axes.pop("latent")
     elif model_id == "sd3-medium":
         dynamic_axes["out.0"] = dynamic_axes.pop("sample")
+    elif model_id == "sd3.5-medium":
+        dynamic_axes["out.0"] = dynamic_axes.pop("out_hidden_states")
 
 
 def _create_dynamic_shapes(dynamic_shapes):
@@ -313,7 +322,7 @@ def generate_dummy_inputs_and_dynamic_axes_and_shapes(model_id, backbone):
         dummy_input, dynamic_shapes = _gen_dummy_inp_and_dyn_shapes_sdxl(
             backbone, min_bs=2, opt_bs=16
         )
-    elif model_id == "sd3-medium":
+    elif model_id in ["sd3-medium", "sd3.5-medium"]:
         dummy_input, dynamic_shapes = _gen_dummy_inp_and_dyn_shapes_sd3(
             backbone, min_bs=2, opt_bs=16
         )
@@ -343,14 +352,16 @@ def get_io_shapes(model_id, onnx_load_path, dynamic_shapes):
         output_name = "latent"
     elif model_id in ["sd3-medium"]:
         output_name = "sample"
+    elif model_id in ["sd3.5-medium"]:
+        output_name = "out_hidden_states"
     elif model_id in ["flux-dev", "flux-schnell"]:
         output_name = "output"
     else:
         raise NotImplementedError(f"Unsupported model_id: {model_id}")
 
     if model_id in ["sdxl-1.0", "sdxl-turbo"]:
         io_shapes = {output_name: dynamic_shapes["dynamic_shapes"]["minShapes"]["sample"]}
-    elif model_id in ["sd3-medium"]:
+    elif model_id in ["sd3-medium", "sd3.5-medium"]:
         io_shapes = {output_name: dynamic_shapes["dynamic_shapes"]["minShapes"]["hidden_states"]}
     elif model_id in ["flux-dev", "flux-schnell"]:
         io_shapes = {}
@@ -406,6 +417,9 @@ def modelopt_export_sd(backbone, onnx_dir, model_name, precision):
     elif model_name == "sd3-medium":
         input_names = ["hidden_states", "encoder_hidden_states", "pooled_projections", "timestep"]
         output_names = ["sample"]
+    elif model_name == "sd3.5-medium":
+        input_names = ["hidden_states", "encoder_hidden_states", "pooled_projections", "timestep"]
+        output_names = ["out_hidden_states"]
     elif model_name in ["flux-dev", "flux-schnell"]:
         input_names = [
             "hidden_states",

examples/diffusers/quantization/quantize.py

Lines changed: 13 additions & 4 deletions
@@ -16,6 +16,7 @@
 import argparse
 import logging
 import sys
+import time as time
 from collections.abc import Callable
 from dataclasses import dataclass
 from enum import Enum
@@ -59,6 +60,7 @@ class ModelType(str, Enum):
     SDXL_BASE = "sdxl-1.0"
     SDXL_TURBO = "sdxl-turbo"
     SD3_MEDIUM = "sd3-medium"
+    SD35_MEDIUM = "sd3.5-medium"
     FLUX_DEV = "flux-dev"
     FLUX_SCHNELL = "flux-schnell"
     LTX_VIDEO_DEV = "ltx-video-dev"
@@ -114,6 +116,7 @@ def get_model_filter_func(model_type: ModelType) -> Callable[[str], bool]:
         ModelType.SDXL_BASE: filter_func_default,
         ModelType.SDXL_TURBO: filter_func_default,
         ModelType.SD3_MEDIUM: filter_func_default,
+        ModelType.SD35_MEDIUM: filter_func_default,
         ModelType.LTX_VIDEO_DEV: filter_func_ltx_video,
     }
 
@@ -125,6 +128,7 @@ def get_model_filter_func(model_type: ModelType) -> Callable[[str], bool]:
     ModelType.SDXL_BASE: "stabilityai/stable-diffusion-xl-base-1.0",
     ModelType.SDXL_TURBO: "stabilityai/sdxl-turbo",
     ModelType.SD3_MEDIUM: "stabilityai/stable-diffusion-3-medium-diffusers",
+    ModelType.SD35_MEDIUM: "stabilityai/stable-diffusion-3.5-medium",
     ModelType.FLUX_DEV: "black-forest-labs/FLUX.1-dev",
     ModelType.FLUX_SCHNELL: "black-forest-labs/FLUX.1-schnell",
     ModelType.LTX_VIDEO_DEV: "Lightricks/LTX-Video-0.9.7-dev",
@@ -230,6 +234,7 @@ def uses_transformer(self) -> bool:
         """Check if model uses transformer backbone (vs UNet)."""
         return self.model_type in [
             ModelType.SD3_MEDIUM,
+            ModelType.SD35_MEDIUM,
             ModelType.FLUX_DEV,
             ModelType.FLUX_SCHNELL,
             ModelType.LTX_VIDEO_DEV,
@@ -326,7 +331,7 @@ def create_pipeline_from(
         model_id = (
             MODEL_REGISTRY[model_type] if override_model_path is None else override_model_path
         )
-        if model_type == ModelType.SD3_MEDIUM:
+        if model_type in [ModelType.SD3_MEDIUM, ModelType.SD35_MEDIUM]:
             pipe = StableDiffusion3Pipeline.from_pretrained(model_id, torch_dtype=torch_dtype)
         elif model_type in [ModelType.FLUX_DEV, ModelType.FLUX_SCHNELL]:
             pipe = FluxPipeline.from_pretrained(model_id, torch_dtype=torch_dtype)
@@ -357,7 +362,7 @@ def create_pipeline(self) -> DiffusionPipeline:
         self.logger.info(f"Data type: {self.config.model_dtype.value}")
 
         try:
-            if self.config.model_type == ModelType.SD3_MEDIUM:
+            if self.config.model_type in [ModelType.SD3_MEDIUM, ModelType.SD35_MEDIUM]:
                 self.pipe = StableDiffusion3Pipeline.from_pretrained(
                     self.config.model_path, torch_dtype=self.config.torch_dtype
                 )
@@ -864,6 +869,8 @@ def main() -> None:
     parser = create_argument_parser()
     args = parser.parse_args()
 
+    s = time.time()
+
     logger = setup_logging(args.verbose)
     logger.info("Starting Enhanced Diffusion Model Quantization")
 
@@ -939,9 +946,11 @@ def forward_loop(mod):
             backbone,
             model_config.model_type,
             quant_config.format,
-            quantize_mha=QuantizationConfig.quantize_mha,
+            quantize_mha=quant_config.quantize_mha,
+        )
+        logger.info(
+            f"Quantization process completed successfully! Time taken = {time.time() - s} seconds"
         )
-        logger.info("Quantization process completed successfully!")
 
     except Exception as e:
         logger.error(f"Quantization failed: {e}", exc_info=True)

examples/llm_ptq/README.md

Lines changed: 32 additions & 0 deletions
@@ -235,6 +235,38 @@ with init_quantized_weights(mtq.NVFP4_DEFAULT_CFG):
     mtq.calibrate(model, algorithm="max", forward_loop=calibrate_loop)
 ```
 
+## Multi-Node Post-Training Quantization with FSDP2
+
+ModelOpt enables quantization of LLMs across multiple GPU nodes using various quantization formats. It leverages HuggingFace's Accelerate library and FSDP2 for distributed model sharding and calibration.
+
+### Usage
+
+For distributed execution across multiple nodes, use the `accelerate` library. A template configuration file (`fsdp2.yaml`) is provided and can be customized for user specific requirements.
+
+On each node run the following command:
+
+```bash
+accelerate launch --config_file fsdp2.yaml \
+    --num_machines=<num_nodes> \
+    --machine_rank=<current_node_rank> \
+    --main_process_ip=<node0_ip_addr> \
+    --main_process_port=<port> \
+    --fsdp_transformer_layer_cls_to_wrap=<decoder_layer_name>
+    multinode_ptq.py \
+    --pyt_ckpt_path <path_to_model> \
+    --qformat <fp8/nvfp4/nvfp4_awq/int8> \
+    --kv_cache_qformat <fp8/nvfp4/nvfp4_affine/none> \
+    --batch_size <calib_batch_size> \
+    --calib_size <num_calib_samples> \
+    --dataset <dataset> \
+    --export_path <export_path> \
+    --trust_remote_code
+```
+
+The exported checkpoint can be deployed using TensorRT-LLM/ vLLM/ SGLang. For more details refer to the [deployment section](#deployment) of this document.
+
+> *Performance Note: FSDP2 is designed for training workloads and may result in longer calibration and export times. For faster calibration, maximize the batch size based on available GPU memory and choose the right number of GPUs to avoid unnecessary communication.*
+>
 ## Framework Scripts
 
 ### Hugging Face Example [Script](./scripts/huggingface_example.sh)
