Add option to benchmark pipeline in diffusion_trt.py #457
Conversation
Signed-off-by: ajrasane <[email protected]>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Walkthrough

Adds a CUDA-timed backbone benchmarking function with new CLI flags and optional torch.compile() for diffusion inference; refactors ONNX external-data handling by introducing helpers to detect/list external tensors, switching to per-model temporary ONNX directories, updating save/load calls, and adjusting cleanup logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Pipeline as Model Pipeline
    participant Benchmark as benchmark_model()
    participant CUDA as CUDA Events
    User->>CLI: run script (flags: --benchmark / --skip-image / --torch-compile)
    CLI->>CLI: parse args
    alt torch compile requested
        CLI->>Pipeline: apply torch.compile() to backbone
    end
    alt --benchmark set
        CLI->>Benchmark: call benchmark_model(pipe, prompt, ...)
        Benchmark->>Pipeline: attach forward hooks to backbone
        Benchmark->>CUDA: record warmup start/end
        loop warmup (num_warmup)
            Pipeline->>CUDA: forward (start/end events)
        end
        Benchmark->>CUDA: record timed runs start/end
        loop timed runs (num_runs)
            Pipeline->>CUDA: forward (start/end events)
        end
        Benchmark->>CLI: return average latency
        CLI->>User: print latency (TensorRT wording if applicable)
    end
    alt --skip-image not set
        CLI->>Pipeline: generate and save image
    else --skip-image set
        CLI->>CLI: skip image generation
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Signed-off-by: ajrasane <[email protected]>
Signed-off-by: ajrasane <[email protected]>
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff            @@
##             main     #457     +/-   ##
=========================================
  Coverage   73.37%   73.37%
=========================================
  Files         180      180
  Lines       17937    17980      +43
=========================================
+ Hits        13161    13193      +32
- Misses       4776     4787      +11
```

☔ View full report in Codecov by Sentry.
Signed-off-by: ajrasane <[email protected]>
```python
module._start_time = time.time()

def forward_hook(module, input, output):
    torch.cuda.synchronize()
```
do you feel it will be more valuable to show the GPU time using cuda event instead of the CPU time?
With the GPU time you don't need to call explicit sync
Done
```python
def forward_hook(module, input, output):
    torch.cuda.synchronize()
    module._end_time = time.time()
    backbone_times.append((module._end_time - module._start_time) * 1000)  # Convert to ms
```
I think it's also fine to just do total time / # of runs. With that, maybe you don't need these hooks.
The total time will also record the time of the other components of the pipeline, like the encoder and decoder. Hence I have added these hooks only to the backbone, as it takes up the majority of the pipeline time.
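For reference, a minimal sketch of the approach being discussed here: hooks attached only to the backbone, recording a CUDA event pair per forward call, with a single synchronize before reading the results. Names and structure are illustrative, not the PR's exact code.

```python
import torch
import torch.nn as nn

def attach_backbone_timing_hooks(backbone: nn.Module, event_pairs: list):
    """Record a (start, end) CUDA event pair around each backbone forward call."""

    def pre_hook(_module, _input):
        start = torch.cuda.Event(enable_timing=True)
        start.record()
        event_pairs.append([start, None])

    def post_hook(_module, _input, _output):
        end = torch.cuda.Event(enable_timing=True)
        end.record()
        event_pairs[-1][1] = end

    return [
        backbone.register_forward_pre_hook(pre_hook),
        backbone.register_forward_hook(post_hook),
    ]

# After the timed runs: synchronize once, then read GPU times in milliseconds.
#   torch.cuda.synchronize()
#   backbone_times = [start.elapsed_time(end) for start, end in event_pairs]
#   for handle in handles: handle.remove()
```

This keeps the encoder/decoder out of the measurement while avoiding a device-wide sync inside the hot loop.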
Signed-off-by: ajrasane <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
62-108: Solid implementation with good timing methodology. The CUDA event-based timing correctly measures GPU latency (addressing past feedback), and the warmup/benchmark structure follows best practices.
Consider enhancing the docstring to clarify that the returned latency is per backbone inference step, not per complete image generation:
```diff
 def benchmark_model(pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20):
-    """Benchmark the backbone model inference time."""
+    """Benchmark the backbone model inference time.
+
+    Returns the average latency per backbone inference step (in milliseconds).
+    Note: Each full image generation involves num_inference_steps backbone calls.
+    """
     backbone = pipe.transformer if hasattr(pipe, "transformer") else pipe.unet
```

Optionally, add an explicit check after line 64:
```python
if backbone is None:
    raise ValueError("Pipeline does not have a transformer or unet backbone")
```

Though this is already validated elsewhere, defensive checks can improve robustness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/diffusers/quantization/diffusion_trt.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (4)
examples/diffusers/quantization/diffusion_trt.py (4)
26-26: LGTM! Good UX improvement. The tqdm import enables progress bars during warmup and benchmark runs, improving user experience during long-running operations.
145-145: LGTM! Clear flag documentation. The help text updates and new flags are well-documented and follow CLI conventions.
Also applies to: 151-159
191-195: LGTM! Correct conditional flow. The torch path correctly handles both --benchmark and --skip-image flags with appropriate conditional logic.
275-282: The critical issue about `get_latency()` returning invalid data is based on incorrect assumptions.

`get_latency()` calls `_profile_device()`, which invokes `self.client.profile()` to actively run profiling/inference; it does not rely on cached latency from prior model executions. Therefore, when both `--skip-image` and `--benchmark` flags are set, `get_latency()` will correctly measure and return latency even though `device_model` wasn't used in the image generation step.

The only legitimate issue is the duplicate print statement: `generate_image()` already prints "Image generated saved as {image_name}", making line 277's identical print redundant when `--skip-image` is not set.

Likely an incorrect or invalid review comment.
Signed-off-by: ajrasane <[email protected]>
Actionable comments posted: 2
🧹 Nitpick comments (1)
modelopt/torch/_deploy/utils/torch_onnx.py (1)
519-526: Make external tensor enumeration model-aware to avoid accidental deletions. Enumerating "everything not .onnx" in a directory is brittle. Derive paths from external_data entries in the ONNX file to delete only what the graph actually references.
Apply this diff to replace the helper:
```diff
-def get_external_tensor_paths(model_dir: str) -> list[str]:
-    """Get the paths of the external data tensors in the model."""
-    return [
-        os.path.join(model_dir, file)
-        for file in os.listdir(model_dir)
-        if not file.endswith(".onnx")
-    ]
+def get_external_tensor_paths_from_model(onnx_model_path: str) -> list[str]:
+    """Return filesystem paths of external-data tensors referenced by the model."""
+    model = onnx.load(onnx_model_path, load_external_data=False)
+    base_dir = os.path.dirname(os.path.abspath(onnx_model_path))
+    paths: set[str] = set()
+    for tensor in model.graph.initializer:
+        # external_data is a list of key/value entries; 'location' holds the filename
+        for entry in tensor.external_data:
+            if entry.key == "location" and entry.value:
+                paths.add(os.path.join(base_dir, entry.value))
+    return sorted(paths)
```

Note: The call site above was updated to use get_external_tensor_paths_from_model(onnx_save_path).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/_deploy/utils/torch_onnx.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/_deploy/utils/torch_onnx.py (1)
modelopt/onnx/utils.py (1)
infer_shapes (723-736)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (3)
modelopt/torch/_deploy/utils/torch_onnx.py (3)
447-448: Good swap to centralized shape inference. Using modelopt.onnx.utils.infer_shapes() handles >2GB graphs via infer_shapes_path and keeps logic consistent with the rest of the codebase.
505-506: Confirm ONNX save_model supports convert_attribute in all target envs. Repo context lists onnx==1.19.0, which should support convert_attribute, but CI/user envs may lag. If not present, this call raises TypeError.
Please verify compatibility in your CI/runtime. If needed, gate the arg via signature check:
```python
import inspect, onnx

kwargs = dict(
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location=f"{model_name}.onnx_data",
    size_threshold=1024,
)
if "convert_attribute" in inspect.signature(onnx.save_model).parameters:
    kwargs["convert_attribute"] = False
onnx.save_model(onnx_opt_graph, onnx_save_path, **kwargs)
```
88-97: External-data load path logic looks solid. The `has_external_data()` gate and single-`.onnx` enforcement in the directory prevent ambiguity when bundling external files.
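For context, such a check can be done cheaply without materializing tensor payloads. A sketch of one way to write it; the repo's actual `has_external_data()` may differ:

```python
import onnx
from onnx.external_data_helper import uses_external_data

def has_external_data_sketch(onnx_model_path: str) -> bool:
    """Return True if any initializer stores its payload outside the .onnx file."""
    # load_external_data=False parses only the proto; tensor payloads stay on disk.
    model = onnx.load(onnx_model_path, load_external_data=False)
    return any(uses_external_data(tensor) for tensor in model.graph.initializer)
```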
```python
onnx_path = tempfile.mkdtemp(prefix=f"modelopt_{model_name}_")
onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
```
🛠️ Refactor suggestion | 🟠 Major
Ensure temp directory cleanup on exceptions to prevent GB-sized leaks.
If any step between export and OnnxBytes() raises, onnx_path persists. Use try/finally (or TemporaryDirectory) to always rmtree with ignore_errors=True.
Example pattern:
```diff
-    onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
+    onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
+    try:
+        # ... export, optimize, save, build onnx_bytes ...
+        onnx_bytes = OnnxBytes(onnx_save_path)
+        result_bytes = onnx_bytes.to_bytes()
+    finally:
+        if remove_exported_model:
+            shutil.rmtree(onnx_path, ignore_errors=True)
-    onnx_bytes = OnnxBytes(onnx_save_path)
-    if remove_exported_model:
-        shutil.rmtree(onnx_path)
-    return onnx_bytes.to_bytes(), model_metadata
+    return result_bytes, model_metadata
```

Also applies to: 514-516
🤖 Prompt for AI Agents
In modelopt/torch/_deploy/utils/torch_onnx.py around lines 416-418 (and likewise
514-516), the temporary directory created with tempfile.mkdtemp can be leaked if
an exception occurs; wrap the creation+export+OnnxBytes sequence in a
try/finally (or replace mkdtemp with tempfile.TemporaryDirectory as a context
manager) so that in the finally block you call shutil.rmtree(onnx_path,
ignore_errors=True); ensure you produce the ONNX bytes (or otherwise read any
files needed) before the cleanup so the function still returns the expected
data.
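The `TemporaryDirectory` variant mentioned in the prompt could look roughly like this. The function name and the `export_fn` callback are placeholders; the real code additionally bundles external-data files via `OnnxBytes` and honors `remove_exported_model`:

```python
import os
import tempfile
from collections.abc import Callable

def export_onnx_with_cleanup(model_name: str, export_fn: Callable[[str], None]) -> bytes:
    """Export into a temporary directory that is removed even if export_fn raises."""
    with tempfile.TemporaryDirectory(prefix=f"modelopt_{model_name}_") as onnx_dir:
        onnx_save_path = os.path.join(onnx_dir, f"{model_name}.onnx")
        export_fn(onnx_save_path)  # caller-provided export/optimize/save step
        # Read the serialized bytes before the context manager removes the directory.
        with open(onnx_save_path, "rb") as f:
            return f.read()
```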
```diff
     # If the onnx model contains external data store the external tensors in one file and save the onnx model
     if has_external_data(onnx_save_path):
-        tensor_paths = _get_onnx_external_data_tensors(onnx_opt_graph)
+        tensor_paths = get_external_tensor_paths(onnx_path)
         onnx.save_model(
             onnx_opt_graph,
             onnx_save_path,
             save_as_external_data=True,
             all_tensors_to_one_file=True,
             location=f"{model_name}.onnx_data",
             size_threshold=1024,
             convert_attribute=False,
         )
-        for tensor in tensor_paths:
-            tensor_path = os.path.join(onnx_path, tensor)
-            os.remove(tensor_path)
+        for path in tensor_paths:
+            os.remove(path)
     else:
         onnx.save_model(onnx_opt_graph, onnx_save_path)
```
Decide external-data save based on the optimized ModelProto (not the pre-export file).
Current check uses has_external_data(onnx_save_path), which reflects the pre-optimization export. If optimization changes size/layout, you can incorrectly save a >2GB model without external data (protobuf limit), or conversely over-constrain saving. Compute this from onnx_opt_graph instead and fall back to ByteSize guard.
Apply this diff:
```diff
-    # If the onnx model contains external data store the external tensors in one file and save the onnx model
-    if has_external_data(onnx_save_path):
-        tensor_paths = get_external_tensor_paths(onnx_path)
-        onnx.save_model(
-            onnx_opt_graph,
-            onnx_save_path,
-            save_as_external_data=True,
-            all_tensors_to_one_file=True,
-            location=f"{model_name}.onnx_data",
-            size_threshold=1024,
-            convert_attribute=False,
-        )
-        for path in tensor_paths:
-            os.remove(path)
-    else:
-        onnx.save_model(onnx_opt_graph, onnx_save_path)
+    # Decide external-data save from the optimized graph to avoid 2GB protobuf issues.
+    needs_external = (
+        check_model_uses_external_data(onnx_opt_graph) or onnx_opt_graph.ByteSize() > TWO_GB
+    )
+    if needs_external:
+        old_tensor_paths = get_external_tensor_paths_from_model(onnx_save_path)
+        onnx.save_model(
+            onnx_opt_graph,
+            onnx_save_path,
+            save_as_external_data=True,
+            all_tensors_to_one_file=True,
+            location=f"{model_name}.onnx_data",
+            size_threshold=1024,
+            convert_attribute=False,
+        )
+        from contextlib import suppress  # safe local import if not at top
+        for path in old_tensor_paths:
+            with suppress(FileNotFoundError):
+                os.remove(path)
+    else:
+        onnx.save_model(onnx_opt_graph, onnx_save_path)
```

Committable suggestion skipped: line range outside the PR's diff.
```python
def forward_hook(_module, _input, _output):
    end_event.record()
    torch.cuda.synchronize()
```
You can use event synchronize.
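To illustrate the suggestion: synchronizing on the event waits only for that event rather than for all outstanding GPU work. A sketch with illustrative surrounding state, not the PR's final code:

```python
import torch

start_event = torch.cuda.Event(enable_timing=True)  # assumed recorded in a matching pre-hook
end_event = torch.cuda.Event(enable_timing=True)
backbone_times: list[float] = []

def forward_hook(_module, _input, _output):
    end_event.record()
    # Wait only for this event to complete instead of a device-wide synchronize.
    end_event.synchronize()
    backbone_times.append(start_event.elapsed_time(end_event))  # milliseconds
```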
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/diffusers/quantization/diffusion_trt.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/diffusers/quantization/diffusion_trt.py (2)
examples/diffusers/cache_diffusion/pipeline/deploy.py (1)
compile (202-213)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
to (115-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (6)
examples/diffusers/quantization/diffusion_trt.py (6)
26-26: LGTM! The tqdm import is appropriate for displaying progress during warmup and benchmark runs.
149-149: LGTM! Help text update accurately describes the parameter.
155-166: LGTM! The new CLI flags are well-defined and the help text clearly describes their purpose.
204-210: LGTM! The conditional benchmarking and image generation logic is correct. The early return properly prevents execution of the TensorRT path.
289-296: Remove this review comment - it is based on incorrect assumptions about the implementation.

The original concern assumes that `device_model.get_latency()` accumulates timing data from prior inference calls. However, the actual implementation shows:

- `get_latency()` calls `_profile_device()`, which independently runs `self.client.profile(compiled_model=...)`
- This is an active profiling operation, not accumulation of cached data
- The return type is `float` (not `Optional[float]`), confirming it always returns valid data

When `--skip-image` and `--benchmark` are both specified, the code works correctly:

- Image generation is skipped (lines 289-291)
- `get_latency()` is called (lines 293-296)
- Profiling runs independently and returns valid latency metrics
No code changes needed.
Likely an incorrect or invalid review comment.
62-113: No critical issue found; the average latency calculation is correct.

The calculation on line 110 divides by `(num_runs * num_inference_steps)` based on the assumption that the backbone is called exactly once per inference step. For the supported pipeline models (SDXL, SD3, Flux), this assumption holds: in standard diffusers implementations, the UNet/Transformer is invoked once during each denoising step and not elsewhere (text encoders are separate modules). The hooks correctly capture all backbone forward passes, and the division yields the accurate per-step latency.

Note: This calculation is contingent on the diffusers library behavior. If the pipeline implementation changes or a different model is used with multiple backbone calls per step, the calculation would need adjustment. Consider documenting this assumption or adding a verification counter to log actual call counts.
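A verification counter along these lines would make the one-call-per-step assumption checkable; names are illustrative, not the PR's code:

```python
import torch.nn as nn

def attach_call_counter(backbone: nn.Module):
    """Count backbone forward calls so the per-step average can be sanity-checked."""
    counter = {"calls": 0}

    def counting_hook(_module, _input, _output):
        counter["calls"] += 1

    handle = backbone.register_forward_hook(counting_hook)
    return counter, handle

# Usage sketch around the benchmark loop:
#   counter, handle = attach_call_counter(backbone)
#   ... run num_runs generations of num_inference_steps each ...
#   handle.remove()
#   assert counter["calls"] == num_runs * num_inference_steps, counter["calls"]
```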
```python
if args.torch_compile:
    assert args.model_dtype in ["BFloat16", "Float"], (
        "torch.compile() only supports BFloat16 and Float"
    )
    print("Compiling backbone with torch.compile()...")
    backbone = torch.compile(backbone)
```
Move torch.compile() inside the torch path block.
This code executes torch.compile() regardless of whether --torch is specified. When using the TensorRT path (without --torch), compiling the backbone serves no purpose since it gets converted to a TRT engine later. This is both incorrect and wasteful.
Move this block inside the if args.torch: block (after line 198):
```diff
 if args.restore_from:
     mto.restore(backbone, args.restore_from)

-if args.torch_compile:
-    assert args.model_dtype in ["BFloat16", "Float"], (
-        "torch.compile() only supports BFloat16 and Float"
-    )
-    print("Compiling backbone with torch.compile()...")
-    backbone = torch.compile(backbone)
-
 if args.torch:
+    if args.torch_compile:
+        assert args.model_dtype in ["BFloat16", "Float"], (
+            "torch.compile() only supports BFloat16 and Float"
+        )
+        print("Compiling backbone with torch.compile()...")
+        backbone = torch.compile(backbone)
+
     if hasattr(pipe, "transformer"):
         pipe.transformer = backbone
     elif hasattr(pipe, "unet"):
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-if args.torch_compile:
-    assert args.model_dtype in ["BFloat16", "Float"], (
-        "torch.compile() only supports BFloat16 and Float"
-    )
-    print("Compiling backbone with torch.compile()...")
-    backbone = torch.compile(backbone)
+if args.restore_from:
+    mto.restore(backbone, args.restore_from)
+if args.torch:
+    if args.torch_compile:
+        assert args.model_dtype in ["BFloat16", "Float"], (
+            "torch.compile() only supports BFloat16 and Float"
+        )
+        print("Compiling backbone with torch.compile()...")
+        backbone = torch.compile(backbone)
+    if hasattr(pipe, "transformer"):
+        pipe.transformer = backbone
+    elif hasattr(pipe, "unet"):
+        pipe.unet = backbone
```
🤖 Prompt for AI Agents
In examples/diffusers/quantization/diffusion_trt.py around lines 191-196, the
torch.compile() block runs unconditionally even when using the TensorRT path;
remove or cut the if args.torch_compile: ... backbone = torch.compile(backbone)
block from here and paste it inside the existing if args.torch: block (after
line 198) so compilation only happens for the torch execution path; keep the
model_dtype assertion and print message, and ensure indentation matches the
torch block.
Signed-off-by: ajrasane <[email protected]>
99a7e66 to 779be79 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
191-196: Move torch.compile() inside the torch path block.

This block executes unconditionally, even when using the TensorRT path. When `--torch` is not specified, the backbone gets compiled and then immediately converted to a TensorRT engine, wasting time and resources.

Additionally, the assertion at line 192 now includes "Half" (float16), but torch.compile() may not fully support float16 on all operations.
Move this block inside the `if args.torch:` block (after line 198):

```diff
 if args.restore_from:
     mto.restore(backbone, args.restore_from)

-if args.torch_compile:
-    assert args.model_dtype in ["BFloat16", "Float", "Half"], (
-        "torch.compile() only supports BFloat16 and Float"
-    )
-    print("Compiling backbone with torch.compile()...")
-    backbone = torch.compile(backbone, mode="max-autotune")
-
 if args.torch:
+    if args.torch_compile:
+        assert args.model_dtype in ["BFloat16", "Float"], (
+            "torch.compile() only supports BFloat16 and Float"
+        )
+        print("Compiling backbone with torch.compile()...")
+        backbone = torch.compile(backbone, mode="max-autotune")
+
     if hasattr(pipe, "transformer"):
         pipe.transformer = backbone
```
🧹 Nitpick comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
62-113: Consider making the benchmark message backend-agnostic. The message at line 111 hardcodes "torch backbone", which is accurate for the current usage in the torch path. However, if this function is ever reused for other backends (e.g., TensorRT-optimized), the message would be misleading.
Consider adding a parameter to specify the backend name:
```diff
 def benchmark_model(
-    pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half"
+    pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half", backend_name="torch"
 ):
     """Benchmark the backbone model inference time."""
     ...
-    print(f"Inference latency of the torch backbone: {avg_latency:.2f} ms")
+    print(f"Inference latency of the {backend_name} backbone: {avg_latency:.2f} ms")
     return avg_latency
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/diffusers/quantization/diffusion_trt.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/diffusers/quantization/diffusion_trt.py (1)
examples/diffusers/cache_diffusion/pipeline/deploy.py (1)
compile (202-213)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
289-296: The review comment misunderstood how `get_latency()` works.

It does not rely on prior inference data. Instead, `get_latency()` calls `_profile_device()`, which invokes `client.profile()` to independently profile the compiled model. This means calling `get_latency()` when `--skip-image` is specified works correctly and will produce valid latency measurements, regardless of whether image generation occurred.

The TensorRT and Torch paths use different profiling methodologies (TensorRT's `client.profile()` vs. an explicit `benchmark_model()` with configurable warmup/runs), but both are valid approaches for latency measurement. No code changes are required.

Likely an incorrect or invalid review comment.
LGTM
Signed-off-by: ajrasane <[email protected]> Signed-off-by: Zhiyu Cheng <[email protected]>
Signed-off-by: ajrasane <[email protected]> Signed-off-by: Zhiyu Cheng <[email protected]>
What does this PR do?
Type of change:
Example update
Overview:
get_onnx_bytes_and_metadata() API to remove redundant external data for ONNX models

Usage
Benchmark Flux model torch
Benchmark Flux model torch compile
Benchmark Flux model TRT
Testing
Flux backbone latencies:
Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Documentation
Refactor