
Conversation

Contributor

@ajrasane ajrasane commented Oct 21, 2025

What does this PR do?

Type of change:
Example update

Overview:

  • Added a flag to benchmark the torch and device models
  • Added a flag to skip image generation
  • Added a flag to use torch.compile() for BF16, FP16, and FP32 precisions
  • Updated the get_onnx_bytes_and_metadata() API to remove redundant external data for ONNX models

Usage

Benchmark the Flux model with torch:

python diffusion_trt.py --model flux-dev --torch --benchmark --model-dtype BFloat16 --skip-image
python diffusion_trt.py --model flux-dev --torch --benchmark --model-dtype Half --skip-image

Benchmark the Flux model with torch.compile():

python diffusion_trt.py --model flux-dev --torch --torch-compile --benchmark --model-dtype BFloat16 --skip-image
python diffusion_trt.py --model flux-dev --torch --torch-compile --benchmark --model-dtype Half --skip-image

Benchmark the Flux model with TensorRT:

python diffusion_trt.py --model flux-dev --benchmark --model-dtype BFloat16 --skip-image
python diffusion_trt.py --model flux-dev --benchmark --model-dtype Half --skip-image

Testing

Flux backbone latencies:

Framework         BF16 (ms)   FP16 (ms)
torch             153.32      153.75
torch.compile()   85.77       87.05
TensorRT          65.4427     70.1626

Before your PR is "Ready for review"

  • Make sure you read and follow Contributor guidelines and your commits are signed.
  • Is this change backward compatible?: Yes
  • Did you write any new necessary tests?: No
  • Did you add or update any necessary documentation?: No
  • Did you update Changelog?: No

Summary by CodeRabbit

  • New Features

    • Added model benchmarking capability to measure inference latency during pipeline execution
    • Added CLI flags to enable benchmarking and optionally skip image generation during tests
  • Documentation

    • Enhanced CLI help text for improved clarity and TensorRT-specific wording when benchmarking
  • Refactor

    • Improved ONNX external-data handling and temporary model file management for more reliable export/import workflows

copy-pr-bot bot commented Oct 21, 2025

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai bot commented Oct 21, 2025

Walkthrough

Adds a CUDA-timed backbone benchmarking function with new CLI flags and optional torch.compile() for diffusion inference. Refactors ONNX external-data handling by introducing helpers to detect and list external tensors, switching to per-model temporary ONNX directories, updating save/load calls, and adjusting cleanup logic.

Changes

Diffusion benchmarking & CLI — examples/diffusers/quantization/diffusion_trt.py
Adds benchmark_model(pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half"), which attaches forward hooks and uses CUDA events to measure backbone latency; introduces the CLI flags --benchmark, --torch-compile, and --skip-image; integrates optional torch.compile() on the backbone; makes image generation conditional and updates help/latency messaging to reference TensorRT (a sketch of this timing pattern follows below).

ONNX external-data handling — modelopt/torch/_deploy/utils/torch_onnx.py
Replaces inline external-data detection with has_external_data and adds get_external_tensor_paths(model_dir: str) -> list[str]; switches to infer_shapes(...); uses tempfile.mkdtemp() for per-model temporary ONNX directories; writes external data with convert_attribute=False; updates load/save and cleanup to remove the new temp directory and external tensor files appropriately.
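
For orientation, here is a minimal sketch of the hook-plus-CUDA-event timing pattern described above. It is not the PR's exact code: the function name, the output_type="latent" shortcut, and the single reused event pair are assumptions.

import torch
from tqdm import tqdm


def benchmark_backbone_sketch(pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20):
    """Time only the backbone (transformer/unet) forward passes with CUDA events."""
    backbone = pipe.transformer if hasattr(pipe, "transformer") else pipe.unet
    start_event = torch.cuda.Event(enable_timing=True)
    end_event = torch.cuda.Event(enable_timing=True)
    times_ms: list[float] = []

    def pre_hook(_module, _args):
        start_event.record()

    def post_hook(_module, _args, _output):
        end_event.record()
        end_event.synchronize()  # wait for this forward pass only
        times_ms.append(start_event.elapsed_time(end_event))

    pre_handle = backbone.register_forward_pre_hook(pre_hook)
    post_handle = backbone.register_forward_hook(post_hook)
    try:
        for _ in tqdm(range(num_warmup), desc="Warmup"):
            pipe(prompt, num_inference_steps=num_inference_steps, output_type="latent")
        times_ms.clear()  # discard warmup timings
        for _ in tqdm(range(num_runs), desc="Benchmark"):
            pipe(prompt, num_inference_steps=num_inference_steps, output_type="latent")
    finally:
        pre_handle.remove()
        post_handle.remove()
    # One backbone call per denoising step, so this is the average per-call latency in ms.
    return sum(times_ms) / len(times_ms)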

Sequence Diagram(s)

sequenceDiagram
    participant User
    participant CLI
    participant Pipeline as Model Pipeline
    participant Benchmark as benchmark_model()
    participant CUDA as CUDA Events

    User->>CLI: run script (flags: --benchmark / --skip-image / --torch-compile)
    CLI->>CLI: parse args
    alt torch compile requested
        CLI->>Pipeline: apply torch.compile() to backbone
    end
    alt --benchmark set
        CLI->>Benchmark: call benchmark_model(pipe, prompt, ...)
        Benchmark->>Pipeline: attach forward hooks to backbone
        Benchmark->>CUDA: record warmup start/end
        loop warmup (num_warmup)
            Pipeline->>CUDA: forward (start/end events)
        end
        Benchmark->>CUDA: record timed runs start/end
        loop timed runs (num_runs)
            Pipeline->>CUDA: forward (start/end events)
        end
        Benchmark->>CLI: return average latency
        CLI->>User: print latency (TensorRT wording if applicable)
    end
    alt --skip-image not set
        CLI->>Pipeline: generate and save image
    else --skip-image set
        CLI->>CLI: skip image generation
    end

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I timed my hops with CUDA light,
Warmups, runs through day and night,
External tensors tucked in rows,
Temp dirs gone where the soft wind blows,
A little rabbit hums — benchmark bright. 🥕

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
  • Docstring Coverage ⚠️ Warning — Docstring coverage is 20.00%, which is below the required threshold of 80.00%. Resolution: run @coderabbitai generate docstrings to improve docstring coverage.

✅ Passed checks (2 passed)
  • Title Check ✅ Passed — The PR title "Add option to benchmark pipeline in diffusion_trt.py" is directly related to the primary changes in the pull request. The raw summary confirms that the main modifications involve adding a new benchmark_model() function and integrating benchmarking into the CLI workflow with a --benchmark flag, which aligns well with the title's focus. The title is concise, clear, and specific to the file being modified, making it immediately understandable to a reviewer scanning the commit history. While the PR also includes other flags like --torch-compile and --skip-image, focusing on the primary benchmarking feature is appropriate, since per the guidelines titles don't need to cover every detail.
  • Description Check ✅ Passed — Check skipped; CodeRabbit's high-level summary is enabled.


codecov bot commented Oct 21, 2025

Codecov Report

❌ Patch coverage is 60.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 73.37%. Comparing base (f8a9353) to head (779be79).
⚠️ Report is 14 commits behind head on main.

Files with missing lines                     Patch %   Lines
modelopt/torch/_deploy/utils/torch_onnx.py   60.00%    4 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main     #457   +/-   ##
=======================================
  Coverage   73.37%   73.37%           
=======================================
  Files         180      180           
  Lines       17937    17980   +43     
=======================================
+ Hits        13161    13193   +32     
- Misses       4776     4787   +11     


Signed-off-by: ajrasane <[email protected]>
@ajrasane ajrasane self-assigned this Oct 21, 2025
        module._start_time = time.time()

    def forward_hook(module, input, output):
        torch.cuda.synchronize()
Collaborator

Do you feel it would be more valuable to show the GPU time using CUDA events instead of the CPU time?

With the GPU time you don't need to call an explicit sync.

Contributor Author

Done

    def forward_hook(module, input, output):
        torch.cuda.synchronize()
        module._end_time = time.time()
        backbone_times.append((module._end_time - module._start_time) * 1000)  # Convert to ms
@cjluo-nv (Collaborator) commented Oct 21, 2025

I think it's also fine to just do total time / # of runs. With that, maybe you don't need these hooks.

Contributor Author

The total time would also include the other components of the pipeline, like the encoder and decoder. Hence I have added these hooks only to the backbone, as it takes up the majority of the pipeline time.

Signed-off-by: ajrasane <[email protected]>
@ajrasane ajrasane marked this pull request as ready for review October 22, 2025 00:52
@ajrasane ajrasane requested a review from a team as a code owner October 22, 2025 00:52
@ajrasane ajrasane requested a review from jingyu-ml October 22, 2025 00:52
coderabbitai bot left a comment

Actionable comments posted: 0

🧹 Nitpick comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)

62-108: Solid implementation with good timing methodology.

The CUDA event-based timing correctly measures GPU latency (addressing past feedback), and the warmup/benchmark structure follows best practices.

Consider enhancing the docstring to clarify that the returned latency is per backbone inference step, not per complete image generation:

 def benchmark_model(pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20):
-    """Benchmark the backbone model inference time."""
+    """Benchmark the backbone model inference time.
+    
+    Returns the average latency per backbone inference step (in milliseconds).
+    Note: Each full image generation involves num_inference_steps backbone calls.
+    """
     backbone = pipe.transformer if hasattr(pipe, "transformer") else pipe.unet

Optionally, add an explicit check after line 64:

if backbone is None:
    raise ValueError("Pipeline does not have a transformer or unet backbone")

Though this is already validated elsewhere, defensive checks can improve robustness.

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between ff8a1ed and 89d66d8.

📒 Files selected for processing (1)
  • examples/diffusers/quantization/diffusion_trt.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (4)
examples/diffusers/quantization/diffusion_trt.py (4)

26-26: LGTM! Good UX improvement.

The tqdm import enables progress bars during warmup and benchmark runs, improving user experience during long-running operations.


145-145: LGTM! Clear flag documentation.

The help text updates and new flags are well-documented and follow CLI conventions.

Also applies to: 151-159


191-195: LGTM! Correct conditional flow.

The torch path correctly handles both --benchmark and --skip-image flags with appropriate conditional logic.


275-282:

The critical issue about get_latency() returning invalid data is based on incorrect assumptions. get_latency() calls _profile_device(), which invokes self.client.profile() to actively run profiling/inference—it does not rely on cached latency from prior model executions. Therefore, when both --skip-image and --benchmark flags are set, get_latency() will correctly measure and return latency even though device_model wasn't used in the image generation step.

The only legitimate issue is the duplicate print statement: generate_image() already prints "Image generated saved as {image_name}", making line 277's identical print redundant when --skip-image is not set.

Likely an incorrect or invalid review comment.

@ajrasane ajrasane requested a review from a team as a code owner October 22, 2025 03:24
@ajrasane ajrasane requested a review from i-riyad October 22, 2025 03:24
coderabbitai bot left a comment

Actionable comments posted: 2

🧹 Nitpick comments (1)
modelopt/torch/_deploy/utils/torch_onnx.py (1)

519-526: Make external tensor enumeration model-aware to avoid accidental deletions.

Enumerating “everything not .onnx” in a directory is brittle. Derive paths from external_data entries in the ONNX file to delete only what the graph actually references.

Apply this diff to replace the helper:

-def get_external_tensor_paths(model_dir: str) -> list[str]:
-    """Get the paths of the external data tensors in the model."""
-    return [
-        os.path.join(model_dir, file)
-        for file in os.listdir(model_dir)
-        if not file.endswith(".onnx")
-    ]
+def get_external_tensor_paths_from_model(onnx_model_path: str) -> list[str]:
+    """Return filesystem paths of external-data tensors referenced by the model."""
+    model = onnx.load(onnx_model_path, load_external_data=False)
+    base_dir = os.path.dirname(os.path.abspath(onnx_model_path))
+    paths: set[str] = set()
+    for tensor in model.graph.initializer:
+        # external_data is a list of key/value entries; 'location' holds the filename
+        for entry in tensor.external_data:
+            if entry.key == "location" and entry.value:
+                paths.add(os.path.join(base_dir, entry.value))
+    return sorted(paths)

Note: The call site above was updated to use get_external_tensor_paths_from_model(onnx_save_path).

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 89d66d8 and b5223b1.

📒 Files selected for processing (1)
  • modelopt/torch/_deploy/utils/torch_onnx.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/_deploy/utils/torch_onnx.py (1)
modelopt/onnx/utils.py (1)
  • infer_shapes (723-736)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: code-quality
  • GitHub Check: build-docs
🔇 Additional comments (3)
modelopt/torch/_deploy/utils/torch_onnx.py (3)

447-448: Good swap to centralized shape inference.

Using modelopt.onnx.utils.infer_shapes() handles >2GB graphs via infer_shapes_path and keeps logic consistent with the rest of the codebase.


505-506: Confirm ONNX save_model supports convert_attribute in all target envs.

Repo context lists onnx==1.19.0, which should support convert_attribute, but CI/user envs may lag. If not present, this call raises TypeError.

Please verify compatibility in your CI/runtime. If needed, gate the arg via signature check:

import inspect, onnx
kwargs = dict(
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location=f"{model_name}.onnx_data",
    size_threshold=1024,
)
if "convert_attribute" in inspect.signature(onnx.save_model).parameters:
    kwargs["convert_attribute"] = False
onnx.save_model(onnx_opt_graph, onnx_save_path, **kwargs)

88-97: External-data load path logic looks solid.

has_external_data() gate and single .onnx enforcement in the directory prevent ambiguity when bundling external files.
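
For reference, one plausible way such a check can be implemented with the public onnx API (a sketch under assumptions; the actual helper in the module may differ):

import onnx
from onnx.external_data_helper import uses_external_data


def has_external_data_sketch(onnx_model_path: str) -> bool:
    """Return True if any initializer in the model references external data."""
    # Load only the graph structure; external tensor payloads stay on disk.
    model = onnx.load(onnx_model_path, load_external_data=False)
    return any(uses_external_data(tensor) for tensor in model.graph.initializer)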

Comment on lines +416 to 418
onnx_path = tempfile.mkdtemp(prefix=f"modelopt_{model_name}_")
onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")

Contributor

🛠️ Refactor suggestion | 🟠 Major

Ensure temp directory cleanup on exceptions to prevent GB-sized leaks.

If any step between export and OnnxBytes() raises, onnx_path persists. Use try/finally (or TemporaryDirectory) to always rmtree with ignore_errors=True.

Example pattern:

-    onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
+    onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
+    try:
+        # ... export, optimize, save, build onnx_bytes ...
+        onnx_bytes = OnnxBytes(onnx_save_path)
+        result_bytes = onnx_bytes.to_bytes()
+    finally:
+        if remove_exported_model:
+            shutil.rmtree(onnx_path, ignore_errors=True)
-    onnx_bytes = OnnxBytes(onnx_save_path)
-    if remove_exported_model:
-        shutil.rmtree(onnx_path)
-    return onnx_bytes.to_bytes(), model_metadata
+    return result_bytes, model_metadata

Also applies to: 514-516

🤖 Prompt for AI Agents
In modelopt/torch/_deploy/utils/torch_onnx.py around lines 416-418 (and likewise
514-516), the temporary directory created with tempfile.mkdtemp can be leaked if
an exception occurs; wrap the creation+export+OnnxBytes sequence in a
try/finally (or replace mkdtemp with tempfile.TemporaryDirectory as a context
manager) so that in the finally block you call shutil.rmtree(onnx_path,
ignore_errors=True); ensure you produce the ONNX bytes (or otherwise read any
files needed) before the cleanup so the function still returns the expected
data.
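
For illustration, the context-manager variant mentioned above looks roughly like this (the function name and the placeholder write are illustrative, not the module's actual code):

import os
import tempfile


def export_with_auto_cleanup(model_name: str, serialized_model: bytes) -> bytes:
    """Sketch: the temp dir (and any external-data files inside it) is removed even on error."""
    with tempfile.TemporaryDirectory(prefix=f"modelopt_{model_name}_") as onnx_dir:
        onnx_save_path = os.path.join(onnx_dir, f"{model_name}.onnx")
        with open(onnx_save_path, "wb") as f:
            f.write(serialized_model)  # stands in for onnx.save_model(...)
        # Read everything needed back out before the context exits and cleans up.
        with open(onnx_save_path, "rb") as f:
            return f.read()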

Comment on lines 495 to 511
     # If the onnx model contains external data store the external tensors in one file and save the onnx model
     if has_external_data(onnx_save_path):
-        tensor_paths = _get_onnx_external_data_tensors(onnx_opt_graph)
+        tensor_paths = get_external_tensor_paths(onnx_path)
         onnx.save_model(
             onnx_opt_graph,
             onnx_save_path,
             save_as_external_data=True,
             all_tensors_to_one_file=True,
             location=f"{model_name}.onnx_data",
             size_threshold=1024,
             convert_attribute=False,
         )
-        for tensor in tensor_paths:
-            tensor_path = os.path.join(onnx_path, tensor)
-            os.remove(tensor_path)
+        for path in tensor_paths:
+            os.remove(path)
     else:
         onnx.save_model(onnx_opt_graph, onnx_save_path)

Contributor

⚠️ Potential issue | 🔴 Critical

Decide external-data save based on the optimized ModelProto (not the pre-export file).

Current check uses has_external_data(onnx_save_path), which reflects the pre-optimization export. If optimization changes size/layout, you can incorrectly save a >2GB model without external data (protobuf limit), or conversely over-constrain saving. Compute this from onnx_opt_graph instead and fall back to ByteSize guard.

Apply this diff:

-    # If the onnx model contains external data store the external tensors in one file and save the onnx model
-    if has_external_data(onnx_save_path):
-        tensor_paths = get_external_tensor_paths(onnx_path)
-        onnx.save_model(
-            onnx_opt_graph,
-            onnx_save_path,
-            save_as_external_data=True,
-            all_tensors_to_one_file=True,
-            location=f"{model_name}.onnx_data",
-            size_threshold=1024,
-            convert_attribute=False,
-        )
-        for path in tensor_paths:
-            os.remove(path)
-    else:
-        onnx.save_model(onnx_opt_graph, onnx_save_path)
+    # Decide external-data save from the optimized graph to avoid 2GB protobuf issues.
+    needs_external = (
+        check_model_uses_external_data(onnx_opt_graph) or onnx_opt_graph.ByteSize() > TWO_GB
+    )
+    if needs_external:
+        old_tensor_paths = get_external_tensor_paths_from_model(onnx_save_path)
+        onnx.save_model(
+            onnx_opt_graph,
+            onnx_save_path,
+            save_as_external_data=True,
+            all_tensors_to_one_file=True,
+            location=f"{model_name}.onnx_data",
+            size_threshold=1024,
+            convert_attribute=False,
+        )
+        from contextlib import suppress  # safe local import if not at top
+        for path in old_tensor_paths:
+            with suppress(FileNotFoundError):
+                os.remove(path)
+    else:
+        onnx.save_model(onnx_opt_graph, onnx_save_path)

Committable suggestion skipped: line range outside the PR's diff.


    def forward_hook(_module, _input, _output):
        end_event.record()
        torch.cuda.synchronize()
Collaborator

You can use event synchronize here.
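
For illustration, a per-call variant that relies on event-level synchronization rather than a device-wide torch.cuda.synchronize(); the names mirror the excerpt above and the rest is an assumption:

import torch

start_event = torch.cuda.Event(enable_timing=True)
end_event = torch.cuda.Event(enable_timing=True)
backbone_times: list[float] = []


def forward_pre_hook(_module, _args):
    start_event.record()


def forward_hook(_module, _args, _output):
    end_event.record()
    end_event.synchronize()  # wait only for this event, not the whole device
    backbone_times.append(start_event.elapsed_time(end_event))  # milliseconds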

coderabbitai bot left a comment

Actionable comments posted: 1

📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between b5223b1 and 99a7e66.

📒 Files selected for processing (1)
  • examples/diffusers/quantization/diffusion_trt.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/diffusers/quantization/diffusion_trt.py (2)
examples/diffusers/cache_diffusion/pipeline/deploy.py (1)
  • compile (202-213)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
  • to (115-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: linux
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (6)
examples/diffusers/quantization/diffusion_trt.py (6)

26-26: LGTM!

The tqdm import is appropriate for displaying progress during warmup and benchmark runs.


149-149: LGTM!

Help text update accurately describes the parameter.


155-166: LGTM!

The new CLI flags are well-defined and the help text clearly describes their purpose.


204-210: LGTM!

The conditional benchmarking and image generation logic is correct. The early return properly prevents execution of the TensorRT path.


289-296: Remove this review comment - it is based on incorrect assumptions about the implementation.

The original concern assumes that device_model.get_latency() accumulates timing data from prior inference calls. However, the actual implementation shows:

  • get_latency() calls _profile_device() which independently runs self.client.profile(compiled_model=...)
  • This is an active profiling operation, not accumulation of cached data
  • The return type is float (not Optional[float]), confirming it always returns valid data

When --skip-image and --benchmark are both specified, the code works correctly:

  1. Image generation is skipped (line 289–291)
  2. get_latency() is called (line 293–296)
  3. Profiling runs independently and returns valid latency metrics

No code changes needed.

Likely an incorrect or invalid review comment.


62-113: No critical issue found — the average latency calculation is correct.

The calculation on line 110 divides by (num_runs * num_inference_steps) based on the assumption that the backbone is called exactly once per inference step. For the supported pipeline models (SDXL, SD3, Flux), this assumption holds: in standard diffusers implementations, the UNet/Transformer is invoked once during each denoising step and not elsewhere (text encoders are separate modules). The hooks correctly capture all backbone forward passes, and the division yields the accurate per-step latency.

Note: This calculation is contingent on the diffusers library behavior. If the pipeline implementation changes or a different model is used with multiple backbone calls per step, the calculation would need adjustment. Consider documenting this assumption or adding a verification counter to log actual call counts.
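
A minimal sketch of the suggested verification counter (editorial illustration; the function and hook names are assumptions):

def count_backbone_calls(pipe, prompt, num_inference_steps=20) -> int:
    """Count backbone forward passes in one generation; expected to equal num_inference_steps."""
    backbone = pipe.transformer if hasattr(pipe, "transformer") else pipe.unet
    calls = {"n": 0}

    def counting_hook(_module, _args, _output):
        calls["n"] += 1

    handle = backbone.register_forward_hook(counting_hook)
    try:
        pipe(prompt, num_inference_steps=num_inference_steps)
    finally:
        handle.remove()
    return calls["n"]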

Comment on lines 191 to 196
    if args.torch_compile:
        assert args.model_dtype in ["BFloat16", "Float"], (
            "torch.compile() only supports BFloat16 and Float"
        )
        print("Compiling backbone with torch.compile()...")
        backbone = torch.compile(backbone)
Contributor

⚠️ Potential issue | 🔴 Critical

Move torch.compile() inside the torch path block.

This code executes torch.compile() regardless of whether --torch is specified. When using the TensorRT path (without --torch), compiling the backbone serves no purpose since it gets converted to a TRT engine later. This is both incorrect and wasteful.

Move this block inside the if args.torch: block (after line 198):

     if args.restore_from:
         mto.restore(backbone, args.restore_from)
 
-    if args.torch_compile:
-        assert args.model_dtype in ["BFloat16", "Float"], (
-            "torch.compile() only supports BFloat16 and Float"
-        )
-        print("Compiling backbone with torch.compile()...")
-        backbone = torch.compile(backbone)
-
     if args.torch:
+        if args.torch_compile:
+            assert args.model_dtype in ["BFloat16", "Float"], (
+                "torch.compile() only supports BFloat16 and Float"
+            )
+            print("Compiling backbone with torch.compile()...")
+            backbone = torch.compile(backbone)
+        
         if hasattr(pipe, "transformer"):
             pipe.transformer = backbone
         elif hasattr(pipe, "unet"):
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
-    if args.torch_compile:
-        assert args.model_dtype in ["BFloat16", "Float"], (
-            "torch.compile() only supports BFloat16 and Float"
-        )
-        print("Compiling backbone with torch.compile()...")
-        backbone = torch.compile(backbone)
-
     if args.restore_from:
         mto.restore(backbone, args.restore_from)

     if args.torch:
+        if args.torch_compile:
+            assert args.model_dtype in ["BFloat16", "Float"], (
+                "torch.compile() only supports BFloat16 and Float"
+            )
+            print("Compiling backbone with torch.compile()...")
+            backbone = torch.compile(backbone)
+
         if hasattr(pipe, "transformer"):
             pipe.transformer = backbone
         elif hasattr(pipe, "unet"):
             pipe.unet = backbone
🤖 Prompt for AI Agents
In examples/diffusers/quantization/diffusion_trt.py around lines 191-196, the
torch.compile() block runs unconditionally even when using the TensorRT path;
remove or cut the if args.torch_compile: ... backbone = torch.compile(backbone)
block from here and paste it inside the existing if args.torch: block (after
line 198) so compilation only happens for the torch execution path; keep the
model_dtype assertion and print message, and ensure indentation matches the
torch block.

@ajrasane ajrasane force-pushed the ajrasane/diffusers_benchmarking branch from 99a7e66 to 779be79 on October 22, 2025 06:56
coderabbitai bot left a comment

Actionable comments posted: 0

♻️ Duplicate comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)

191-196: Move torch.compile() inside the torch path block.

This block executes unconditionally, even when using the TensorRT path. When --torch is not specified, the backbone gets compiled and then immediately converted to a TensorRT engine, wasting time and resources.

Additionally, the assertion at line 192 now includes "Half" (float16), but torch.compile() may not fully support float16 on all operations.

Move this block inside the if args.torch: block (after line 198):

     if args.restore_from:
         mto.restore(backbone, args.restore_from)
 
-    if args.torch_compile:
-        assert args.model_dtype in ["BFloat16", "Float", "Half"], (
-            "torch.compile() only supports BFloat16 and Float"
-        )
-        print("Compiling backbone with torch.compile()...")
-        backbone = torch.compile(backbone, mode="max-autotune")
-
     if args.torch:
+        if args.torch_compile:
+            assert args.model_dtype in ["BFloat16", "Float"], (
+                "torch.compile() only supports BFloat16 and Float"
+            )
+            print("Compiling backbone with torch.compile()...")
+            backbone = torch.compile(backbone, mode="max-autotune")
+
         if hasattr(pipe, "transformer"):
             pipe.transformer = backbone
🧹 Nitpick comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)

62-113: Consider making the benchmark message backend-agnostic.

The message at line 111 hardcodes "torch backbone", which is accurate for the current usage in the torch path. However, if this function is ever reused for other backends (e.g., TensorRT-optimized), the message would be misleading.

Consider adding a parameter to specify the backend name:

 def benchmark_model(
-    pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half"
+    pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half", backend_name="torch"
 ):
     """Benchmark the backbone model inference time."""
     ...
-    print(f"Inference latency of the torch backbone: {avg_latency:.2f} ms")
+    print(f"Inference latency of the {backend_name} backbone: {avg_latency:.2f} ms")
     return avg_latency
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

📥 Commits

Reviewing files that changed from the base of the PR and between 99a7e66 and 779be79.

📒 Files selected for processing (1)
  • examples/diffusers/quantization/diffusion_trt.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/diffusers/quantization/diffusion_trt.py (1)
examples/diffusers/cache_diffusion/pipeline/deploy.py (1)
  • compile (202-213)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
  • GitHub Check: wait-checks / wait
  • GitHub Check: wait-checks / wait
  • GitHub Check: linux
  • GitHub Check: build-docs
  • GitHub Check: code-quality
🔇 Additional comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)

289-296:

The review comment misunderstood how get_latency() works. It does not rely on prior inference data. Instead, get_latency() calls _profile_device(), which invokes client.profile() to independently profile the compiled model. This means calling get_latency() when --skip-image is specified works correctly and will produce valid latency measurements, regardless of whether image generation occurred.

The TensorRT and Torch paths use different profiling methodologies (TensorRT's client.profile() vs. explicit benchmark_model() with configurable warmup/runs), but both are valid approaches for latency measurement. No code changes are required.

Likely an incorrect or invalid review comment.

@jingyu-ml (Contributor) left a comment

LGTM

@ajrasane ajrasane merged commit 2fd67cc into main Oct 22, 2025
27 checks passed
@ajrasane ajrasane deleted the ajrasane/diffusers_benchmarking branch October 22, 2025 22:32
Edwardf0t1 pushed a commit that referenced this pull request Oct 23, 2025
Edwardf0t1 pushed a commit that referenced this pull request Oct 24, 2025