Add option to benchmark pipeline in diffusion_trt.py #457
Conversation
Signed-off-by: ajrasane <[email protected]>
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here.
Walkthrough

Adds a CUDA-timed backbone benchmarking function with new CLI flags and optional torch.compile() for diffusion inference; refactors ONNX external-data handling by introducing helpers to detect/list external tensors, switching to per-model temporary ONNX directories, updating save/load calls, and adjusting cleanup logic.
Sequence Diagram(s)

```mermaid
sequenceDiagram
    participant User
    participant CLI
    participant Pipeline as Model Pipeline
    participant Benchmark as benchmark_model()
    participant CUDA as CUDA Events
    User->>CLI: run script (flags: --benchmark / --skip-image / --torch-compile)
    CLI->>CLI: parse args
    alt torch compile requested
        CLI->>Pipeline: apply torch.compile() to backbone
    end
    alt --benchmark set
        CLI->>Benchmark: call benchmark_model(pipe, prompt, ...)
        Benchmark->>Pipeline: attach forward hooks to backbone
        Benchmark->>CUDA: record warmup start/end
        loop warmup (num_warmup)
            Pipeline->>CUDA: forward (start/end events)
        end
        Benchmark->>CUDA: record timed runs start/end
        loop timed runs (num_runs)
            Pipeline->>CUDA: forward (start/end events)
        end
        Benchmark->>CLI: return average latency
        CLI->>User: print latency (TensorRT wording if applicable)
    end
    alt --skip-image not set
        CLI->>Pipeline: generate and save image
    else --skip-image set
        CLI->>CLI: skip image generation
    end
```
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
Pre-merge checks and finishing touches: ❌ Failed checks (1 warning), ✅ Passed checks (2 passed)
Signed-off-by: ajrasane <[email protected]>
Signed-off-by: ajrasane <[email protected]>
Codecov Report

❌ Patch coverage is

Additional details and impacted files

```
@@            Coverage Diff            @@
##             main     #457     +/-   ##
=========================================
  Coverage   73.37%   73.37%
=========================================
  Files         180      180
  Lines       17937    17980      +43
=========================================
+ Hits        13161    13193      +32
- Misses       4776     4787      +11
```

☔ View full report in Codecov by Sentry.
Signed-off-by: ajrasane <[email protected]>
```python
module._start_time = time.time()

def forward_hook(module, input, output):
    torch.cuda.synchronize()
```
do you feel it will be more valuable to show the GPU time using cuda event instead of the CPU time?
With the GPU time you don't need to call explicit sync
Done
```python
def forward_hook(module, input, output):
    torch.cuda.synchronize()
    module._end_time = time.time()
    backbone_times.append((module._end_time - module._start_time) * 1000)  # Convert to ms
```
I think it's also fine to just do total time / # of runs. With that, maybe you don't need these hooks.
The total time will also record the time of the other components of the pipeline, like the encoder and decoder. Hence I have added these hooks only to the backbone, as it takes up the majority of the pipeline time.
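For reference, a minimal sketch of the approach being discussed here: hooks attached only to the backbone, recording a CUDA event pair per forward call, with a single synchronize before reading the results. Names and structure are illustrative, not the PR's exact code.

```python
import torch
import torch.nn as nn

def attach_backbone_timing_hooks(backbone: nn.Module, event_pairs: list):
    """Record a (start, end) CUDA event pair around each backbone forward call."""

    def pre_hook(_module, _input):
        start = torch.cuda.Event(enable_timing=True)
        start.record()
        event_pairs.append([start, None])

    def post_hook(_module, _input, _output):
        end = torch.cuda.Event(enable_timing=True)
        end.record()
        event_pairs[-1][1] = end

    return [
        backbone.register_forward_pre_hook(pre_hook),
        backbone.register_forward_hook(post_hook),
    ]

# After the timed runs: synchronize once, then read GPU times in milliseconds.
#   torch.cuda.synchronize()
#   backbone_times = [start.elapsed_time(end) for start, end in event_pairs]
#   for handle in handles: handle.remove()
```

This keeps the encoder/decoder out of the measurement while avoiding a device-wide sync inside the hot loop.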
Signed-off-by: ajrasane <[email protected]>
Actionable comments posted: 0
🧹 Nitpick comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
62-108: Solid implementation with good timing methodology. The CUDA event-based timing correctly measures GPU latency (addressing past feedback), and the warmup/benchmark structure follows best practices.
Consider enhancing the docstring to clarify that the returned latency is per backbone inference step, not per complete image generation:
```diff
 def benchmark_model(pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20):
-    """Benchmark the backbone model inference time."""
+    """Benchmark the backbone model inference time.
+
+    Returns the average latency per backbone inference step (in milliseconds).
+    Note: Each full image generation involves num_inference_steps backbone calls.
+    """
     backbone = pipe.transformer if hasattr(pipe, "transformer") else pipe.unet
```

Optionally, add an explicit check after line 64:
```python
if backbone is None:
    raise ValueError("Pipeline does not have a transformer or unet backbone")
```

Though this is already validated elsewhere, defensive checks can improve robustness.
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/diffusers/quantization/diffusion_trt.py (5 hunks)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (3)
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (4)
examples/diffusers/quantization/diffusion_trt.py (4)
26-26: LGTM! Good UX improvement. The tqdm import enables progress bars during warmup and benchmark runs, improving user experience during long-running operations.
145-145: LGTM! Clear flag documentation. The help text updates and new flags are well-documented and follow CLI conventions.
Also applies to: 151-159
191-195: LGTM! Correct conditional flow. The torch path correctly handles both --benchmark and --skip-image flags with appropriate conditional logic.
275-282: The critical issue about `get_latency()` returning invalid data is based on incorrect assumptions.

`get_latency()` calls `_profile_device()`, which invokes `self.client.profile()` to actively run profiling/inference; it does not rely on cached latency from prior model executions. Therefore, when both `--skip-image` and `--benchmark` flags are set, `get_latency()` will correctly measure and return latency even though `device_model` wasn't used in the image generation step.

The only legitimate issue is the duplicate print statement: `generate_image()` already prints "Image generated saved as {image_name}", making line 277's identical print redundant when `--skip-image` is not set.

Likely an incorrect or invalid review comment.
Signed-off-by: ajrasane <[email protected]>
Actionable comments posted: 2
🧹 Nitpick comments (1)
modelopt/torch/_deploy/utils/torch_onnx.py (1)
519-526: Make external tensor enumeration model-aware to avoid accidental deletions. Enumerating "everything not .onnx" in a directory is brittle. Derive paths from external_data entries in the ONNX file to delete only what the graph actually references.
Apply this diff to replace the helper:
```diff
-def get_external_tensor_paths(model_dir: str) -> list[str]:
-    """Get the paths of the external data tensors in the model."""
-    return [
-        os.path.join(model_dir, file)
-        for file in os.listdir(model_dir)
-        if not file.endswith(".onnx")
-    ]
+def get_external_tensor_paths_from_model(onnx_model_path: str) -> list[str]:
+    """Return filesystem paths of external-data tensors referenced by the model."""
+    model = onnx.load(onnx_model_path, load_external_data=False)
+    base_dir = os.path.dirname(os.path.abspath(onnx_model_path))
+    paths: set[str] = set()
+    for tensor in model.graph.initializer:
+        # external_data is a list of key/value entries; 'location' holds the filename
+        for entry in tensor.external_data:
+            if entry.key == "location" and entry.value:
+                paths.add(os.path.join(base_dir, entry.value))
+    return sorted(paths)
```

Note: The call site above was updated to use get_external_tensor_paths_from_model(onnx_save_path).
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
modelopt/torch/_deploy/utils/torch_onnx.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
modelopt/torch/_deploy/utils/torch_onnx.py (1)
modelopt/onnx/utils.py (1)
infer_shapes (723-736)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: code-quality
- GitHub Check: build-docs
🔇 Additional comments (3)
modelopt/torch/_deploy/utils/torch_onnx.py (3)
447-448: Good swap to centralized shape inference. Using modelopt.onnx.utils.infer_shapes() handles >2GB graphs via infer_shapes_path and keeps logic consistent with the rest of the codebase.
505-506: Confirm ONNX save_model supports convert_attribute in all target envs. Repo context lists onnx==1.19.0, which should support convert_attribute, but CI/user envs may lag. If not present, this call raises TypeError.
Please verify compatibility in your CI/runtime. If needed, gate the arg via signature check:
```python
import inspect, onnx

kwargs = dict(
    save_as_external_data=True,
    all_tensors_to_one_file=True,
    location=f"{model_name}.onnx_data",
    size_threshold=1024,
)
if "convert_attribute" in inspect.signature(onnx.save_model).parameters:
    kwargs["convert_attribute"] = False
onnx.save_model(onnx_opt_graph, onnx_save_path, **kwargs)
```
88-97: External-data load path logic looks solid. The `has_external_data()` gate and single-`.onnx` enforcement in the directory prevent ambiguity when bundling external files.
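For context, such a check can be done cheaply without materializing tensor payloads. A sketch of one way to write it; the repo's actual `has_external_data()` may differ:

```python
import onnx
from onnx.external_data_helper import uses_external_data

def has_external_data_sketch(onnx_model_path: str) -> bool:
    """Return True if any initializer stores its payload outside the .onnx file."""
    # load_external_data=False parses only the proto; tensor payloads stay on disk.
    model = onnx.load(onnx_model_path, load_external_data=False)
    return any(uses_external_data(tensor) for tensor in model.graph.initializer)
```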
```python
onnx_path = tempfile.mkdtemp(prefix=f"modelopt_{model_name}_")
onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
```
🛠️ Refactor suggestion | 🟠 Major
Ensure temp directory cleanup on exceptions to prevent GB-sized leaks.
If any step between export and OnnxBytes() raises, onnx_path persists. Use try/finally (or TemporaryDirectory) to always rmtree with ignore_errors=True.
Example pattern:
```diff
-    onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
+    onnx_save_path = os.path.join(onnx_path, f"{model_name}.onnx")
+    try:
+        # ... export, optimize, save, build onnx_bytes ...
+        onnx_bytes = OnnxBytes(onnx_save_path)
+        result_bytes = onnx_bytes.to_bytes()
+    finally:
+        if remove_exported_model:
+            shutil.rmtree(onnx_path, ignore_errors=True)
-    onnx_bytes = OnnxBytes(onnx_save_path)
-    if remove_exported_model:
-        shutil.rmtree(onnx_path)
-    return onnx_bytes.to_bytes(), model_metadata
+    return result_bytes, model_metadata
```

Also applies to: 514-516
🤖 Prompt for AI Agents
In modelopt/torch/_deploy/utils/torch_onnx.py around lines 416-418 (and likewise
514-516), the temporary directory created with tempfile.mkdtemp can be leaked if
an exception occurs; wrap the creation+export+OnnxBytes sequence in a
try/finally (or replace mkdtemp with tempfile.TemporaryDirectory as a context
manager) so that in the finally block you call shutil.rmtree(onnx_path,
ignore_errors=True); ensure you produce the ONNX bytes (or otherwise read any
files needed) before the cleanup so the function still returns the expected
data.
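The `TemporaryDirectory` variant mentioned in the prompt could look roughly like this. The function name and the `export_fn` callback are placeholders; the real code additionally bundles external-data files via `OnnxBytes` and honors `remove_exported_model`:

```python
import os
import tempfile
from collections.abc import Callable

def export_onnx_with_cleanup(model_name: str, export_fn: Callable[[str], None]) -> bytes:
    """Export into a temporary directory that is removed even if export_fn raises."""
    with tempfile.TemporaryDirectory(prefix=f"modelopt_{model_name}_") as onnx_dir:
        onnx_save_path = os.path.join(onnx_dir, f"{model_name}.onnx")
        export_fn(onnx_save_path)  # caller-provided export/optimize/save step
        # Read the serialized bytes before the context manager removes the directory.
        with open(onnx_save_path, "rb") as f:
            return f.read()
```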
```diff
     # If the onnx model contains external data store the external tensors in one file and save the onnx model
     if has_external_data(onnx_save_path):
-        tensor_paths = _get_onnx_external_data_tensors(onnx_opt_graph)
+        tensor_paths = get_external_tensor_paths(onnx_path)
         onnx.save_model(
             onnx_opt_graph,
             onnx_save_path,
             save_as_external_data=True,
             all_tensors_to_one_file=True,
             location=f"{model_name}.onnx_data",
             size_threshold=1024,
             convert_attribute=False,
         )
-        for tensor in tensor_paths:
-            tensor_path = os.path.join(onnx_path, tensor)
-            os.remove(tensor_path)
+        for path in tensor_paths:
+            os.remove(path)
     else:
         onnx.save_model(onnx_opt_graph, onnx_save_path)
```
Decide external-data save based on the optimized ModelProto (not the pre-export file).
Current check uses has_external_data(onnx_save_path), which reflects the pre-optimization export. If optimization changes size/layout, you can incorrectly save a >2GB model without external data (protobuf limit), or conversely over-constrain saving. Compute this from onnx_opt_graph instead and fall back to ByteSize guard.
Apply this diff:
```diff
-    # If the onnx model contains external data store the external tensors in one file and save the onnx model
-    if has_external_data(onnx_save_path):
-        tensor_paths = get_external_tensor_paths(onnx_path)
-        onnx.save_model(
-            onnx_opt_graph,
-            onnx_save_path,
-            save_as_external_data=True,
-            all_tensors_to_one_file=True,
-            location=f"{model_name}.onnx_data",
-            size_threshold=1024,
-            convert_attribute=False,
-        )
-        for path in tensor_paths:
-            os.remove(path)
-    else:
-        onnx.save_model(onnx_opt_graph, onnx_save_path)
+    # Decide external-data save from the optimized graph to avoid 2GB protobuf issues.
+    needs_external = (
+        check_model_uses_external_data(onnx_opt_graph) or onnx_opt_graph.ByteSize() > TWO_GB
+    )
+    if needs_external:
+        old_tensor_paths = get_external_tensor_paths_from_model(onnx_save_path)
+        onnx.save_model(
+            onnx_opt_graph,
+            onnx_save_path,
+            save_as_external_data=True,
+            all_tensors_to_one_file=True,
+            location=f"{model_name}.onnx_data",
+            size_threshold=1024,
+            convert_attribute=False,
+        )
+        from contextlib import suppress  # safe local import if not at top
+        for path in old_tensor_paths:
+            with suppress(FileNotFoundError):
+                os.remove(path)
+    else:
+        onnx.save_model(onnx_opt_graph, onnx_save_path)
```

Committable suggestion skipped: line range outside the PR's diff.
```python
def forward_hook(_module, _input, _output):
    end_event.record()
    torch.cuda.synchronize()
```
You can use event synchronize.
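To illustrate the suggestion: synchronizing on the event waits only for that event rather than for all outstanding GPU work. A sketch with illustrative surrounding state, not the PR's final code:

```python
import torch

start_event = torch.cuda.Event(enable_timing=True)  # assumed recorded in a matching pre-hook
end_event = torch.cuda.Event(enable_timing=True)
backbone_times: list[float] = []

def forward_hook(_module, _input, _output):
    end_event.record()
    # Wait only for this event to complete instead of a device-wide synchronize.
    end_event.synchronize()
    backbone_times.append(start_event.elapsed_time(end_event))  # milliseconds
```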
Actionable comments posted: 1
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/diffusers/quantization/diffusion_trt.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/diffusers/quantization/diffusion_trt.py (2)
examples/diffusers/cache_diffusion/pipeline/deploy.py (1)
compile (202-213)
modelopt/torch/quantization/qtensor/base_qtensor.py (1)
to (115-123)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: linux
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (6)
examples/diffusers/quantization/diffusion_trt.py (6)
26-26: LGTM! The tqdm import is appropriate for displaying progress during warmup and benchmark runs.
149-149: LGTM! Help text update accurately describes the parameter.
155-166: LGTM! The new CLI flags are well-defined and the help text clearly describes their purpose.
204-210: LGTM! The conditional benchmarking and image generation logic is correct. The early return properly prevents execution of the TensorRT path.
289-296: Remove this review comment - it is based on incorrect assumptions about the implementation.

The original concern assumes that `device_model.get_latency()` accumulates timing data from prior inference calls. However, the actual implementation shows:

- `get_latency()` calls `_profile_device()`, which independently runs `self.client.profile(compiled_model=...)`
- This is an active profiling operation, not accumulation of cached data
- The return type is `float` (not `Optional[float]`), confirming it always returns valid data

When `--skip-image` and `--benchmark` are both specified, the code works correctly:

- Image generation is skipped (lines 289-291)
- `get_latency()` is called (lines 293-296)
- Profiling runs independently and returns valid latency metrics
No code changes needed.
Likely an incorrect or invalid review comment.
62-113: No critical issue found; the average latency calculation is correct.

The calculation on line 110 divides by `(num_runs * num_inference_steps)` based on the assumption that the backbone is called exactly once per inference step. For the supported pipeline models (SDXL, SD3, Flux), this assumption holds: in standard diffusers implementations, the UNet/Transformer is invoked once during each denoising step and not elsewhere (text encoders are separate modules). The hooks correctly capture all backbone forward passes, and the division yields the accurate per-step latency.

Note: This calculation is contingent on the diffusers library behavior. If the pipeline implementation changes or a different model is used with multiple backbone calls per step, the calculation would need adjustment. Consider documenting this assumption or adding a verification counter to log actual call counts.
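A verification counter along these lines would make the one-call-per-step assumption checkable; names are illustrative, not the PR's code:

```python
import torch.nn as nn

def attach_call_counter(backbone: nn.Module):
    """Count backbone forward calls so the per-step average can be sanity-checked."""
    counter = {"calls": 0}

    def counting_hook(_module, _input, _output):
        counter["calls"] += 1

    handle = backbone.register_forward_hook(counting_hook)
    return counter, handle

# Usage sketch around the benchmark loop:
#   counter, handle = attach_call_counter(backbone)
#   ... run num_runs generations of num_inference_steps each ...
#   handle.remove()
#   assert counter["calls"] == num_runs * num_inference_steps, counter["calls"]
```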
```python
if args.torch_compile:
    assert args.model_dtype in ["BFloat16", "Float"], (
        "torch.compile() only supports BFloat16 and Float"
    )
    print("Compiling backbone with torch.compile()...")
    backbone = torch.compile(backbone)
```
Move torch.compile() inside the torch path block.
This code executes torch.compile() regardless of whether --torch is specified. When using the TensorRT path (without --torch), compiling the backbone serves no purpose since it gets converted to a TRT engine later. This is both incorrect and wasteful.
Move this block inside the if args.torch: block (after line 198):
```diff
 if args.restore_from:
     mto.restore(backbone, args.restore_from)

-if args.torch_compile:
-    assert args.model_dtype in ["BFloat16", "Float"], (
-        "torch.compile() only supports BFloat16 and Float"
-    )
-    print("Compiling backbone with torch.compile()...")
-    backbone = torch.compile(backbone)
-
 if args.torch:
+    if args.torch_compile:
+        assert args.model_dtype in ["BFloat16", "Float"], (
+            "torch.compile() only supports BFloat16 and Float"
+        )
+        print("Compiling backbone with torch.compile()...")
+        backbone = torch.compile(backbone)
+
     if hasattr(pipe, "transformer"):
         pipe.transformer = backbone
     elif hasattr(pipe, "unet"):
```

📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
```diff
-if args.torch_compile:
-    assert args.model_dtype in ["BFloat16", "Float"], (
-        "torch.compile() only supports BFloat16 and Float"
-    )
-    print("Compiling backbone with torch.compile()...")
-    backbone = torch.compile(backbone)
+if args.restore_from:
+    mto.restore(backbone, args.restore_from)
+if args.torch:
+    if args.torch_compile:
+        assert args.model_dtype in ["BFloat16", "Float"], (
+            "torch.compile() only supports BFloat16 and Float"
+        )
+        print("Compiling backbone with torch.compile()...")
+        backbone = torch.compile(backbone)
+    if hasattr(pipe, "transformer"):
+        pipe.transformer = backbone
+    elif hasattr(pipe, "unet"):
+        pipe.unet = backbone
```
🤖 Prompt for AI Agents
In examples/diffusers/quantization/diffusion_trt.py around lines 191-196, the
torch.compile() block runs unconditionally even when using the TensorRT path;
remove or cut the if args.torch_compile: ... backbone = torch.compile(backbone)
block from here and paste it inside the existing if args.torch: block (after
line 198) so compilation only happens for the torch execution path; keep the
model_dtype assertion and print message, and ensure indentation matches the
torch block.
Signed-off-by: ajrasane <[email protected]>
99a7e66 to 779be79 (Compare)
Actionable comments posted: 0
♻️ Duplicate comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
191-196: Move torch.compile() inside the torch path block.

This block executes unconditionally, even when using the TensorRT path. When `--torch` is not specified, the backbone gets compiled and then immediately converted to a TensorRT engine, wasting time and resources.

Additionally, the assertion at line 192 now includes "Half" (float16), but torch.compile() may not fully support float16 on all operations.
Move this block inside the `if args.torch:` block (after line 198):

```diff
 if args.restore_from:
     mto.restore(backbone, args.restore_from)

-if args.torch_compile:
-    assert args.model_dtype in ["BFloat16", "Float", "Half"], (
-        "torch.compile() only supports BFloat16 and Float"
-    )
-    print("Compiling backbone with torch.compile()...")
-    backbone = torch.compile(backbone, mode="max-autotune")
-
 if args.torch:
+    if args.torch_compile:
+        assert args.model_dtype in ["BFloat16", "Float"], (
+            "torch.compile() only supports BFloat16 and Float"
+        )
+        print("Compiling backbone with torch.compile()...")
+        backbone = torch.compile(backbone, mode="max-autotune")
+
     if hasattr(pipe, "transformer"):
         pipe.transformer = backbone
```
🧹 Nitpick comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
62-113: Consider making the benchmark message backend-agnostic. The message at line 111 hardcodes "torch backbone", which is accurate for the current usage in the torch path. However, if this function is ever reused for other backends (e.g., TensorRT-optimized), the message would be misleading.
Consider adding a parameter to specify the backend name:
```diff
 def benchmark_model(
-    pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half"
+    pipe, prompt, num_warmup=10, num_runs=50, num_inference_steps=20, model_dtype="Half", backend_name="torch"
 ):
     """Benchmark the backbone model inference time."""
     ...
-    print(f"Inference latency of the torch backbone: {avg_latency:.2f} ms")
+    print(f"Inference latency of the {backend_name} backbone: {avg_latency:.2f} ms")
     return avg_latency
```
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
📒 Files selected for processing (1)
examples/diffusers/quantization/diffusion_trt.py (5 hunks)
🧰 Additional context used
🧬 Code graph analysis (1)
examples/diffusers/quantization/diffusion_trt.py (1)
examples/diffusers/cache_diffusion/pipeline/deploy.py (1)
compile (202-213)
⏰ Context from checks skipped due to timeout of 90000ms. You can increase the timeout in your CodeRabbit configuration to a maximum of 15 minutes (900000ms). (5)
- GitHub Check: wait-checks / wait
- GitHub Check: wait-checks / wait
- GitHub Check: linux
- GitHub Check: build-docs
- GitHub Check: code-quality
🔇 Additional comments (1)
examples/diffusers/quantization/diffusion_trt.py (1)
289-296: The review comment misunderstood how `get_latency()` works.

It does not rely on prior inference data. Instead, `get_latency()` calls `_profile_device()`, which invokes `client.profile()` to independently profile the compiled model. This means calling `get_latency()` when `--skip-image` is specified works correctly and will produce valid latency measurements, regardless of whether image generation occurred.

The TensorRT and Torch paths use different profiling methodologies (TensorRT's `client.profile()` vs. an explicit `benchmark_model()` with configurable warmup/runs), but both are valid approaches for latency measurement. No code changes are required.

Likely an incorrect or invalid review comment.
LGTM
Signed-off-by: ajrasane <[email protected]> Signed-off-by: Zhiyu Cheng <[email protected]>
Signed-off-by: ajrasane <[email protected]> Signed-off-by: Zhiyu Cheng <[email protected]>
What does this PR do?
Type of change:
Example update
Overview:
get_onnx_bytes_and_metadata() API to remove redundant external data for ONNX models

Usage
Benchmark Flux model torch
Benchmark Flux model torch compile
Benchmark Flux model TRT
Testing
Flux backbone latencies:
Before your PR is "Ready for review"
Summary by CodeRabbit
New Features
Documentation
Refactor