⚡️ Speed up function rescale_noise_cfg by 44%
#143
📄 44% (0.44x) speedup for rescale_noise_cfg in src/diffusers/pipelines/stable_diffusion_xl/pipeline_stable_diffusion_xl_inpaint.py
⏱️ Runtime: 6.65 milliseconds → 4.61 milliseconds (best of 320 runs)
📝 Explanation and details
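For reference, the 44% figure can be reproduced from the two runtimes reported above. The snippet below is a small sketch that assumes the speedup is measured as the old runtime divided by the new runtime, minus one; that definition is an assumption chosen because it matches the reported numbers. The variable names are illustrative only.

```python
# Runtimes reported in this PR, in milliseconds.
old_runtime_ms = 6.65
new_runtime_ms = 4.61

# Assumed definition: speedup relative to the new runtime.
speedup = old_runtime_ms / new_runtime_ms - 1.0

# For comparison: the fraction of the original runtime that was saved.
runtime_reduction = (old_runtime_ms - new_runtime_ms) / old_runtime_ms

print(f"speedup: {speedup:.2f}x ({speedup:.0%})")
print(f"runtime reduction: {runtime_reduction:.1%}")
```

Note that a 44% speedup under this definition corresponds to roughly a 31% reduction in wall-clock time, so the two numbers should not be confused.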
Here is an optimized version of the provided program.
The key performance recommendations below are based on the line profiler output.
Avoid repeated construction of axes:
list(range(1, x.ndim)) is a minor but avoidable overhead, especially when it is built twice. Build it once and reuse it.
Minimize Python-side operations:
Pass the axes as a tuple directly and avoid redundant list constructions.
Avoid recomputing the axes in separate places:
The axes are always tuple(range(1, x.ndim)). A tiny helper could compute them, but to keep the single-function signature they are computed inline.
Use in-place math only where it is safe:
The operations here do not require gradients, so in-place modification is possible. However, in-place ops on PyTorch tensors can interfere with autograd, and the existing out-of-place math is already vectorized, so the out-of-place form is kept for safety.
Avoid unnecessary computation for default arguments:
If guidance_rescale is 0.0, return the original input immediately. Similarly, if it is 1.0, return the fully rescaled output without blending. These early returns skip work for the default values.
Here’s the optimized code.
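The following is a minimal sketch of what the recommendations above amount to, not the exact diff from this PR. It assumes that both tensors have the same number of dimensions, so a single axes tuple can be shared, and that guidance_rescale is a plain Python float.

```python
import torch


def rescale_noise_cfg(noise_cfg, noise_pred_text, guidance_rescale=0.0):
    """Rescale noise_cfg toward the per-sample std of noise_pred_text."""
    # Early return: with guidance_rescale == 0 the blend is just noise_cfg.
    if guidance_rescale == 0.0:
        return noise_cfg

    # Reduce over every dimension except the batch dimension.
    # Assumption: both tensors have the same number of dimensions.
    axes = tuple(range(1, noise_cfg.ndim))
    std_text = noise_pred_text.std(dim=axes, keepdim=True)
    std_cfg = noise_cfg.std(dim=axes, keepdim=True)

    # Match the per-sample std of the guided prediction to the text prediction.
    noise_pred_rescaled = noise_cfg * (std_text / std_cfg)

    # Early return: with guidance_rescale == 1 no blending is needed.
    if guidance_rescale == 1.0:
        return noise_pred_rescaled

    # General case: blend the rescaled and original predictions.
    return guidance_rescale * noise_pred_rescaled + (1.0 - guidance_rescale) * noise_cfg
```

One design consequence worth noting: with guidance_rescale set to 0.0 this sketch returns the input tensor object itself rather than a new tensor, so a caller that later modifies the result in place would also modify its input.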
Summary of optimizations:
This implementation should reduce both Python overhead and runtime, especially for frequent calls on small tensors.
✅ Correctness verification report:
🌀 Generated Regression Tests Details
To edit these changes, git checkout codeflash/optimize-rescale_noise_cfg-mbdv2rmj and push.