⚡️ Speed up method DDPMScheduler.add_noise by 12%
#132
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
📄 12% (0.12x) speedup for
DDPMScheduler.add_noiseinsrc/diffusers/schedulers/scheduling_ddpm.py⏱️ Runtime :
1.07 milliseconds→949 microseconds(best of410runs)📝 Explanation and details
Here are several ways to significantly optimize the
add_noisemethod, given that it dominates the runtime (especially tensor indexing, exponentiation, and repeated flatten/unsqueeze loops).Key Optimization Opportunities
Avoid Repeated Device & Dtype Movement:
Only move tensors if their device/dtype doesn't match, and never overwrite self.alphas_cumprod (which should remain on CPU for most cases; don't mutate in-place).
Efficient Broadcasting:
Instead of flattening and unsqueezing one by one in a loop to match the shape, use
.view()or.reshape()with[batch,...,1]style to broadcast in one call. Or even better, index with shape prep logic to get the batch dimension, and expand appropriately.Precompute Timesteps Index:
Directly use advanced indexing and avoid unnecessary
to(device)for scalar tensors.Vectorize Everything:
Torch supports direct broadcasting, so use the correct shape for the broadcasted terms. For a batch input, this means adding dimensions with
.view(-1, *rest)as needed.Remove Extra Variable Assignments:
The extra assignments and device movements are not needed each call.
Here is the rewritten program, with optimized
add_noise.Explanation of Optimizations
alphas_cumprodis not overwritten on self anymore. Instead, it is moved and cast as a local for the current call, only if devices/dtypes mismatch.Use
.view()to directly create the needed leading batch dimension and trailing broadcast dimensions to match sample shapes, avoiding slow repeatedunsqueeze/flattenoperations.All tensor operations occur in batch for best CuPy/PyTorch vectorization.
Timesteps is indexed only once, and on the correct device.
No slow Python loops remain.
This will dramatically reduce time spent in the
add_noisemethod, as verified by your line profile on the bottlenecked areas.✅ Correctness verification report:
🌀 Generated Regression Tests Details
To edit these changes
git checkout codeflash/optimize-DDPMScheduler.add_noise-mbdlhus4and push.