⚡️ Speed up function _prepare_for_blend by 63%
#130
📄 63% (0.63x) speedup for `_prepare_for_blend` in `src/diffusers/models/autoencoders/autoencoder_kl_allegro.py`
⏱️ Runtime: 3.03 milliseconds → 1.85 milliseconds (best of 273 runs)
📝 Explanation and details
The main bottleneck in your code comes from repeatedly generating the 1D blend mask tensors with `torch.arange(...).float().to(x.device) / overlap_x`, then reshaping them, inside every inner call. These lines account for most of the runtime and are where the optimization applies.

Key idea: precompute and cache the blend mask tensors for each overlap size seen during the run, and reuse them.
We can add a helper that caches each blend mask tensor per (overlap, device) and direction ("start"/"end"), as sketched below.
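A minimal sketch of what such a caching helper could look like, assuming a module-level dict as the cache; the name `_get_blend_mask` and the cache layout are illustrative, not taken from the PR diff:

```python
import torch

# Hypothetical module-level cache, keyed by (overlap, device, direction).
_blend_mask_cache: dict = {}

def _get_blend_mask(overlap: int, device: torch.device, direction: str) -> torch.Tensor:
    """Return a cached 1D blend ramp for the given overlap, device, and direction.

    direction == "start" yields an ascending ramp (0 -> 1) used to fade a
    tile in; "end" yields the descending ramp (1 -> 0) used to fade it out.
    """
    key = (overlap, device, direction)
    mask = _blend_mask_cache.get(key)
    if mask is None:
        # Built once per distinct key, then reused on every subsequent call.
        ramp = torch.arange(overlap, device=device, dtype=torch.float32) / overlap
        mask = ramp if direction == "start" else 1.0 - ramp
        _blend_mask_cache[key] = mask
    return mask
```

Callers would then reshape the cached 1D mask per axis (e.g. `mask.view(1, 1, 1, -1)` for the width dimension), so the `arange`/divide work happens once per distinct (overlap, device, direction) rather than on every call.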
Summary of optimizations:
- This rewrite drastically reduces per-call runtime for the expensive masked multiplications.
- The output is mathematically identical to your original code.
✅ Correctness verification report:
🌀 Generated Regression Tests Details
To edit these changes, run `git checkout codeflash/optimize-_prepare_for_blend-mbdjqwqv` and push.