codeflash-ai bot commented on Jun 1, 2025

📄 5% (0.05x) speedup for HunyuanVideoDownsampleCausal3D.forward in src/diffusers/models/autoencoders/autoencoder_kl_hunyuan_video.py

⏱️ Runtime: 2.41 milliseconds → 2.29 milliseconds (best of 233 runs)

📝 Explanation and details

Optimization notes:

  • The main runtime cost is in `self.conv(hidden_states)`. For inference-only use, wrapping the call in `torch.no_grad()` makes it faster and more memory-efficient.
  • Removed unnecessary variable assignments from the forward pass.
  • Removed the unused `torch.utils.checkpoint` import.
  • Guarded the fastpath behind an `is_contiguous()` check, so it is taken only when the input is already contiguous (avoiding internal tensor copies/allocations in some nn modules, which helps with 3D data) and only during inference; see the sketch after this list.
  • No structural changes that affect the return value or output.
  • Preserved all comments unrelated to the slightly altered code.
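
Below is a minimal sketch of the guarded inference fastpath described above. It is illustrative rather than the committed diff: `DownsampleSketch` is a hypothetical stand-in, and only the `forward` method and the `self.conv` attribute mirror `HunyuanVideoDownsampleCausal3D`.

import torch
import torch.nn as nn

class DownsampleSketch(nn.Module):
    # Hypothetical stand-in for HunyuanVideoDownsampleCausal3D; self.conv
    # plays the role of the causal 3D convolution.
    def __init__(self, channels: int):
        super().__init__()
        self.conv = nn.Conv3d(channels, channels, kernel_size=3, stride=2, padding=1)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # Fastpath only during inference, and only when the input is already
        # contiguous, so the conv can skip internal copies/allocations.
        if (
            not self.training
            and not hidden_states.requires_grad
            and hidden_states.is_contiguous()
        ):
            with torch.no_grad():
                return self.conv(hidden_states)
        # Ordinary autograd-tracked path; the returned values are identical.
        return self.conv(hidden_states)

During training, or whenever the input requires gradients, the module falls through to the ordinary path, so the guard cannot change autograd behavior or outputs.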

Correctness verification report:

| Test | Status |
| --- | --- |
| ⚙️ Existing Unit Tests | 🔘 None Found |
| 🌀 Generated Regression Tests | 37 Passed |
| ⏪ Replay Tests | 🔘 None Found |
| 🔎 Concolic Coverage Tests | 🔘 None Found |
| 📊 Tests Coverage | 100.0% |
🌀 Generated Regression Tests Details
from typing import Optional

# imports
import pytest  # used for our unit tests
import torch
import torch.nn as nn
from src.diffusers.models.autoencoders.autoencoder_kl_hunyuan_video import \
    HunyuanVideoDownsampleCausal3D

# function to test
# Copyright 2024 The Hunyuan Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


# Dummy implementation of HunyuanVideoCausalConv3d for testing purposes
class HunyuanVideoCausalConv3d(nn.Conv3d):
    # For this test, we assume causal convolution is just Conv3d with padding
    # In reality, causal convolution would have a more complex implementation
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, bias=True):
        super().__init__(
            in_channels=in_channels,
            out_channels=out_channels,
            kernel_size=kernel_size,
            stride=stride,
            padding=padding,
            bias=bias
        )
from src.diffusers.models.autoencoders.autoencoder_kl_hunyuan_video import \
    HunyuanVideoDownsampleCausal3D

# unit tests

# ----------------------- BASIC TEST CASES -----------------------

def test_forward_basic_shape_and_type():
    # Test that output shape and dtype are correct for simple input
    model = HunyuanVideoDownsampleCausal3D(channels=3, out_channels=4, kernel_size=3, stride=2, padding=1)
    x = torch.randn(2, 3, 8, 16, 16)  # (batch, channels, depth, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_default_out_channels():
    # Test that out_channels defaults to channels if not specified
    model = HunyuanVideoDownsampleCausal3D(channels=5, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 5, 6, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_bias_false():
    # Test that the model works when bias=False
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, bias=False)
    x = torch.randn(1, 2, 4, 4, 4)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_stride_one():
    # Test with stride=1 (no downsampling)
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, stride=1)
    x = torch.randn(1, 2, 5, 5, 5)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_kernel_size_tuple():
    # Test with kernel_size and stride as tuples
    model = HunyuanVideoDownsampleCausal3D(
        channels=2, out_channels=2, kernel_size=(3, 3, 3), stride=(2, 2, 2), padding=(1, 1, 1)
    )
    x = torch.randn(1, 2, 8, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

# ----------------------- EDGE TEST CASES -----------------------

def test_forward_minimal_input():
    # Test minimal possible input (single batch, single channel, single voxel)
    model = HunyuanVideoDownsampleCausal3D(channels=1, out_channels=1, kernel_size=1, stride=1, padding=0)
    x = torch.randn(1, 1, 1, 1, 1)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_zero_padding():
    # Test with zero padding, kernel size 3, stride 1
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, kernel_size=3, stride=1, padding=0)
    x = torch.randn(1, 2, 5, 5, 5)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_kernel():
    # Test with kernel size equal to input size
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, kernel_size=5, stride=1, padding=0)
    x = torch.randn(1, 2, 5, 5, 5)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_non_divisible_stride():
    # Test when input size is not divisible by stride
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 2, 7, 7, 7)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_singleton_batch_channel():
    # Test with batch size 1 and channel size 1
    model = HunyuanVideoDownsampleCausal3D(channels=1, out_channels=1, kernel_size=3, stride=2, padding=1)
    x = torch.randn(1, 1, 8, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_invalid_input_shape():
    # Test that an error is raised for invalid input shape (not 5D)
    model = HunyuanVideoDownsampleCausal3D(channels=2)
    x = torch.randn(2, 2, 8, 8)  # 4D instead of 5D
    with pytest.raises(RuntimeError):
        model.forward(x)

def test_forward_channels_mismatch():
    # Test that an error is raised if input channels do not match model channels
    model = HunyuanVideoDownsampleCausal3D(channels=3, out_channels=2)
    x = torch.randn(1, 2, 8, 8, 8)  # input channels=2, model expects 3
    with pytest.raises(RuntimeError):
        model.forward(x)

def test_forward_negative_stride():
    # Test that negative stride raises an error
    with pytest.raises(ValueError):
        HunyuanVideoDownsampleCausal3D(channels=2, stride=-1)

def test_forward_zero_stride():
    # Test that zero stride raises an error
    with pytest.raises(ValueError):
        HunyuanVideoDownsampleCausal3D(channels=2, stride=0)

def test_forward_empty_tensor():
    # Test with empty tensor (zero batch size)
    model = HunyuanVideoDownsampleCausal3D(channels=2)
    x = torch.randn(0, 2, 8, 8, 8)
    codeflash_output = model.forward(x); y = codeflash_output

# ----------------------- LARGE SCALE TEST CASES -----------------------

def test_forward_large_batch():
    # Test with large batch size
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x = torch.randn(64, 2, 8, 8, 8)  # 64 batches
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_spatial():
    # Test with large spatial dimensions, but <100MB tensor
    model = HunyuanVideoDownsampleCausal3D(channels=1, out_channels=1)
    x = torch.randn(1, 1, 16, 32, 32)  # 1*1*16*32*32 = 16,384 floats = 65,536 bytes
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_channels():
    # Test with large number of channels, but <100MB tensor
    model = HunyuanVideoDownsampleCausal3D(channels=32, out_channels=64)
    x = torch.randn(2, 32, 8, 8, 8)  # 2*32*8*8*8 = 32,768 floats = 131,072 bytes
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_gradient_flow():
    # Test that gradients flow through the module
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x = torch.randn(2, 2, 8, 8, 8, requires_grad=True)
    codeflash_output = model.forward(x); y = codeflash_output
    loss = y.sum()
    loss.backward()

def test_forward_different_dtypes():
    # Test with float32 and float64
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x32 = torch.randn(1, 2, 8, 8, 8, dtype=torch.float32)
    x64 = torch.randn(1, 2, 8, 8, 8, dtype=torch.float64)
    codeflash_output = model.forward(x32); y32 = codeflash_output
    codeflash_output = model.forward(x64); y64 = codeflash_output

def test_forward_device_cpu_cuda():
    # Test that the model works on both CPU and CUDA (if available)
    model = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2)
    x = torch.randn(1, 2, 8, 8, 8)
    codeflash_output = model.forward(x); y_cpu = codeflash_output
    if torch.cuda.is_available():
        model_cuda = HunyuanVideoDownsampleCausal3D(channels=2, out_channels=2).cuda()
        x_cuda = x.cuda()
        codeflash_output = model_cuda.forward(x_cuda); y_cuda = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.
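
# A hedged sketch of what that original-vs-optimized comparison could look
# like. The harness itself is not shown in this PR; compare_outputs is a
# hypothetical helper, not a Codeflash API.
def compare_outputs(original_fn, optimized_fn, *args):
    # Run both implementations on identical inputs without autograd overhead.
    with torch.no_grad():
        expected = original_fn(*args)
        actual = optimized_fn(*args)
    # Results must agree within floating-point tolerance.
    assert torch.allclose(expected, actual, rtol=1e-5, atol=1e-8)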

from typing import Optional

# imports
import pytest  # used for our unit tests
import torch
import torch.nn as nn
from src.diffusers.models.autoencoders.autoencoder_kl_hunyuan_video import \
    HunyuanVideoDownsampleCausal3D

# function to test
# Copyright 2024 The Hunyuan Team and The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.


class HunyuanVideoCausalConv3d(nn.Conv3d):
    """
    A simple causal 3D convolution implementation for testing purposes.
    This implementation ensures causality by zeroing out weights for future frames in the temporal dimension.
    """
    def __init__(self, in_channels, out_channels, kernel_size, stride, padding, bias=True):
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size, kernel_size, kernel_size)
        if isinstance(stride, int):
            stride = (stride, stride, stride)
        if isinstance(padding, int):
            padding = (padding, padding, padding)
        super().__init__(
            in_channels, out_channels, kernel_size=kernel_size, stride=stride, padding=padding, bias=bias
        )
        self.kernel_size = kernel_size

    def forward(self, input: torch.Tensor) -> torch.Tensor:
        # For a causal convolution, zero out weights for future frames in the temporal dimension
        # Only allow access to current and past frames
        # This is a simplified version for testing
        weight = self.weight.clone()
        t_center = self.kernel_size[0] // 2
        # Zero out weights that correspond to future frames
        if self.kernel_size[0] > 1:
            weight[:, :, t_center + 1 :, :, :] = 0
        # Use F.conv3d directly to avoid recursion
        return nn.functional.conv3d(
            input, weight, self.bias, self.stride, self.padding, self.dilation, self.groups
        )
from src.diffusers.models.autoencoders.autoencoder_kl_hunyuan_video import \
    HunyuanVideoDownsampleCausal3D

# unit tests

# --------- Basic Test Cases ---------

def test_forward_basic_shape_and_type():
    # Test that output shape and type are correct for a simple input
    batch, channels, time, height, width = 2, 3, 8, 16, 16
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output
    # Output shape: batch, channels, downsampled time, downsampled height, downsampled width
    # Default kernel_size=3, stride=2, padding=1
    # Output size formula: floor((input + 2*pad - kernel)//stride + 1)
    def out_dim(i, k=3, s=2, p=1):
        return (i + 2*p - k)//s + 1

def test_forward_out_channels():
    # Test that out_channels argument changes output shape
    batch, channels, time, height, width = 1, 4, 10, 10, 10
    out_channels = 6
    model = HunyuanVideoDownsampleCausal3D(channels, out_channels=out_channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_kernel_and_stride_variants():
    # Test with different kernel_size and stride
    batch, channels, time, height, width = 1, 2, 7, 12, 12
    model = HunyuanVideoDownsampleCausal3D(channels, kernel_size=5, stride=3, padding=2)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output
    # Output shape calculation
    def out_dim(i, k=5, s=3, p=2):
        return (i + 2*p - k)//s + 1

def test_forward_bias_false():
    # Test that disabling bias works
    batch, channels, time, height, width = 1, 2, 5, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels, bias=False)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_grad():
    # Test that gradients flow through the module
    batch, channels, time, height, width = 1, 2, 6, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width, requires_grad=True)
    codeflash_output = model.forward(x); y = codeflash_output
    loss = y.sum()
    loss.backward()

# --------- Edge Test Cases ---------

def test_forward_singleton_dimensions():
    # Test with singleton batch, channel, spatial and temporal dimensions
    x = torch.randn(1, 1, 1, 1, 1)
    model = HunyuanVideoDownsampleCausal3D(1)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_minimum_input_size():
    # Test with minimum input size that allows a single convolution
    # For kernel_size=3, padding=1, stride=2: input size 3
    x = torch.randn(1, 1, 3, 3, 3)
    model = HunyuanVideoDownsampleCausal3D(1)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_odd_even_dimensions():
    # Test with odd and even input sizes
    x = torch.randn(1, 2, 5, 6, 7)
    model = HunyuanVideoDownsampleCausal3D(2)
    codeflash_output = model.forward(x); y = codeflash_output
    # Output shape calculation
    def out_dim(i, k=3, s=2, p=1):
        return (i + 2*p - k)//s + 1


def test_forward_causality():
    # Test that the convolution is causal in the temporal dimension
    # Changing a future frame should not affect the output at the current/past time
    batch, channels, time, height, width = 1, 1, 6, 4, 4
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    x2 = x.clone()
    # Change the last time frame (future frame for earlier outputs)
    x2[:, :, -1] += 1000
    codeflash_output = model.forward(x); y1 = codeflash_output
    codeflash_output = model.forward(x2); y2 = codeflash_output

def test_forward_invalid_input_shape():
    # Test that invalid input shape raises an error
    model = HunyuanVideoDownsampleCausal3D(2)
    x = torch.randn(1, 2, 8, 8)  # missing one dimension
    with pytest.raises(RuntimeError):
        model.forward(x)

def test_forward_channel_mismatch():
    # Test that channel mismatch raises an error
    model = HunyuanVideoDownsampleCausal3D(3)
    x = torch.randn(1, 2, 8, 8, 8)  # input channels != model channels
    with pytest.raises(RuntimeError):
        model.forward(x)

# --------- Large Scale Test Cases ---------

def test_forward_large_batch():
    # Test with large batch size
    batch, channels, time, height, width = 32, 2, 8, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_spatial():
    # Test with large spatial dimensions (but <100MB)
    batch, channels, time, height, width = 1, 2, 4, 64, 64
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output
    # Check output shape
    def out_dim(i, k=3, s=2, p=1):
        return (i + 2*p - k)//s + 1

def test_forward_large_temporal():
    # Test with large temporal dimension (but <100MB)
    batch, channels, time, height, width = 1, 2, 64, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_large_channels():
    # Test with large number of channels (but <100MB)
    batch, channels, time, height, width = 1, 64, 4, 8, 8
    model = HunyuanVideoDownsampleCausal3D(channels)
    x = torch.randn(batch, channels, time, height, width)
    codeflash_output = model.forward(x); y = codeflash_output

def test_forward_multiple_large_inputs():
    # Test with multiple large inputs in a loop (but keep total memory small)
    batch, channels, time, height, width = 2, 8, 8, 16, 16
    model = HunyuanVideoDownsampleCausal3D(channels)
    for _ in range(5):
        x = torch.randn(batch, channels, time, height, width)
        codeflash_output = model.forward(x); y = codeflash_output
# codeflash_output is used to check that the output of the original code is the same as that of the optimized code.

To edit these changes, `git checkout codeflash/optimize-HunyuanVideoDownsampleCausal3D.forward-mbdzh935` and push.

Codeflash
