I encountered an issue related to the input shape of vae.encode in the file cogvideox_image_to_video_lora.py at line 659.
Currently, the code looks like this:
noisy_images = images + torch.randn_like(images) * image_noise_sigma[:, None, None, None, None]
image_latent_dist = vae.encode(noisy_images).latent_dist

However, this results in a shape mismatch error when noisy_images is passed to the VAE. I believe the input should have shape [B, C, F, H, W] rather than the current [B, F, C, H, W]. The modification I made to resolve the issue is as follows:
noisy_images = images + torch.randn_like(images) * image_noise_sigma[:, None, None, None, None]  # [B, F, C, H, W]
noisy_images = noisy_images.permute(0, 2, 1, 3, 4)  # [B, C, F, H, W]
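
For reference, here is a minimal, self-contained sketch of the shape convention I am assuming: AutoencoderKLCogVideoX.encode takes channels-first video input in [B, C, F, H, W] layout, so frames and channels need to be swapped before encoding. The checkpoint name, tensor sizes, and sigma value below are placeholders, not taken from the training script:

```python
import torch
from diffusers import AutoencoderKLCogVideoX

# Hypothetical checkpoint; the issue does not name the model actually used.
vae = AutoencoderKLCogVideoX.from_pretrained(
    "THUDM/CogVideoX-2b", subfolder="vae", torch_dtype=torch.float16
).to("cuda")

# Dummy batch in the layout produced by the dataloader: [B, F, C, H, W].
images = torch.randn(1, 1, 3, 480, 720, dtype=torch.float16, device="cuda")
# Placeholder noise level; the script draws this from a distribution instead.
image_noise_sigma = torch.full((1,), 0.1, dtype=torch.float16, device="cuda")

noisy_images = images + torch.randn_like(images) * image_noise_sigma[:, None, None, None, None]
noisy_images = noisy_images.permute(0, 2, 1, 3, 4)  # -> [B, C, F, H, W]

# With the permute in place, encoding succeeds and the latents follow the
# VAE's channels-first video layout.
image_latent_dist = vae.encode(noisy_images).latent_dist
image_latents = image_latent_dist.sample()
print(image_latents.shape)  # e.g. torch.Size([1, 16, 1, 60, 90])
```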
Without the permute fix, the following error occurs:

File "path/site-packages/diffusers/utils/accelerate_utils.py", line 46, in wrapper
  return method(self, *args, **kwargs)
File "path/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1224, in encode
  h = self._encode(x)
File "path/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1181, in _encode
  return self.tiled_encode(x)
File "path/site-packages/diffusers/models/autoencoders/autoencoder_kl_cogvideox.py", line 1357, in tiled_encode
  row.append(torch.cat(time, dim=2))
RuntimeError: torch.cat(): expected a non-empty list of Tensors

Additionally, I did not run prepare_dataset.py before training. I wanted to confirm whether skipping this step could also be contributing to the issue, or whether my proposed shape transformation is the correct fix.
Any guidance or confirmation would be greatly appreciated!