Add VidTok AutoEncoders #11261

annitang1997 · 2025-04-09T17:07:11Z

We add VidTok, a versatile and state-of-the-art video tokenizer, as an autoencoder model to diffusers.

Paper: https://arxiv.org/pdf/2412.13061
Code: https://github.com/microsoft/VidTok
Model: https://huggingface.co/microsoft/VidTok

a-r-r-o-w · 2025-04-10T06:53:04Z

Thank you for the PR @annitang1997! I will review this in depth soon. cc @yiyixuxu too

deeptimhe · 2025-04-20T09:45:44Z

Are there any updates on the review process? 👀 Looking forward to using VidTok with diffusers.

a-r-r-o-w

Thank you for the PR and congratulations on the release of your awesome work!

I did a first-pass review covering the changes needed to make the implementation consistent with the rest of the diffusers codebase. Some core implementation details will have to be refactored before we can merge. A good reference implementation for autoencoders can be found here:

I'd be happy to help with making some of these changes! 🤗

a-r-r-o-w · 2025-04-21T12:45:37Z

src/diffusers/models/autoencoders/vae.py

        return z_q


+class FSQRegularizer(nn.Module):


We're moving towards maintaining a single file per modeling implementation, and so let's move this to the vidtok autoencoder file

a-r-r-o-w · 2025-04-21T12:46:00Z

src/diffusers/models/downsampling.py

        return F.conv2d(inputs, weight, stride=2)


+class VidTokDownsample2D(nn.Module):


Let's move this to vidtok autoencoder file as well

a-r-r-o-w · 2025-04-21T12:46:04Z

src/diffusers/models/normalization.py

        return hidden_states, encoder_hidden_states, gate[:, None, :], enc_gate[:, None, :]


+class VidTokLayerNorm(nn.Module):


Let's move this to vidtok autoencoder file as well

a-r-r-o-w · 2025-04-21T12:46:11Z

src/diffusers/models/upsampling.py

        return F.conv_transpose2d(inputs, weight, stride=2, padding=self.pad * 2 + 1)


+class VidTokUpsample2D(nn.Module):


Let's move this to vidtok autoencoder file as well

a-r-r-o-w · 2025-04-21T12:47:28Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+import torch
+import torch.nn as nn
+import torch.nn.functional as F
+from einops import pack, rearrange, unpack


We need to replace all einops operations with permute/reshape/other ops, since einops adds another dependency that we don't use in the codebase.
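As a rough illustration of the requested replacement, the sketch below rewrites the "b c t h w -> (b t) c h w" pattern used in this PR, and its inverse, with only permute and reshape. The helper names fold_time_into_batch and unfold_time_from_batch are hypothetical and exist only for this example; in the model the two lines would more likely be written inline.

```python
import torch


def fold_time_into_batch(x: torch.Tensor) -> torch.Tensor:
    """Equivalent of rearrange(x, "b c t h w -> (b t) c h w")."""
    b, c, t, h, w = x.shape
    # Move time next to batch, then merge the two leading dimensions.
    return x.permute(0, 2, 1, 3, 4).reshape(b * t, c, h, w)


def unfold_time_from_batch(x: torch.Tensor, batch_size: int) -> torch.Tensor:
    """Equivalent of rearrange(x, "(b t) c h w -> b c t h w", b=batch_size)."""
    bt, c, h, w = x.shape
    # Split the merged dimension, then move channels back in front of time.
    return x.reshape(batch_size, bt // batch_size, c, h, w).permute(0, 2, 1, 3, 4)


video = torch.arange(2 * 3 * 4 * 5 * 6, dtype=torch.float32).reshape(2, 3, 4, 5, 6)
frames = fold_time_into_batch(video)
restored = unfold_time_from_batch(frames, batch_size=2)
```

The permute has to come before the reshape: reshaping first would merge batch with channels rather than batch with time.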

a-r-r-o-w · 2025-04-21T12:55:06Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+
+            def create_custom_forward(module):
+                def custom_forward(*inputs):
+                    return module.downsample(*inputs)
+
+                return custom_forward
+


Suggested change

-            def create_custom_forward(module):
-                def custom_forward(*inputs):
-                    return module.downsample(*inputs)
-
-                return custom_forward

a-r-r-o-w · 2025-04-21T12:55:37Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+                if i_level in self.spatial_ds:
+                    # spatial downsample
+                    htmp = rearrange(hs[-1], "b c t h w -> (b t) c h w")
+                    htmp = torch.utils.checkpoint.checkpoint(create_custom_forward(self.down[i_level]), htmp)


Suggested change

-                    htmp = torch.utils.checkpoint.checkpoint(create_custom_forward(self.down[i_level]), htmp)
+                    htmp = self._gradient_checkpointing_func(self.down[i_level], htmp)
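One way to see why the closure can go: torch.utils.checkpoint.checkpoint accepts any callable, and an nn.Module is already callable. The sketch below uses plain PyTorch only. The block variable is a stand-in for self.down[i_level], and it is an assumption here that the diffusers helper named in the suggestion ultimately wraps the same checkpoint call.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

torch.manual_seed(0)
# Stand-in for self.down[i_level]; the real block is defined in the PR.
block = nn.Sequential(nn.Conv2d(4, 4, kernel_size=3, padding=1), nn.SiLU())
x = torch.randn(2, 4, 8, 8, requires_grad=True)


def create_custom_forward(module):
    # The wrapper pattern the review asks to remove.
    def custom_forward(*inputs):
        return module(*inputs)

    return custom_forward


# Old style: checkpoint a closure around the module.
out_wrapped = checkpoint(create_custom_forward(block), x, use_reentrant=False)

# Simpler: pass the module itself, since it is already callable.
out_direct = checkpoint(block, x, use_reentrant=False)

# Activations are recomputed during backward instead of being stored.
out_direct.sum().backward()
```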

a-r-r-o-w · 2025-04-21T12:55:52Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+                    B, _, T, H, W = htmp.shape
+            # middle
+            h = hs[-1]
+            h = torch.utils.checkpoint.checkpoint(self.mid.block_1, h, temb)


same comment as above for these usages

a-r-r-o-w · 2025-04-21T12:56:13Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        return h
+
+
+class AutoencoderVidTok(ModelMixin, ConfigMixin, FromOriginalModelMixin):


Suggested change

-class AutoencoderVidTok(ModelMixin, ConfigMixin, FromOriginalModelMixin):
+class AutoencoderVidTok(ModelMixin, ConfigMixin):

a-r-r-o-w · 2025-04-21T12:56:48Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        self.tile_overlap_factor_width = 0.0  # 1 / 8
+
+    @staticmethod
+    def pad_at_dim(


Any methods that are not meant to be invoked directly by users should be made private, that is, prefixed with an underscore (e.g. _pad_at_dim).

annitang1997 · 2025-05-09T16:30:52Z

Hello, I have improved the code based on your feedback. Please check it. 🤗

deeptimhe · 2025-05-23T10:17:41Z

Any updates in this thread? :)

a-r-r-o-w · 2025-05-23T19:34:07Z

@deeptimhe Sorry for the delay, I'm on leave at the moment, and so is @yiyixuxu. I'll try to test the PR and give it a look next week when I'm back

yiyixuxu

thanks for the PR!
I left some feedback. One note on diffusers coding style: we try not to use too many small methods/functions; ideally all the logic is implemented in forward.

I made a few examples in the review. If you can apply similar changes throughout the implementation, that would be great :)

yiyixuxu · 2025-06-17T14:57:41Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+            codes = codes.permute(0, -1, *range(1, codes.dim() - 1))
+        return codes
+
+    @torch.cuda.amp.autocast(enabled=False)


can you remove the autocast?

yiyixuxu · 2025-06-17T15:26:39Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        self.global_codebook_usage = torch.zeros([2**self.codebook_dim, self.num_codebooks], dtype=torch.long)
+
+    @staticmethod
+    def default(*args) -> Any:


can we remove this method and add default directly in the signature?

yiyixuxu · 2025-06-17T15:32:25Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        self.num_codebooks = num_codebooks
+        self.effective_codebook_dim = effective_codebook_dim
+
+        keep_num_codebooks_dim = self.default(keep_num_codebooks_dim, num_codebooks > 1)


Suggested change

-        keep_num_codebooks_dim = self.default(keep_num_codebooks_dim, num_codebooks > 1)
+        if keep_num_codebooks_dim is None:
+            keep_num_codebooks_dim = num_codebooks > 1

yiyixuxu · 2025-06-17T15:36:08Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        self.effective_codebook_dim = effective_codebook_dim
+
+        keep_num_codebooks_dim = self.default(keep_num_codebooks_dim, num_codebooks > 1)
+        assert not (num_codebooks > 1 and not keep_num_codebooks_dim)


Suggested change

-        assert not (num_codebooks > 1 and not keep_num_codebooks_dim)

yiyixuxu · 2025-06-17T15:39:26Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        assert not (num_codebooks > 1 and not keep_num_codebooks_dim)
+        self.keep_num_codebooks_dim = keep_num_codebooks_dim
+
+        self.dim = self.default(dim, len(_levels) * num_codebooks)


Suggested change

-        self.dim = self.default(dim, len(_levels) * num_codebooks)
+        self.dim = len(_levels) * num_codebooks if dim is None else dim
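Taken together, the suggestions on the default helper amount to declaring None as the default in the signature and resolving it inline. A minimal sketch in plain Python follows; the class name FSQArgsSketch and the exact argument list are hypothetical, and only the argument handling mirrors the suggestions above.

```python
from typing import List, Optional


class FSQArgsSketch:
    """Hypothetical container; only the argument handling is of interest."""

    def __init__(
        self,
        levels: List[int],
        dim: Optional[int] = None,
        num_codebooks: int = 1,
        keep_num_codebooks_dim: Optional[bool] = None,
    ):
        # None means "derive from the other arguments"; it is resolved inline
        # instead of through a separate default() helper.
        if keep_num_codebooks_dim is None:
            keep_num_codebooks_dim = num_codebooks > 1
        self.keep_num_codebooks_dim = keep_num_codebooks_dim
        self.dim = len(levels) * num_codebooks if dim is None else dim


derived = FSQArgsSketch(levels=[8, 8, 8], num_codebooks=2)
explicit = FSQArgsSketch(levels=[8, 8, 8], dim=16, keep_num_codebooks_dim=True)
```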

yiyixuxu · 2025-06-17T15:50:48Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        half_width = self._levels // 2
+        return quantized / half_width
+
+    def _scale_and_shift(self, zhat_normalized: torch.Tensor) -> torch.Tensor:


remove this method and move the code into codes_to_indices

yiyixuxu · 2025-06-17T15:51:10Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        half_width = self._levels // 2
+        return (zhat_normalized * half_width) + half_width
+
+    def _scale_and_shift_inverse(self, zhat: torch.Tensor) -> torch.Tensor:


same for this method
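For illustration, here is roughly what the two conversions could look like once the scale-and-shift helpers are inlined. This is a sketch under assumptions, not the PR's code: the mixed-radix basis follows the usual FSQ formulation, indices_to_codes is assumed to be the caller of the inverse helper, and the module-level levels tensor stands in for self._levels.

```python
import torch

levels = torch.tensor([8, 5, 5, 5])
# Mixed-radix basis 1, L0, L0*L1, ... (assumed, following the usual FSQ formulation).
basis = torch.cumprod(torch.cat([torch.ones(1, dtype=levels.dtype), levels[:-1]]), dim=0)


def codes_to_indices(zhat_normalized: torch.Tensor) -> torch.Tensor:
    half_width = levels // 2
    # Inlined _scale_and_shift: map normalized codes to integer levels 0..L-1.
    zhat = (zhat_normalized * half_width) + half_width
    # round() guards against floating-point error before the integer cast.
    return (zhat * basis).sum(dim=-1).round().long()


def indices_to_codes(indices: torch.Tensor) -> torch.Tensor:
    half_width = levels // 2
    # Split each flat index into one digit per level.
    digits = (indices.unsqueeze(-1) // basis) % levels
    # Inlined _scale_and_shift_inverse.
    return (digits - half_width) / half_width


all_indices = torch.arange(int(levels.prod()))
round_trip = codes_to_indices(indices_to_codes(all_indices))
```

Running every index through both functions is a cheap way to check that the inlined arithmetic still round-trips after the refactor.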

yiyixuxu · 2025-06-17T15:53:02Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+                is_video = False
+            z = z.reshape(b, d, -1).permute(0, 2, 1)
+
+        assert z.shape[-1] == self.dim, f"expected dimension of {self.dim} but found dimension of {z.shape[-1]}"


Suggested change

-        assert z.shape[-1] == self.dim, f"expected dimension of {self.dim} but found dimension of {z.shape[-1]}"

yiyixuxu · 2025-06-17T15:54:48Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        self.cache_offset = 0
+
+    @staticmethod
+    def _cast_tuple(t: Union[Tuple[int], int], length: int = 1) -> Tuple[int]:


remove this method

yiyixuxu · 2025-06-17T15:56:46Z

src/diffusers/models/autoencoders/autoencoder_vidtok.py

+        super().__init__()
+        self.pad_mode = pad_mode
+
+        kernel_size = self._cast_tuple(kernel_size, 3)


Suggested change

-        kernel_size = self._cast_tuple(kernel_size, 3)
+        if isinstance(kernel_size, int):
+            kernel_size = (kernel_size,) * 3
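A small sketch of how the inlined check behaves for both accepted input forms. The class name Conv3dArgsSketch and its attribute names are hypothetical; only the isinstance branch comes from the suggestion above.

```python
from typing import Tuple, Union


class Conv3dArgsSketch:
    """Hypothetical stand-in for the layer's __init__ argument handling."""

    def __init__(self, kernel_size: Union[int, Tuple[int, int, int]] = 3):
        # An int is broadcast to (time, height, width); a tuple passes through unchanged.
        if isinstance(kernel_size, int):
            kernel_size = (kernel_size,) * 3
        self.time_kernel_size, self.height_kernel_size, self.width_kernel_size = kernel_size


from_int = Conv3dArgsSketch(3)
from_tuple = Conv3dArgsSketch((1, 3, 3))
```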

annitang1997 · 2025-07-02T07:09:40Z

Hello, I have cleaned the code by removing small methods/functions based on your feedback. Please check it. 🤗

annitang1997 · 2025-07-28T14:17:06Z

Any updates in this thread? :)

annitang1997 added 2 commits April 10, 2025 00:47

add_autoencoder_vidtok

371aa27

Merge branch 'main' into add_autoencoder_vidtok

b2dc1ef

Merge branch 'main' into add_autoencoder_vidtok

0ce6be7

a-r-r-o-w reviewed Apr 21, 2025

annitang1997 added 3 commits May 3, 2025 13:05

Merge branch 'huggingface:main' into add_autoencoder_vidtok

f0f5c58

format standardization

b4e1deb

Merge branch 'huggingface:main' into add_autoencoder_vidtok

a466717

annitang1997 added 2 commits June 11, 2025 20:28

Merge branch 'main' into add_autoencoder_vidtok

4c4c051

Merge branch 'main' into add_autoencoder_vidtok

1ad58e5

yiyixuxu reviewed Jun 17, 2025

annitang1997 added 2 commits July 2, 2025 13:40

Merge branch 'huggingface:main' into add_autoencoder_vidtok

f552028

remove small functions

3506971
