src/diffusers/models/transformers/transformer_cogview3plus.py (21 additions, 16 deletions)
```diff
@@ -140,20 +140,22 @@ class CogView3PlusTransformer2DModel(ModelMixin, ConfigMixin):
         time_embed_dim (`int`, defaults to `512`):
             Output dimension of timestep embeddings.
         condition_dim (`int`, defaults to `256`):
-            The embedding dimension of the input SDXL-style resolution conditions (original_size, target_size, crop_coords).
+            The embedding dimension of the input SDXL-style resolution conditions (original_size, target_size,
+            crop_coords).
         pooled_projection_dim (`int`, defaults to `1536`):
-            The overall pooled dimension by concatenating SDXL-style resolution conditions. As 3 additional conditions are
-            used (original_size, target_size, crop_coords), and each is a sinusoidal condition of dimension `2 * condition_dim`,
-            we get the pooled projection dimension as `2 * condition_dim * 3 => 1536`. The timestep embeddings will be projected
-            to this dimension as well.
-            TODO(yiyi): Do we need this parameter based on the above explanation?
+            The overall pooled dimension by concatenating SDXL-style resolution conditions. As 3 additional conditions
+            are used (original_size, target_size, crop_coords), and each is a sinusoidal condition of dimension `2 *
+            condition_dim`, we get the pooled projection dimension as `2 * condition_dim * 3 => 1536`. The timestep
+            embeddings will be projected to this dimension as well. TODO(yiyi): Do we need this parameter based on the
+            above explanation?
         pos_embed_max_size (`int`, defaults to `128`):
-            The maximum resolution of the positional embeddings, from which slices of shape `H x W` are taken and added to input
-            patched latents, where `H` and `W` are the latent height and width respectively. A value of 128 means that the maximum
-            supported height and width for image generation is `128 * vae_scale_factor * patch_size => 128 * 8 * 2 => 2048`.
+            The maximum resolution of the positional embeddings, from which slices of shape `H x W` are taken and added
+            to input patched latents, where `H` and `W` are the latent height and width respectively. A value of 128
+            means that the maximum supported height and width for image generation is `128 * vae_scale_factor *
+            patch_size => 128 * 8 * 2 => 2048`.
         sample_size (`int`, defaults to `128`):
-            The base resolution of input latents. If height/width is not provided during generation, this value is used to determine
-            the resolution as `sample_size * vae_scale_factor => 128 * 8 => 1024`
+            The base resolution of input latents. If height/width is not provided during generation, this value is used
+            to determine the resolution as `sample_size * vae_scale_factor => 128 * 8 => 1024`
     """

     _supports_gradient_checkpointing = True
```
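The dimension arithmetic described in the docstring above can be sanity-checked with a short, framework-free sketch. The variable names below mirror the config arguments from the docstring; the script is illustrative only and is not code from the file:

```python
# Sanity-check of the dimension arithmetic stated in the docstring.

condition_dim = 256   # embedding dim per SDXL-style resolution condition
num_conditions = 3    # original_size, target_size, crop_coords

# Each condition is a sinusoidal embedding of dimension 2 * condition_dim,
# so concatenating all three yields the pooled projection dimension.
pooled_projection_dim = 2 * condition_dim * num_conditions
assert pooled_projection_dim == 1536

# Maximum supported generation resolution from pos_embed_max_size.
pos_embed_max_size = 128
vae_scale_factor = 8
patch_size = 2
max_resolution = pos_embed_max_size * vae_scale_factor * patch_size
assert max_resolution == 2048

# Default resolution when height/width are not provided.
sample_size = 128
default_resolution = sample_size * vae_scale_factor
assert default_resolution == 1024
```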
```diff
@@ -336,16 +338,19 @@ def forward(
             hidden_states (`torch.Tensor`):
                 Input `hidden_states` of shape `(batch size, channel, height, width)`.
             encoder_hidden_states (`torch.Tensor`):
-                Conditional embeddings (embeddings computed from the input conditions such as prompts)
-                of shape `(batch_size, sequence_len, text_embed_dim)`
+                Conditional embeddings (embeddings computed from the input conditions such as prompts) of shape
+                `(batch_size, sequence_len, text_embed_dim)`
             timestep (`torch.LongTensor`):
                 Used to indicate denoising step.
             original_size (`torch.Tensor`):
-                CogView3 uses SDXL-like micro-conditioning for original image size as explained in section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
+                CogView3 uses SDXL-like micro-conditioning for original image size as explained in section 2.2 of
+                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
             target_size (`torch.Tensor`):
-                CogView3 uses SDXL-like micro-conditioning for target image size as explained in section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
+                CogView3 uses SDXL-like micro-conditioning for target image size as explained in section 2.2 of
+                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
             crop_coords (`torch.Tensor`):
-                CogView3 uses SDXL-like micro-conditioning for crop coordinates as explained in section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
+                CogView3 uses SDXL-like micro-conditioning for crop coordinates as explained in section 2.2 of
+                [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
```
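As a rough illustration of the SDXL-style micro-conditioning these docstrings describe, here is a simplified, framework-free sketch of how three two-valued conditions could each be sinusoidally embedded to `2 * condition_dim` and concatenated into the 1536-dim pooled vector. The function names and the exact frequency schedule are assumptions for illustration; the real implementation lives in diffusers' embedding modules and differs in detail:

```python
import math

def sinusoidal_embedding(value, dim):
    # Standard sin/cos embedding of a single scalar into `dim` values
    # (half sines, half cosines) over a geometric frequency ladder.
    half = dim // 2
    freqs = [math.exp(-math.log(10000.0) * i / half) for i in range(half)]
    return [math.sin(value * f) for f in freqs] + [math.cos(value * f) for f in freqs]

condition_dim = 256  # per the class docstring's default

def embed_condition(pair):
    # Each SDXL-style condition is a pair (e.g. height/width or y/x crop
    # coordinates); embedding each scalar to `condition_dim` values gives a
    # `2 * condition_dim`-dim vector per condition.
    return [x for v in pair for x in sinusoidal_embedding(v, condition_dim)]

original_size, target_size, crop_coords = (1024, 1024), (1024, 1024), (0, 0)
pooled = (
    embed_condition(original_size)
    + embed_condition(target_size)
    + embed_condition(crop_coords)
)
# Three conditions of dimension 2 * condition_dim concatenate to 1536.
assert len(pooled) == 2 * condition_dim * 3 == 1536
```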