docs/source/en/api/pipelines/cogview3.md
Lines changed: 1 addition & 40 deletions
@@ -13,7 +13,7 @@
# limitations under the License.
-->

-# CogVideoX
+# CogView3Plus

[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) from Tsinghua University & ZhipuAI, by Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang.
@@ -29,45 +29,6 @@ Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.m
This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase can be found [here](https://huggingface.co/THUDM). The original weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

-## Inference
-
-Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.
-
-First, load the pipeline:
-
-```python
-import torch
-from diffusers import CogView3PlusPipeline
-from diffusers.utils import export_to_video, load_image
-
-pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3Plus-3b").to("cuda")  # or "THUDM/CogVideoX-2b"
-```
-
-Then change the memory layout of the `transformer` and `vae` components to `torch.channels_last`:
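The removed code that followed this sentence is elided from the view above. As a general illustration only, not the removed text, switching the memory format and compiling the transformer usually looks like the sketch below; `pipe` is the pipeline loaded in the previous block, and the compile mode is an assumption.

```python
import torch

# Illustrative sketch: put the transformer and VAE into channels_last memory
# format, then compile the transformer. `pipe` comes from the loading example above.
pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
```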
-# CogVideoX works well with long and well-described prompts
-prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
-video = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).frames[0]
-```
-
-The [benchmark](TODO) results on an 80GB A100 machine are:
-
-```
-Without torch.compile(): Average inference time: TODO seconds.
-With torch.compile(): Average inference time: TODO seconds.
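Because the removed section above mixed CogVideoX (video) content into the CogView3Plus page, a minimal text-to-image sketch for CogView3Plus would look roughly like the following. The checkpoint id mirrors the removed example; the dtype and call parameters are assumptions, not values confirmed by this diff.

```python
import torch
from diffusers import CogView3PlusPipeline

# Minimal CogView3Plus text-to-image sketch. Verify the exact Hub repo id
# and recommended defaults before relying on this.
pipe = CogView3PlusPipeline.from_pretrained(
    "THUDM/CogView3Plus-3b", torch_dtype=torch.bfloat16
).to("cuda")

prompt = "A detailed photograph of a red panda resting on a mossy tree branch in a misty forest."
image = pipe(prompt=prompt, guidance_scale=7.0, num_inference_steps=50).images[0]
image.save("cogview3plus.png")
```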
-            y (`torch.LongTensor`, *optional*): Label input, used to obtain label embeddings.
-            block_controlnet_hidden_states (`list` of `torch.Tensor`): A list of tensors for residuals.
-            joint_attention_kwargs (`dict`, *optional*): Additional kwargs for the attention processor.
-            return_dict (`bool`, *optional*, defaults to `True`): Whether to return a `Transformer2DModelOutput`.
+            hidden_states (`torch.Tensor`):
+                Input `hidden_states` of shape `(batch size, channel, height, width)`.
+            encoder_hidden_states (`torch.Tensor`):
+                Conditional embeddings (embeddings computed from the input conditions such as prompts)
+                of shape `(batch_size, sequence_len, text_embed_dim)`
+            timestep (`torch.LongTensor`):
+                Used to indicate denoising step.
+            original_size (`torch.Tensor`):
+                CogView3 uses SDXL-like micro-conditioning for original image size as explained in section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
+            target_size (`torch.Tensor`):
+                CogView3 uses SDXL-like micro-conditioning for target image size as explained in section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
+            crop_coords (`torch.Tensor`):
+                CogView3 uses SDXL-like micro-conditioning for crop coordinates as explained in section 2.2 of [https://huggingface.co/papers/2307.01952](https://huggingface.co/papers/2307.01952).
+            return_dict (`bool`, *optional*, defaults to `True`):
+                Whether or not to return a [`~models.transformer_2d.Transformer2DModelOutput`] instead of a plain
+                tuple.

        Returns:
-            Output tensor or `Transformer2DModelOutput`.
+            `torch.Tensor` or [`~models.transformer_2d.Transformer2DModelOutput`]:
+                The denoised latents using provided inputs as conditioning.
        """
        height, width = hidden_states.shape[-2:]
        text_seq_length = encoder_hidden_states.shape[1]

-        hidden_states = self.pos_embed(
+        hidden_states = self.patch_embed(
            hidden_states, encoder_hidden_states
        )  # takes care of adding positional embeddings too.
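The docstring hunk above documents the forward signature of the CogView3Plus transformer, including its SDXL-style micro-conditioning inputs. A rough, illustrative call might look like the sketch below; the checkpoint id, subfolder layout, tensor shapes, and dtypes are assumptions for illustration, not the pipeline's exact code.

```python
import torch
from diffusers import CogView3PlusTransformer2DModel

# Assumed checkpoint layout: the transformer weights live in a "transformer" subfolder.
transformer = CogView3PlusTransformer2DModel.from_pretrained(
    "THUDM/CogView3Plus-3b", subfolder="transformer", torch_dtype=torch.float32
)

batch = 1
# Shapes below are illustrative placeholders, not the model's required sizes.
hidden_states = torch.randn(batch, 16, 128, 128)        # image latents (B, C, H, W)
encoder_hidden_states = torch.randn(batch, 224, 4096)   # prompt embeddings (B, seq_len, text_embed_dim)
timestep = torch.tensor([999], dtype=torch.long)        # current denoising step

# SDXL-like micro-conditioning: one (height, width) or (top, left) pair per sample.
original_size = torch.tensor([[1024, 1024]], dtype=torch.float32)
target_size = torch.tensor([[1024, 1024]], dtype=torch.float32)
crop_coords = torch.tensor([[0, 0]], dtype=torch.float32)

out = transformer(
    hidden_states=hidden_states,
    encoder_hidden_states=encoder_hidden_states,
    timestep=timestep,
    original_size=original_size,
    target_size=target_size,
    crop_coords=crop_coords,
    return_dict=True,
)
denoised = out.sample  # `Transformer2DModelOutput` exposes the result as `.sample`
```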