<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# CogView3Plus

[CogView3: Finer and Faster Text-to-Image Generation via Relay Diffusion](https://huggingface.co/papers/2403.05121) from Tsinghua University & ZhipuAI, by Wendi Zheng, Jiayan Teng, Zhuoyi Yang, Weihan Wang, Jidong Chen, Xiaotao Gu, Yuxiao Dong, Ming Ding, Jie Tang.

The abstract from the paper is:

*Recent advancements in text-to-image generative systems have been largely driven by diffusion models. However, single-stage text-to-image diffusion models still face challenges, in terms of computational efficiency and the refinement of image details. To tackle the issue, we propose CogView3, an innovative cascaded framework that enhances the performance of text-to-image diffusion. CogView3 is the first model implementing relay diffusion in the realm of text-to-image generation, executing the task by first creating low-resolution images and subsequently applying relay-based super-resolution. This methodology not only results in competitive text-to-image outputs but also greatly reduces both training and inference costs. Our experimental results demonstrate that CogView3 outperforms SDXL, the current state-of-the-art open-source text-to-image diffusion model, by 77.0% in human evaluations, all while requiring only about 1/2 of the inference time. The distilled variant of CogView3 achieves comparable performance while only utilizing 1/10 of the inference time by SDXL.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>
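
As a quick, hedged illustration of the scheduler tradeoff mentioned in the tip above, the snippet below lists the scheduler classes that are compatible with the loaded checkpoint's scheduler configuration and sketches how a swap would look. The alternative scheduler name in the comment is a placeholder, not a recommendation; whether a particular scheduler produces good results with CogView3Plus should be verified empirically.

```python
from diffusers import CogView3PlusPipeline

pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3Plus-3b")

# Scheduler classes that share this checkpoint's scheduler configuration.
print(pipe.scheduler.compatibles)

# A compatible scheduler can be swapped in without touching the rest of the pipeline.
# `SomeCompatibleScheduler` is a placeholder for one of the classes printed above.
# pipe.scheduler = SomeCompatibleScheduler.from_config(pipe.scheduler.config)
```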

This pipeline was contributed by [zRzRzRzRzRzRzR](https://github.com/zRzRzRzRzRzRzR). The original codebase and weights can be found under [hf.co/THUDM](https://huggingface.co/THUDM).

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import CogView3PlusPipeline

pipe = CogView3PlusPipeline.from_pretrained("THUDM/CogView3Plus-3b").to("cuda")
```

Then change the memory layout of the `transformer` and `vae` components to `torch.channels_last`:

```python
pipe.transformer.to(memory_format=torch.channels_last)
pipe.vae.to(memory_format=torch.channels_last)
```

Compile the components and run inference:

```python
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)
pipe.vae.decode = torch.compile(pipe.vae.decode, mode="max-autotune", fullgraph=True)

# CogView3Plus works well with long and well-described prompts
prompt = "A panda, dressed in a small, red jacket and a tiny hat, sits on a wooden stool in a serene bamboo forest. The panda's fluffy paws strum a miniature acoustic guitar, producing soft, melodic tunes. Nearby, a few other pandas gather, watching curiously and some clapping in rhythm. Sunlight filters through the tall bamboo, casting a gentle glow on the scene. The panda's face is expressive, showing concentration and joy as it plays. The background includes a small, flowing stream and vibrant green foliage, enhancing the peaceful and magical atmosphere of this unique musical performance."
image = pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50).images[0]
image.save("output.png")
```

The [benchmark](TODO) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: TODO seconds.
With torch.compile(): Average inference time: TODO seconds.
```
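
If you want to reproduce this kind of measurement yourself, a minimal timing sketch is shown below. It is not the benchmark script behind the numbers above, and it assumes `pipe` and `prompt` are already defined as in the previous snippets.

```python
import time

import torch

def time_pipeline(pipe, prompt, n_runs=3):
    # Warmup run so torch.compile's compilation/autotuning cost is not measured.
    pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipe(prompt=prompt, guidance_scale=6, num_inference_steps=50)
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs

print(f"Average inference time: {time_pipeline(pipe, prompt):.2f} seconds")
```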

## CogView3PlusPipeline

[[autodoc]] CogView3PlusPipeline
  - all
  - __call__

## CogView3PipelineOutput

[[autodoc]] pipelines.cogview3.pipeline_output.CogView3PipelineOutput