<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# OmniGen

[OmniGen: Unified Image Generation](https://arxiv.org/pdf/2409.11340) from BAAI, by Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, Zheng Liu.

The abstract from the paper is:
| 21 | + |
| 22 | +*The emergence of Large Language Models (LLMs) has unified language |
| 23 | +generation tasks and revolutionized human-machine interaction. |
| 24 | +However, in the realm of image generation, a unified model capable of handling various tasks |
| 25 | +within a single framework remains largely unexplored. In |
| 26 | +this work, we introduce OmniGen, a new diffusion model |
| 27 | +for unified image generation. OmniGen is characterized |
| 28 | +by the following features: 1) Unification: OmniGen not |
| 29 | +only demonstrates text-to-image generation capabilities but |
| 30 | +also inherently supports various downstream tasks, such |
| 31 | +as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of |
| 32 | +OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion |
| 33 | +models, it is more user-friendly and can complete complex |
| 34 | +tasks end-to-end through instructions without the need for |
| 35 | +extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from |
| 36 | +learning in a unified format, OmniGen effectively transfers |
| 37 | +knowledge across different tasks, manages unseen tasks and |
| 38 | +domains, and exhibits novel capabilities. We also explore |
| 39 | +the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https: |
| 40 | +//github.com/VectorSpaceLab/OmniGen to foster future advancements.* |

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [staoxiao](https://github.com/staoxiao). The original codebase can be found [here](https://github.com/VectorSpaceLab/OmniGen). The original weights can be found under [hf.co/Shitao/OmniGen-v1](https://huggingface.co/Shitao/OmniGen-v1).

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```
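
If you are constrained on GPU memory, you can trade some speed for a lower memory footprint by offloading submodules to the CPU. A minimal sketch using the generic `enable_model_cpu_offload` helper that Diffusers pipelines expose (call it instead of `pipe.to("cuda")` above):

```python
# Optional: keep submodules on the CPU and move each one to the GPU
# only while it is running, instead of holding the whole pipeline in VRAM.
# Use this in place of `pipe.to("cuda")`.
pipe.enable_model_cpu_offload()
```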

OmniGen handles image-conditioned tasks such as image editing and subject-driven generation with this same pipeline: the reference images are passed through the `input_images` argument and referred to in the prompt with `<img><|image_1|></img>`, `<img><|image_2|></img>`, ... placeholders (an example call is shown at the end of this section).

Then change the memory layout of the pipeline's `transformer` component to `torch.channels_last`:

```python
pipe.transformer.to(memory_format=torch.channels_last)
```

Compile the transformer and run inference:

```python
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# OmniGen responds well to detailed, well-described prompts
prompt = "A vintage red convertible parked on a cobblestone street at sunset, warm golden light reflecting off the polished chrome trim, photorealistic."
image = pipe(prompt=prompt, height=1024, width=1024, guidance_scale=2.5, num_inference_steps=50).images[0]
image.save("output.png")
```

The first call after compiling is slow because `torch.compile` captures and optimizes the transformer's graph; subsequent calls with the same input shapes run noticeably faster. The exact speedup depends on the GPU, the resolution, and the number of inference steps.
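
As noted above, image-conditioned tasks go through this same pipeline. A minimal sketch of an editing call, assuming the `input_images` argument, the `<img><|image_1|></img>` prompt placeholder, and an `img_guidance_scale` parameter that weights the image condition; the reference image path is a placeholder to replace with your own file or URL:

```python
from diffusers.utils import load_image

# Placeholder path: substitute your own reference image (local file or URL).
input_images = [load_image("path/to/reference.png")]
prompt = "<img><|image_1|></img> Replace the background with a snowy mountain landscape."
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2.5,
    img_guidance_scale=1.6,  # assumed knob for how strongly the output follows the input image
).images[0]
image.save("edited.png")
```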

## OmniGenPipeline

[[autodoc]] OmniGenPipeline
  - all
  - __call__