
Commit db92c69

test case for omnigen
1 parent 85abe5e commit db92c69

File tree

5 files changed: +374 -306 lines changed

Lines changed: 100 additions & 0 deletions
@@ -0,0 +1,100 @@
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# OmniGen

[OmniGen: Unified Image Generation](https://arxiv.org/pdf/2409.11340) from BAAI, by Shitao Xiao, Yueze Wang, Junjie Zhou, Huaying Yuan, Xingrun Xing, Ruiran Yan, Chaofan Li, Shuting Wang, Tiejun Huang, Zheng Liu.

The abstract from the paper is:

*The emergence of Large Language Models (LLMs) has unified language generation tasks and revolutionized human-machine interaction. However, in the realm of image generation, a unified model capable of handling various tasks within a single framework remains largely unexplored. In this work, we introduce OmniGen, a new diffusion model for unified image generation. OmniGen is characterized by the following features: 1) Unification: OmniGen not only demonstrates text-to-image generation capabilities but also inherently supports various downstream tasks, such as image editing, subject-driven generation, and visual conditional generation. 2) Simplicity: The architecture of OmniGen is highly simplified, eliminating the need for additional plugins. Moreover, compared to existing diffusion models, it is more user-friendly and can complete complex tasks end-to-end through instructions without the need for extra intermediate steps, greatly simplifying the image generation workflow. 3) Knowledge Transfer: Benefit from learning in a unified format, OmniGen effectively transfers knowledge across different tasks, manages unseen tasks and domains, and exhibits novel capabilities. We also explore the model’s reasoning capabilities and potential applications of the chain-of-thought mechanism. This work represents the first attempt at a general-purpose image generation model, and we will release our resources at https://github.com/VectorSpaceLab/OmniGen to foster future advancements.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [staoxiao](https://github.com/staoxiao). The original codebase can be found [here](https://github.com/VectorSpaceLab/OmniGen). The original weights can be found under [hf.co/shitao](https://huggingface.co/Shitao/OmniGen-v1).

## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import OmniGenPipeline

# "Shitao/OmniGen-v1-diffusers" is assumed to be the diffusers-format checkpoint;
# adjust the repo id if the converted weights are published elsewhere.
pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16).to("cuda")
```

OmniGen also accepts reference images through the `input_images` argument (file paths or `PIL.Image.Image` objects); the prompt refers to each image with a placeholder such as `<img><|image_1|></img>`:

```python
from diffusers.utils import load_image

# Any local path, URL, or in-memory PIL image works; this path is a placeholder.
reference = load_image("path/to/reference.png")

prompt = "A portrait of the person in <img><|image_1|></img>, wearing a red jacket."
image = pipe(prompt=prompt, input_images=[reference]).images[0]
```

Then change the memory layout of the pipeline's `transformer` component to `torch.channels_last`:

```python
pipe.transformer.to(memory_format=torch.channels_last)
```

Finally, compile the transformer and run inference:

```python
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

prompt = "A cozy reading nook by a rain-streaked window, warm lamp light, detailed illustration."
image = pipe(prompt=prompt, height=1024, width=1024, num_inference_steps=50).images[0]
image.save("omnigen_output.png")
```

The first call pays a one-time compilation cost; subsequent calls with the same input shapes reuse the compiled graph and run with lower latency.
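
If GPU memory is the constraint rather than latency, model offloading can be used instead. This is a minimal sketch assuming the standard diffusers `enable_model_cpu_offload` API (the pipeline declares `model_cpu_offload_seq = "transformer->vae"`); the checkpoint id is the same assumption as above:

```python
import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()  # each sub-model is moved to the GPU only while it runs

image = pipe(prompt="A lighthouse on a rocky cliff at night, oil painting", num_inference_steps=50).images[0]
```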

## OmniGenPipeline

[[autodoc]] OmniGenPipeline
  - all
  - __call__

src/diffusers/pipelines/omnigen/pipeline_omnigen.py

Lines changed: 3 additions & 7 deletions

```diff
@@ -15,6 +15,7 @@
 import inspect
 from typing import Any, Callable, Dict, List, Optional, Union
 
+import PIL
 import numpy as np
 import torch
 from transformers import LlamaTokenizer
@@ -146,7 +147,7 @@ class OmniGenPipeline(
 
     model_cpu_offload_seq = "transformer->vae"
     _optional_components = []
-    _callback_tensor_inputs = ["latents", "prompt_embeds"]
+    _callback_tensor_inputs = ["latents", "condition_tokens"]
 
     def __init__(
         self,
@@ -361,7 +362,7 @@ def interrupt(self):
     def __call__(
         self,
         prompt: Union[str, List[str]],
-        input_images: Optional[Union[List[str], List[List[str]]]] = None,
+        input_images: Optional[Union[List[str], List[PIL.Image.Image], List[List[str]], List[List[PIL.Image.Image]]]] = None,
         height: Optional[int] = None,
         width: Optional[int] = None,
         num_inference_steps: int = 50,
@@ -527,11 +528,6 @@ def __call__(
             latents,
         )
 
-
-        generator = torch.Generator(device=device).manual_seed(0)
-        latents = torch.randn(1, 4, height//8, width//8, device=device, generator=generator).to(self.transformer.dtype)
-        # latents = torch.cat([latents]*(1+num_cfg), 0).to(dtype)
-
         # 7. Prepare OmniGenCache
         num_tokens_for_output_img = latents.size(-1) * latents.size(-2) // (self.transformer.patch_size ** 2)
         cache = OmniGenCache(num_tokens_for_output_img, offload_kv_cache) if use_kv_cache else None
```
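
Taken together, these changes let `input_images` be a list of file paths or of `PIL.Image.Image` objects, and the pipeline no longer overrides user-supplied randomness with a hard-coded generator. A minimal sketch of the resulting call, assuming the standard diffusers `generator` argument and an `ImagePipelineOutput` return value (the checkpoint id and file path are placeholders):

```python
import torch
from PIL import Image
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained("Shitao/OmniGen-v1-diffusers", torch_dtype=torch.bfloat16).to("cuda")

reference = Image.open("path/to/reference.png").convert("RGB")  # placeholder path
generator = torch.Generator(device="cuda").manual_seed(42)

image = pipe(
    prompt="The person in <img><|image_1|></img> riding a bicycle through a sunflower field.",
    input_images=[reference],  # PIL images are accepted directly after this change
    height=1024,
    width=1024,
    num_inference_steps=50,
    generator=generator,  # respected now that the hard-coded seed was removed
).images[0]
image.save("omnigen_subject_driven.png")
```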

src/diffusers/pipelines/omnigen/processor_omnigen.py

Lines changed: 2 additions & 1 deletion

```diff
@@ -57,7 +57,8 @@ def __init__(self,
         self.collator = OmniGenCollator()
 
     def process_image(self, image):
-        image = Image.open(image).convert('RGB')
+        if isinstance(image, str):
+            image = Image.open(image).convert('RGB')
         return self.image_transform(image)
 
     def process_multi_modal_prompt(self, text, input_images):
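```

For illustration, the same dispatch in isolation; the identity `image_transform` below is a stand-in for the processor's real transform and is not part of the patch:

```python
from PIL import Image


def process_image(image, image_transform=lambda img: img):
    # Accept either a path on disk or an already-loaded PIL image.
    if isinstance(image, str):
        image = Image.open(image).convert("RGB")
    return image_transform(image)


processed = process_image(Image.new("RGB", (64, 64), "white"))  # in-memory PIL image
# processed = process_image("path/to/reference.png")            # or a file path (placeholder)
```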
