Commit c78d1f4

offload_transformer
1 parent d9f80fc

File tree: 2 files changed, +270 -12 lines changed

docs/source/en/using-diffusers/omnigen.md

Lines changed: 268 additions & 10 deletions
@@ -15,6 +15,7 @@ OmniGen is an image generation model. Unlike existing text-to-image models, Omni

- Minimalist model architecture, consisting of only a VAE and a transformer module, for joint modeling of text and images.
- Support for multimodal inputs. It can process any text-image mixed data as instructions for image generation, rather than relying solely on text.

For more information, please refer to the [paper](https://arxiv.org/pdf/2409.11340).

This guide will walk you through using OmniGen for various tasks and use cases.

## Load model checkpoints
@@ -30,8 +31,6 @@ pipe = OmniGenPipeline.from_pretrained(

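Every example in this guide loads the checkpoint the same way; for reference, the complete snippet is:

```py
import torch
from diffusers import OmniGenPipeline

# Load the OmniGen checkpoint in bfloat16 to halve memory versus float32
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# The minimalist architecture described above: just a VAE plus a transformer.
# (Class names below are what this commit's file layout suggests; verify locally.)
print(type(pipe.vae).__name__)          # e.g. AutoencoderKL
print(type(pipe.transformer).__name__)  # e.g. OmniGenTransformer2DModel
```
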
## Text-to-image

@@ -41,30 +40,289 @@ You can try setting the `height` and `width` parameters to generate images with

```py
import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=3,
    generator=torch.Generator(device="cpu").manual_seed(111),
).images[0]
image
```

<div class="flex justify-center">
    <img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png" alt="generated image"/>
</div>

## Image edit

OmniGen supports multimodal inputs.
When the input includes an image, add the placeholder `<img><|image_1|></img>` to the text prompt to represent that image.
It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="<img><|image_1|></img> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(222)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">edited image</figcaption>
    </div>
</div>

OmniGen has some interesting features, such as the ability to infer user needs, as shown in the example below.

```py
prompt="If the woman is thirsty, what should she take? Find it in the image and highlight it in blue. <img><|image_1|></img>"
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(0)).images[0]
image
```

<div class="flex justify-center">
    <img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/reasoning.png" alt="generated image"/>
</div>

## Controllable generation

OmniGen can handle several classic computer vision tasks.
As shown below, OmniGen can detect human skeletons in input images, which can be used as control conditions to generate new images.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="Detect the skeleton of human in this image: <img><|image_1|></img>"
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image1 = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(333)).images[0]
image1

prompt="Generate a new photo using the following picture and text as conditions: <img><|image_1|></img>\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png")]
image2 = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(333)).images[0]
image2
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">detected skeleton</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal2img.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">skeleton to image</figcaption>
    </div>
</div>

OmniGen can also directly use relevant information from input images to generate new images.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="Following the pose of this image <img><|image_1|></img>, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(0)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/same_pose.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
    </div>
</div>

## ID and object preserving

OmniGen can generate new images based on the people and objects in an input image, and it supports inputting multiple images simultaneously.
Additionally, OmniGen can extract desired objects from an image containing multiple objects based on instructions.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="A man and a woman are sitting at a classroom desk. The man is the man with yellow hair in <img><|image_1|></img>. The woman is the woman on the left of <img><|image_2|></img>"
input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.jpg")
input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.jpg")
input_images=[input_image_1, input_image_2]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    generator=torch.Generator(device="cpu").manual_seed(666)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">input_image_1</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">input_image_2</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/id2.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
    </div>
</div>

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="A woman is walking down the street, wearing a white long-sleeve blouse with lace details on the sleeves, paired with a blue pleated skirt. The woman is <img><|image_1|></img>. The long-sleeve blouse and a pleated skirt are <img><|image_2|></img>."
input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg")
input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg")
input_images=[input_image_1, input_image_2]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    generator=torch.Generator(device="cpu").manual_seed(666)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">person image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">clothes image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/tryon.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
    </div>
</div>

## Optimization when inputting multiple images

For the text-to-image task, OmniGen requires little memory and time (9 GB of memory and 31s for a 1024x1024 image on an A800 GPU).
However, when input images are used, the computational cost increases.

Here are some guidelines to help you reduce computational costs when inputting multiple images. The experiments below were conducted on an A800 GPU with two input images.

### Inference speed

- `use_kv_cache=True`:
  `use_kv_cache` stores the key and value states of the input conditions, so attention can be computed without redundant computation. It defaults to `True`, and OmniGen offloads the KV cache to the CPU by default.
  - `use_kv_cache=False`: the inference time is 3m21s.
  - `use_kv_cache=True`: the inference time is 1m30s.

- `max_input_image_size`:
  the maximum size of an input image; larger inputs are cropped to this size.
  - `max_input_image_size=1024`: the inference time is 1m30s.
  - `max_input_image_size=512`: the inference time is 58s.

Both settings can be combined, as in the sketch below.

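A minimal sketch of combining both settings in a single call, assuming `pipe`, `prompt`, and `input_images` are defined as in the examples above and that both options are accepted as pipeline call arguments:

```py
# Sketch: reuse the KV cache and shrink input images to trade fidelity for speed.
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_kv_cache=True,          # cache key/value states of the input conditions
    max_input_image_size=512,   # crop input images to at most 512 pixels
    generator=torch.Generator(device="cpu").manual_seed(0),
).images[0]
```
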
### Memory

- `pipe.enable_model_cpu_offload()`:
  - Without CPU offloading, memory usage is `31 GB`.
  - With CPU offloading, memory usage is `28 GB`.

- `offload_transformer_block=True`:
  - Memory usage is `17 GB`.

- `pipe.enable_sequential_cpu_offload()`:
  - Memory usage is `11 GB`.

A sketch of applying each option follows this list.

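A minimal sketch of the three options, choosing one at a time. Treat the `offload_transformer_block` call argument as an assumption: it matches the flag this commit adds to the transformer's `forward`, but verify how your installed version exposes it:

```py
import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)

# Option 1 (~28 GB): keep sub-models on CPU, moving each to GPU only while it runs.
pipe.enable_model_cpu_offload()

# Option 2 (~17 GB, assumed call-time flag from this commit): offload
# individual transformer blocks during the forward pass.
# image = pipe(prompt="a photo of a cat", offload_transformer_block=True).images[0]

# Option 3 (~11 GB, slowest): offload at the leaf-module (parameter) level instead.
# pipe.enable_sequential_cpu_offload()
```
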

src/diffusers/models/transformers/transformer_omnigen.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -174,7 +174,7 @@ def forward(
             )
         else:
             if offload_transformer_block and not self.training:
-                if not not torch.cuda.is_available():
+                if not torch.cuda.is_available():
                     logger.warning_once(
                         "No available GPU detected, so `offload_transformer_block` is disabled"
                     )
```
```diff
@@ -363,7 +363,7 @@ def get_multimodal_embeddings(self,
         input_img_inx = 0
         if input_img_latents is not None:
             input_image_tokens = self.patch_embedding(input_img_latents,
-                                                     is_input_image=True)
+                                                      is_input_image=True)

         for b_inx in input_image_sizes.keys():
             for start_inx, end_inx in input_image_sizes[b_inx]:
```
