Commit c78d1f4

offload_transformer
1 parent d9f80fc

File tree: 2 files changed, +270 -12 lines changed

docs/source/en/using-diffusers/omnigen.md

Lines changed: 268 additions & 10 deletions
@@ -15,6 +15,7 @@ OmniGen is an image generation model. Unlike existing text-to-image models, Omni

- Minimalist model architecture, consisting of only a VAE and a transformer module, for joint modeling of text and images.
- Support for multimodal inputs. It can process any text-image mixed data as instructions for image generation, rather than relying solely on text.

For more information, please refer to the [paper](https://arxiv.org/pdf/2409.11340).

This guide will walk you through using OmniGen for various tasks and use cases.

## Load model checkpoints
@@ -30,8 +31,6 @@ pipe = OmniGenPipeline.from_pretrained(

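Every example in this guide loads the checkpoint the same way; for reference, the complete snippet is:

```py
import torch
from diffusers import OmniGenPipeline

# Load the OmniGen checkpoint in bfloat16 to halve memory versus float32
pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

# The minimalist architecture described above: just a VAE plus a transformer.
# (Class names below are what this commit's file layout suggests; verify locally.)
print(type(pipe.vae).__name__)          # e.g. AutoencoderKL
print(type(pipe.transformer).__name__)  # e.g. OmniGenTransformer2DModel
```
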
## Text-to-image

@@ -41,30 +40,289 @@ You can try setting the `height` and `width` parameters to generate images with

```py
import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt = "Realistic photo. A young woman sits on a sofa, holding a book and facing the camera. She wears delicate silver hoop earrings adorned with tiny, sparkling diamonds that catch the light, with her long chestnut hair cascading over her shoulders. Her eyes are focused and gentle, framed by long, dark lashes. She is dressed in a cozy cream sweater, which complements her warm, inviting smile. Behind her, there is a table with a cup of water in a sleek, minimalist blue mug. The background is a serene indoor setting with soft natural light filtering through a window, adorned with tasteful art and flowers, creating a cozy and peaceful ambiance. 4K, HD."
image = pipe(
    prompt=prompt,
    height=1024,
    width=1024,
    guidance_scale=3,
    generator=torch.Generator(device="cpu").manual_seed(111),
).images[0]
image
```

<div class="flex justify-center">
    <img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png" alt="generated image"/>
</div>

## Image edit

OmniGen supports multimodal inputs.
When the input includes an image, add the placeholder `<img><|image_1|></img>` to the text prompt to represent that image.
It is recommended to enable `use_input_image_size_as_output` to keep the edited image the same size as the original image.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="<img><|image_1|></img> Remove the woman's earrings. Replace the mug with a clear glass filled with sparkling iced cola."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(222)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/t2i_woman_with_book.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">edited image</figcaption>
    </div>
</div>

OmniGen has some interesting features, such as the ability to infer user needs, as shown in the example below.

```py
prompt="If the woman is thirsty, what should she take? Find it in the image and highlight it in blue. <img><|image_1|></img>"
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(0)).images[0]
image
```

<div class="flex justify-center">
    <img src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/reasoning.png" alt="generated image"/>
</div>

## Controllable generation

OmniGen can handle several classic computer vision tasks.
As shown below, OmniGen can detect human skeletons in input images, which can be used as control conditions to generate new images.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="Detect the skeleton of human in this image: <img><|image_1|></img>"
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image1 = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(333)).images[0]
image1

prompt="Generate a new photo using the following picture and text as conditions: <img><|image_1|></img>\n A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png")]
image2 = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(333)).images[0]
image2
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">detected skeleton</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/skeletal2img.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">skeleton to image</figcaption>
    </div>
</div>

OmniGen can also directly use relevant information from input images to generate new images.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="Following the pose of this image <img><|image_1|></img>, generate a new photo: A young boy is sitting on a sofa in the library, holding a book. His hair is neatly combed, and a faint smile plays on his lips, with a few freckles scattered across his cheeks. The library is quiet, with rows of shelves filled with books stretching out behind him."
input_images=[load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/edit.png")]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_input_image_size_as_output=True,
    generator=torch.Generator(device="cpu").manual_seed(0)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/same_pose.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
    </div>
</div>

## ID and object preserving

OmniGen can generate new images based on the people and objects in an input image, and it supports inputting multiple images simultaneously.
Additionally, OmniGen can extract desired objects from an image containing multiple objects based on instructions.

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="A man and a woman are sitting at a classroom desk. The man is the man with yellow hair in <img><|image_1|></img>. The woman is the woman on the left of <img><|image_2|></img>"
input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.jpg")
input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.jpg")
input_images=[input_image_1, input_image_2]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    generator=torch.Generator(device="cpu").manual_seed(666)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/3.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">input_image_1</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/4.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">input_image_2</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/id2.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
    </div>
</div>

```py
import torch
from diffusers import OmniGenPipeline
from diffusers.utils import load_image

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)
pipe.to("cuda")

prompt="A woman is walking down the street, wearing a white long-sleeve blouse with lace details on the sleeves, paired with a blue pleated skirt. The woman is <img><|image_1|></img>. The long-sleeve blouse and a pleated skirt are <img><|image_2|></img>."
input_image_1 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg")
input_image_2 = load_image("https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg")
input_images=[input_image_1, input_image_2]
image = pipe(
    prompt=prompt,
    input_images=input_images,
    height=1024,
    width=1024,
    guidance_scale=2.5,
    img_guidance_scale=1.6,
    generator=torch.Generator(device="cpu").manual_seed(666)).images[0]
image
```

<div class="flex flex-row gap-4">
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/emma.jpeg"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">person image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/dress.jpg"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">clothes image</figcaption>
    </div>
    <div class="flex-1">
        <img class="rounded-xl" src="https://raw.githubusercontent.com/VectorSpaceLab/OmniGen/main/imgs/docs_img/tryon.png"/>
        <figcaption class="mt-2 text-center text-sm text-gray-500">generated image</figcaption>
    </div>
</div>

## Optimization when inputting multiple images

For the text-to-image task, OmniGen requires little memory and time (9 GB of memory and 31s for a 1024x1024 image on an A800 GPU).
However, when input images are used, the computational cost increases.

Here are some guidelines to help you reduce computational costs when inputting multiple images. The experiments below were conducted on an A800 GPU with two input images.

### Inference speed

- `use_kv_cache=True`:
  `use_kv_cache` stores the key and value states of the input conditions, so attention can be computed without redundant computation. It defaults to `True`, and OmniGen offloads the KV cache to the CPU by default.
  - `use_kv_cache=False`: the inference time is 3m21s.
  - `use_kv_cache=True`: the inference time is 1m30s.

- `max_input_image_size`:
  the maximum size of an input image; larger inputs are cropped to this size.
  - `max_input_image_size=1024`: the inference time is 1m30s.
  - `max_input_image_size=512`: the inference time is 58s.

Both settings can be combined, as in the sketch below.

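A minimal sketch of combining both settings in a single call, assuming `pipe`, `prompt`, and `input_images` are defined as in the examples above and that both options are accepted as pipeline call arguments:

```py
# Sketch: reuse the KV cache and shrink input images to trade fidelity for speed.
image = pipe(
    prompt=prompt,
    input_images=input_images,
    guidance_scale=2,
    img_guidance_scale=1.6,
    use_kv_cache=True,          # cache key/value states of the input conditions
    max_input_image_size=512,   # crop input images to at most 512 pixels
    generator=torch.Generator(device="cpu").manual_seed(0),
).images[0]
```
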
### Memory

- `pipe.enable_model_cpu_offload()`:
  - Without CPU offloading, memory usage is `31 GB`.
  - With CPU offloading, memory usage is `28 GB`.

- `offload_transformer_block=True`:
  - Memory usage is `17 GB`.

- `pipe.enable_sequential_cpu_offload()`:
  - Memory usage is `11 GB`.

A sketch of applying each option follows this list.

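A minimal sketch of the three options, choosing one at a time. Treat the `offload_transformer_block` call argument as an assumption: it matches the flag this commit adds to the transformer's `forward`, but verify how your installed version exposes it:

```py
import torch
from diffusers import OmniGenPipeline

pipe = OmniGenPipeline.from_pretrained(
    "Shitao/OmniGen-v1-diffusers",
    torch_dtype=torch.bfloat16
)

# Option 1 (~28 GB): keep sub-models on CPU, moving each to GPU only while it runs.
pipe.enable_model_cpu_offload()

# Option 2 (~17 GB, assumed call-time flag from this commit): offload
# individual transformer blocks during the forward pass.
# image = pipe(prompt="a photo of a cat", offload_transformer_block=True).images[0]

# Option 3 (~11 GB, slowest): offload at the leaf-module (parameter) level instead.
# pipe.enable_sequential_cpu_offload()
```
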

src/diffusers/models/transformers/transformer_omnigen.py

Lines changed: 2 additions & 2 deletions
```diff
@@ -174,7 +174,7 @@ def forward(
             )
         else:
             if offload_transformer_block and not self.training:
-                if not not torch.cuda.is_available():
+                if not torch.cuda.is_available():
                     logger.warning_once(
                         "No available GPU detected, so `offload_transformer_block` is disabled"
                     )
```
```diff
@@ -363,7 +363,7 @@ def get_multimodal_embeddings(self,
         input_img_inx = 0
         if input_img_latents is not None:
             input_image_tokens = self.patch_embedding(input_img_latents,
-                                                     is_input_image=True)
+                                                      is_input_image=True)

         for b_inx in input_image_sizes.keys():
             for start_inx, end_inx in input_image_sizes[b_inx]:
```
