
Commit 25c4925: init

1 parent a72bc0c

25 files changed: +3932, -4409 lines

docs/source/en/_toctree.yml

Lines changed: 0 additions & 28 deletions
@@ -203,34 +203,6 @@
   - local: optimization/neuron
     title: AWS Neuron

-- title: Specific pipeline examples
-  isExpanded: false
-  sections:
-  - local: using-diffusers/consisid
-    title: ConsisID
-  - local: using-diffusers/sdxl
-    title: Stable Diffusion XL
-  - local: using-diffusers/sdxl_turbo
-    title: SDXL Turbo
-  - local: using-diffusers/kandinsky
-    title: Kandinsky
-  - local: using-diffusers/omnigen
-    title: OmniGen
-  - local: using-diffusers/pag
-    title: PAG
-  - local: using-diffusers/inference_with_lcm
-    title: Latent Consistency Model
-  - local: using-diffusers/shap-e
-    title: Shap-E
-  - local: using-diffusers/diffedit
-    title: DiffEdit
-  - local: using-diffusers/inference_with_tcd_lora
-    title: Trajectory Consistency Distillation-LoRA
-  - local: using-diffusers/svd
-    title: Stable Video Diffusion
-  - local: using-diffusers/marigold_usage
-    title: Marigold Computer Vision
-
 - title: Resources
   isExpanded: false
   sections:

docs/source/en/api/pipelines/consisid.md

Lines changed: 44 additions & 0 deletions
@@ -40,6 +40,44 @@ There are two official ConsisID checkpoints for identity-preserving text-to-vide
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |

## Load Model Checkpoints

Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the [`~DiffusionPipeline.from_pretrained`] method.

```python
# !pip install consisid_eva_clip insightface facexlib
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.consisid_utils import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download

# Download the checkpoints
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")

# Load the face helper models to preprocess the input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)

# Load the ConsisID base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```

## Identity-Preserving Text-to-Video

For identity-preserving text-to-video, pass a text prompt and an image containing a clear face (preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results.

```python
from diffusers.utils import export_to_video

prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
image = "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/consisid/consisid_input.png?download=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)

video = pipe(image=image, prompt=prompt, num_inference_steps=50, guidance_scale=6.0, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=face_kps, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```

### Memory optimization

ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint; a short sketch of how to enable them follows the table below. For replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.

@@ -52,6 +90,12 @@ ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of vi
| vae.enable_slicing | 16 GB | 22 GB |
| vae.enable_tiling | 5 GB | 7 GB |

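The `vae` rows above map directly to methods on the pipeline's VAE. As a minimal sketch, assuming the `pipe` object from the loading example above, they can be enabled like this:

```python
# Enable the VAE memory optimizations from the table on the loaded ConsisID pipeline.
pipe.vae.enable_slicing()  # decode the latent batch one slice at a time
pipe.vae.enable_tiling()   # decode each latent in smaller tiles to lower peak memory
```
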
## Resources

Learn more about ConsisID with the following resources.
- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440), for more details.

## ConsisIDPipeline

[[autodoc]] ConsisIDPipeline

docs/source/en/api/pipelines/diffedit.md

Lines changed: 256 additions & 0 deletions
@@ -25,6 +25,262 @@ The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https

This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️

The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated by the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, which takes two parameters, `source_prompt` and `target_prompt`, that determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then:

```py
source_prompt = "a bowl of fruits"
target_prompt = "a bowl of pears"
```

The partially inverted latents are generated by the [`~StableDiffusionDiffEditPipeline.invert`] function, and it is generally a good idea to include a `prompt` or *caption* describing the image to help guide the inverse latent sampling process. The caption can often be your `source_prompt`, but feel free to experiment with other text descriptions!

Let's load the pipeline, scheduler, and inverse scheduler, and enable some optimizations to reduce memory usage:

```py
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline

pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",
    torch_dtype=torch.float16,
    safety_checker=None,
    use_safetensors=True,
)
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
```

Load the image to edit:

```py
from diffusers.utils import load_image, make_image_grid

img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).resize((768, 768))
raw_image
```

Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image:

```py
from PIL import Image

source_prompt = "a bowl of fruits"
target_prompt = "a basket of pears"
mask_image = pipeline.generate_mask(
    image=raw_image,
    source_prompt=source_prompt,
    target_prompt=target_prompt,
)
Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
```

Next, create the inverted latents and pass the [`~StableDiffusionDiffEditPipeline.invert`] function a caption describing the image:

```py
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents
```

Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`:

```py
output_image = pipeline(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    negative_prompt=source_prompt,
).images[0]
mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L").resize((768, 768))
make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
```

<div class="flex gap-4">
  <div>
    <img class="rounded-xl" src="https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">original image</figcaption>
  </div>
  <div>
    <img class="rounded-xl" src="https://github.com/Xiang-cd/DiffEdit-stable-diffusion/blob/main/assets/target.png?raw=true"/>
    <figcaption class="mt-2 text-center text-sm text-gray-500">edited image</figcaption>
  </div>
</div>

## Generate source and target embeddings

The source and target embeddings can be automatically generated with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model instead of creating them manually.

Load the Flan-T5 model and tokenizer from the 🤗 Transformers library:

```py
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained("google/flan-t5-large", device_map="auto", torch_dtype=torch.float16)
```

Provide some initial text to prompt the model to generate the source and target prompts.

```py
source_concept = "bowl"
target_concept = "basket"

# Parenthesize the adjacent string literals so they concatenate into one prompt
source_text = (
    f"Provide a caption for images containing a {source_concept}. "
    "The captions should be in English and should be no longer than 150 characters."
)

target_text = (
    f"Provide a caption for images containing a {target_concept}. "
    "The captions should be in English and should be no longer than 150 characters."
)
```

Next, create a utility function to generate the prompts:

```py
@torch.no_grad()
def generate_prompts(input_prompt):
    input_ids = tokenizer(input_prompt, return_tensors="pt").input_ids.to("cuda")

    outputs = model.generate(
        input_ids, temperature=0.8, num_return_sequences=16, do_sample=True, max_new_tokens=128, top_k=10
    )
    return tokenizer.batch_decode(outputs, skip_special_tokens=True)

source_prompts = generate_prompts(source_text)
target_prompts = generate_prompts(target_text)
print(source_prompts)
print(target_prompts)
```

<Tip>

Check out the [generation strategy](https://huggingface.co/docs/transformers/main/en/generation_strategies) guide if you're interested in learning more about strategies for generating different quality text.

</Tip>

Load the text encoder model used by the [`StableDiffusionDiffEditPipeline`] to encode the text. You'll use the text encoder to compute the text embeddings:

```py
import torch
from diffusers import StableDiffusionDiffEditPipeline

pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True
)
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()

@torch.no_grad()
def embed_prompts(sentences, tokenizer, text_encoder, device="cuda"):
    embeddings = []
    for sent in sentences:
        text_inputs = tokenizer(
            sent,
            padding="max_length",
            max_length=tokenizer.model_max_length,
            truncation=True,
            return_tensors="pt",
        )
        text_input_ids = text_inputs.input_ids
        prompt_embeds = text_encoder(text_input_ids.to(device), attention_mask=None)[0]
        embeddings.append(prompt_embeds)
    return torch.concatenate(embeddings, dim=0).mean(dim=0).unsqueeze(0)

source_embeds = embed_prompts(source_prompts, pipeline.tokenizer, pipeline.text_encoder)
target_embeds = embed_prompts(target_prompts, pipeline.tokenizer, pipeline.text_encoder)
```

Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_mask`] and [`~StableDiffusionDiffEditPipeline.invert`] functions, and to the pipeline, to generate the image:

```diff
from diffusers import DDIMInverseScheduler, DDIMScheduler
from diffusers.utils import load_image, make_image_grid
from PIL import Image

pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)

img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).resize((768, 768))

mask_image = pipeline.generate_mask(
    image=raw_image,
-    source_prompt=source_prompt,
-    target_prompt=target_prompt,
+    source_prompt_embeds=source_embeds,
+    target_prompt_embeds=target_embeds,
)

inv_latents = pipeline.invert(
-    prompt=source_prompt,
+    prompt_embeds=source_embeds,
    image=raw_image,
).latents

output_image = pipeline(
    mask_image=mask_image,
    image_latents=inv_latents,
-    prompt=target_prompt,
-    negative_prompt=source_prompt,
+    prompt_embeds=target_embeds,
+    negative_prompt_embeds=source_embeds,
).images[0]
mask_image = Image.fromarray((mask_image.squeeze()*255).astype("uint8"), "L")
make_image_grid([raw_image, mask_image, output_image], rows=1, cols=3)
```

## Generate a caption for inversion

While you can use the `source_prompt` as a caption to help generate the partially inverted latents, you can also use the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model to automatically generate a caption.

Load the BLIP model and processor from the 🤗 Transformers library:

```py
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base", torch_dtype=torch.float16, low_cpu_mem_usage=True)
```

Create a utility function to generate a caption from the input image:

```py
@torch.no_grad()
def generate_caption(images, caption_generator, caption_processor):
    text = "a photograph of"

    inputs = caption_processor(images, text, return_tensors="pt").to(device="cuda", dtype=caption_generator.dtype)
    caption_generator.to("cuda")
    outputs = caption_generator.generate(**inputs, max_new_tokens=128)

    # offload caption generator
    caption_generator.to("cpu")

    caption = caption_processor.batch_decode(outputs, skip_special_tokens=True)[0]
    return caption
```

Load an input image and generate a caption for it using the `generate_caption` function:

```py
from diffusers.utils import load_image

img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).resize((768, 768))
caption = generate_caption(raw_image, model, processor)
```

<div class="flex justify-center">
  <figure>
    <img class="rounded-xl" src="https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"/>
    <figcaption class="text-center">generated caption: "a photograph of a bowl of fruit on a table"</figcaption>
  </figure>
</div>

Now you can drop the caption into the [`~StableDiffusionDiffEditPipeline.invert`] function to generate the partially inverted latents!

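For example, reusing the `pipeline` and `raw_image` objects from the earlier sections (a short illustrative snippet, not part of the original page):

```py
# Use the BLIP-generated caption to guide the inversion of the input image
inv_latents = pipeline.invert(prompt=caption, image=raw_image).latents
```
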
## Tips

* The pipeline can generate masks that can be fed into other inpainting pipelines, as shown in the sketch below.

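As a minimal sketch of that tip, assuming the `raw_image` and `mask_image` from the example above (the `AutoPipelineForInpainting` usage and the inpainting checkpoint here are illustrative, not prescribed by this guide):

```py
import torch
from PIL import Image
from diffusers import AutoPipelineForInpainting

# Convert the mask returned by generate_mask() into a PIL image, as shown earlier
mask_pil = Image.fromarray((mask_image.squeeze() * 255).astype("uint8"), "L").resize((768, 768))

# Load a separate inpainting pipeline (checkpoint chosen for illustration only)
inpaint = AutoPipelineForInpainting.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
)
inpaint.enable_model_cpu_offload()

# Repaint only the masked region with a new prompt
repainted = inpaint(prompt="a basket of pears", image=raw_image, mask_image=mask_pil).images[0]
```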
