Model weights may be stored in separate subfolders on the Hub or locally, in which case, you should use the [`~DiffusionPipeline.from_pretrained`] method.
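For example, a minimal loading sketch; the checkpoint id below is an assumption, and any Hub repository or local folder laid out with the standard component subfolders works the same way:

```python
import torch
from diffusers import ConsisIDPipeline

# Assumed checkpoint id; a local directory containing the usual component
# subfolders (transformer, text_encoder, vae, scheduler, ...) also works.
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```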
For identity-preserving text-to-video generation, pass a text prompt and an image containing a clear face (preferably half-body or full-body). By default, ConsisID generates a 720x480 video for the best results.
```python
from diffusers.utils import export_to_video

prompt = "The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured, timeless quality to the image, evoking a sense of nostalgia. Around him, the cityscape is filled with vintage buildings, cobblestone sidewalks, and softly blurred figures passing by, their outlines faint and indistinct. Streetlights cast a gentle glow, while shadows play across the boy's path, adding depth to the scene. The lighting highlights the boy's subtle smile, hinting at a fleeting moment of curiosity. The overall cinematic atmosphere, complete with classic film still aesthetics and dramatic contrasts, gives the scene an evocative and introspective feel."
```
ConsisID requires about 44 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint. For replication, you can refer to [this](https://gist.github.com/SHYuanBest/bc4207c36f454f9e969adbb50eaf8258) script.

| Memory optimization | Max memory allocated | Max memory reserved |
| :------------------ | :------------------- | :------------------ |
| vae.enable_slicing  | 16 GB                | 22 GB               |
| vae.enable_tiling   | 5 GB                 | 7 GB                |
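As a reference for how these options are enabled, a minimal sketch, assuming the `pipe` object created in the example above:

```python
# Offload submodules to the CPU when they are not in use.
pipe.enable_model_cpu_offload()

# Split VAE decoding into slices and tiles to reduce peak memory.
pipe.vae.enable_slicing()
pipe.vae.enable_tiling()
```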
## Resources
Learn more about ConsisID with the following resources.

- A [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrating ConsisID's main features.
- The research paper, [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://hf.co/papers/2411.17440), for more details.
`docs/source/en/api/pipelines/diffedit.md`
The original codebase can be found at [Xiang-cd/DiffEdit-stable-diffusion](https://github.com/Xiang-cd/DiffEdit-stable-diffusion).

This pipeline was contributed by [clarencechen](https://github.com/clarencechen). ❤️

The [`StableDiffusionDiffEditPipeline`] requires an image mask and a set of partially inverted latents. The image mask is generated from the [`~StableDiffusionDiffEditPipeline.generate_mask`] function, and includes two parameters, `source_prompt` and `target_prompt`. These parameters determine what to edit in the image. For example, if you want to change a bowl of *fruits* to a bowl of *pears*, then:
```py
source_prompt = "a bowl of fruits"
target_prompt = "a bowl of pears"
```
The partially inverted latents are generated from the [`~StableDiffusionDiffEditPipeline.invert`] function, and it is generally a good idea to include a `prompt` or *caption* describing the image to help guide the inverse latent sampling process. The caption can often be your `source_prompt`, but feel free to experiment with other text descriptions!

Let's load the pipeline, scheduler, inverse scheduler, and enable some optimizations to reduce memory usage:
```py
import torch
from diffusers import DDIMScheduler, DDIMInverseScheduler, StableDiffusionDiffEditPipeline
```
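The loading code itself is truncated in this excerpt; a minimal sketch of the step described above, with `stabilityai/stable-diffusion-2-1` as an assumed checkpoint:

```py
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1",  # assumed checkpoint; swap in the one you want to edit with
    torch_dtype=torch.float16,
    safety_checker=None,
    use_safetensors=True,
)
# DiffEdit uses a regular scheduler for sampling and an inverse scheduler for inversion.
pipeline.scheduler = DDIMScheduler.from_config(pipeline.scheduler.config)
pipeline.inverse_scheduler = DDIMInverseScheduler.from_config(pipeline.scheduler.config)

# Optional memory optimizations.
pipeline.enable_model_cpu_offload()
pipeline.enable_vae_slicing()
```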
Use the [`~StableDiffusionDiffEditPipeline.generate_mask`] function to generate the image mask. You'll need to pass it the `source_prompt` and `target_prompt` to specify what to edit in the image:
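A sketch of this step, reusing the `pipeline`, `source_prompt`, and `target_prompt` from above; the image URL and resize values are illustrative assumptions:

```py
from diffusers.utils import load_image

# Placeholder image of a bowl of fruits; replace with the image you want to edit.
img_url = "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
raw_image = load_image(img_url).resize((768, 768))

mask_image = pipeline.generate_mask(
    image=raw_image,
    source_prompt=source_prompt,
    target_prompt=target_prompt,
)
```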
Finally, pass the image mask and inverted latents to the pipeline. The `target_prompt` becomes the `prompt` now, and the `source_prompt` is used as the `negative_prompt`:
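The inversion call is also truncated in this excerpt, so the sketch below includes it; the argument values are illustrative:

```py
# Partially invert the image to latents, using the source prompt as the caption.
inv_latents = pipeline.invert(prompt=source_prompt, image=raw_image).latents

# target_prompt drives the edit inside the mask; source_prompt acts as the negative prompt.
output_image = pipeline(
    prompt=target_prompt,
    mask_image=mask_image,
    image_latents=inv_latents,
    negative_prompt=source_prompt,
).images[0]
output_image.save("edited.png")
```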
The source and target embeddings can be automatically generated with the [Flan-T5](https://huggingface.co/docs/transformers/model_doc/flan-t5) model instead of creating them manually.

Load the Flan-T5 model and tokenizer from the 🤗 Transformers library:
```py
import torch
from transformers import AutoTokenizer, T5ForConditionalGeneration
```
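The loading and caption-generation code is truncated here; a minimal sketch, with `google/flan-t5-large` as an assumed checkpoint and an illustrative prompt:

```py
tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-large")
model = T5ForConditionalGeneration.from_pretrained(
    "google/flan-t5-large", device_map="auto", torch_dtype=torch.float16
)

# Ask Flan-T5 for candidate source captions (the prompt wording is illustrative;
# repeat with the target concept to get target captions).
input_ids = tokenizer(
    "Provide a caption for an image containing a bowl of fruits.", return_tensors="pt"
).input_ids.to(model.device)
outputs = model.generate(input_ids, num_beams=4, num_return_sequences=4, max_new_tokens=30)
source_prompts = tokenizer.batch_decode(outputs, skip_special_tokens=True)
```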
<Tip>

Check out the [generation strategy](https://huggingface.co/docs/transformers/main/en/generation_strategies) guide if you're interested in learning more about strategies for generating different quality text.

</Tip>
Load the text encoder model used by the [`StableDiffusionDiffEditPipeline`] to encode the text. You'll use the text encoder to compute the text embeddings:
```py
import torch
from diffusers import StableDiffusionDiffEditPipeline
```
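The encoding code is truncated in this excerpt; a sketch of the idea, reusing the pipeline's own tokenizer and text encoder (the checkpoint id and the helper function are illustrative):

```py
pipeline = StableDiffusionDiffEditPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-1", torch_dtype=torch.float16, use_safetensors=True
).to("cuda")

@torch.no_grad()
def encode_prompts(prompts):
    # Tokenize and run the prompts through the pipeline's text encoder.
    input_ids = pipeline.tokenizer(
        prompts,
        padding="max_length",
        max_length=pipeline.tokenizer.model_max_length,
        truncation=True,
        return_tensors="pt",
    ).input_ids.to("cuda")
    return pipeline.text_encoder(input_ids)[0]

source_embeds = encode_prompts(["a bowl of fruits"])  # e.g. the Flan-T5 source captions from above
target_embeds = encode_prompts(["a bowl of pears"])   # e.g. the Flan-T5 target captions
```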
Finally, pass the embeddings to the [`~StableDiffusionDiffEditPipeline.generate_mask`] and [`~StableDiffusionDiffEditPipeline.invert`] functions, and to the pipeline, to generate the image:
```diff
from diffusers import DDIMInverseScheduler, DDIMScheduler
from diffusers.utils import load_image, make_image_grid
```
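The full diff is truncated above; the gist of the change is that the string-prompt arguments are replaced with their `*_embeds` counterparts, roughly as follows (parameter names should be double-checked against the pipeline's API reference):

```py
mask_image = pipeline.generate_mask(
    image=raw_image,
    source_prompt_embeds=source_embeds,
    target_prompt_embeds=target_embeds,
)

inv_latents = pipeline.invert(
    prompt_embeds=source_embeds,
    image=raw_image,
).latents

output_image = pipeline(
    mask_image=mask_image,
    image_latents=inv_latents,
    prompt_embeds=target_embeds,
    negative_prompt_embeds=source_embeds,
).images[0]
```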
While you can use the `source_prompt` as a caption to help generate the partially inverted latents, you can also use the [BLIP](https://huggingface.co/docs/transformers/model_doc/blip) model to automatically generate a caption.

Load the BLIP model and processor from the 🤗 Transformers library:
```py
import torch
from transformers import BlipForConditionalGeneration, BlipProcessor
```
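The rest of the example is truncated in this excerpt; a minimal sketch, with `Salesforce/blip-image-captioning-base` as an assumed checkpoint:

```py
from diffusers.utils import load_image

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base").to("cuda")

@torch.no_grad()
def generate_caption(image):
    # BLIP produces a free-form caption that can be reused as the `prompt` for inversion.
    inputs = processor(images=image, return_tensors="pt").to("cuda")
    output_ids = model.generate(**inputs, max_new_tokens=30)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0]

# Placeholder image; replace with the image you want to edit.
raw_image = load_image(
    "https://github.com/Xiang-cd/DiffEdit-stable-diffusion/raw/main/assets/origin.png"
).resize((768, 768))
caption = generate_caption(raw_image)
```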