
Commit 6503a17

add doc

1 parent 33f3acb

File tree

6 files changed: +349 −0 lines changed

docs/source/en/_toctree.yml

Lines changed: 6 additions & 0 deletions

```diff
@@ -79,6 +79,8 @@
 - sections:
   - local: using-diffusers/cogvideox
     title: CogVideoX
+  - local: using-diffusers/consisid
+    title: ConsisID
   - local: using-diffusers/sdxl
     title: Stable Diffusion XL
   - local: using-diffusers/sdxl_turbo
@@ -260,6 +262,8 @@
     title: AuraFlowTransformer2DModel
   - local: api/models/cogvideox_transformer3d
     title: CogVideoXTransformer3DModel
+  - local: api/models/consisid_transformer3d
+    title: ConsisIDTransformer3DModel
   - local: api/models/cogview3plus_transformer2d
     title: CogView3PlusTransformer2DModel
   - local: api/models/dit_transformer2d
@@ -350,6 +354,8 @@
     title: BLIP-Diffusion
   - local: api/pipelines/cogvideox
     title: CogVideoX
+  - local: api/pipelines/consisid
+    title: ConsisID
   - local: api/pipelines/cogview3
     title: CogView3
   - local: api/pipelines/consistency_models
```
Lines changed: 30 additions & 0 deletions
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License. -->

# ConsisIDTransformer3DModel

A Diffusion Transformer model for 3D data from [ConsisID](https://github.com/PKU-YuanGroup/ConsisID), introduced in [Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/pdf/2411.17440) by researchers from Peking University, the University of Rochester, and other institutions.

The model can be loaded with the following code snippet.

```python
import torch
from diffusers import ConsisIDTransformer3DModel

transformer = ConsisIDTransformer3DModel.from_pretrained("BestWishYsh/ConsisID-preview", subfolder="transformer", torch_dtype=torch.bfloat16).to("cuda")
```

## ConsisIDTransformer3DModel

[[autodoc]] ConsisIDTransformer3DModel

## Transformer2DModelOutput

[[autodoc]] models.modeling_outputs.Transformer2DModelOutput
Lines changed: 111 additions & 0 deletions
<!--Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
#     http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
-->

# ConsisID

[Identity-Preserving Text-to-Video Generation by Frequency Decomposition](https://arxiv.org/abs/2411.17440) is from Peking University, the University of Rochester, and other institutions, by Shenghai Yuan, Jinfa Huang, Xianyi He, Yunyang Ge, Yujun Shi, Liuhan Chen, Jiebo Luo, and Li Yuan.

The abstract from the paper is:

*Identity-preserving text-to-video (IPT2V) generation aims to create high-fidelity videos with consistent human identity. It is an important task in video generation but remains an open problem for generative models. This paper pushes the technical frontier of IPT2V in two directions that have not been resolved in the literature: (1) A tuning-free pipeline without tedious case-by-case finetuning, and (2) A frequency-aware heuristic identity-preserving Diffusion Transformer (DiT)-based control scheme. To achieve these goals, we propose **ConsisID**, a tuning-free DiT-based controllable IPT2V model to keep human-**id**entity **consis**tent in the generated video. Inspired by prior findings in frequency analysis of vision/diffusion transformers, it employs identity-control signals in the frequency domain, where facial features can be decomposed into low-frequency global features (e.g., profile, proportions) and high-frequency intrinsic features (e.g., identity markers that remain unaffected by pose changes). First, from a low-frequency perspective, we introduce a global facial extractor, which encodes the reference image and facial key points into a latent space, generating features enriched with low-frequency information. These features are then integrated into the shallow layers of the network to alleviate training challenges associated with DiT. Second, from a high-frequency perspective, we design a local facial extractor to capture high-frequency details and inject them into the transformer blocks, enhancing the model's ability to preserve fine-grained features. To leverage the frequency information for identity preservation, we propose a hierarchical training strategy, transforming a vanilla pre-trained video generation model into an IPT2V model. Extensive experiments demonstrate that our frequency-aware heuristic scheme provides an optimal control solution for DiT-based models. Thanks to this scheme, our **ConsisID** achieves excellent results in generating high-quality, identity-preserving videos, making strides towards more effective IPT2V. The model weights of ConsisID are publicly available at https://github.com/PKU-YuanGroup/ConsisID.*

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

This pipeline was contributed by [SHYuanBest](https://github.com/SHYuanBest). The original codebase can be found [here](https://github.com/PKU-YuanGroup/ConsisID). The original weights can be found under [hf.co/BestWishYsh](https://huggingface.co/BestWishYsh).

There are two official ConsisID checkpoints for identity-preserving text-to-video.

| checkpoints | recommended inference dtype |
|:---:|:---:|
| [`BestWishYsh/ConsisID-preview`](https://huggingface.co/BestWishYsh/ConsisID-preview) | torch.bfloat16 |
| [`BestWishYsh/ConsisID-1.5`](https://huggingface.co/BestWishYsh/ConsisID-1.5) | torch.bfloat16 |
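Both checkpoints load the same way. A minimal sketch, assuming the `BestWishYsh/ConsisID-1.5` repository follows the same layout as the preview release:

```python
import torch
from diffusers import ConsisIDPipeline

# Assumption: BestWishYsh/ConsisID-1.5 shares the repo layout of ConsisID-preview.
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-1.5", torch_dtype=torch.bfloat16)
```
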
## Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.util_consisid import prepare_face_models, process_face_embeddings_infer
from diffusers.utils import export_to_video
from huggingface_hub import snapshot_download

snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
```

Then change the memory layout of the pipeline's `transformer` component to `torch.channels_last`:

```python
pipe.transformer.to(memory_format=torch.channels_last)
```

Compile the components and run inference:

```python
pipe.transformer = torch.compile(pipe.transformer, mode="max-autotune", fullgraph=True)

# ConsisID works best with long, descriptive prompts and an input image that contains a clear face (preferably a half-body or full-body shot).
prompt = "A woman adorned with a delicate flower crown, is standing amidst a field of gently swaying wildflowers. Her eyes sparkle with a serene gaze, and a faint smile graces her lips, suggesting a moment of peaceful contentment. The shot is framed from the waist up, highlighting the gentle breeze lightly tousling her hair. The background reveals an expansive meadow under a bright blue sky, capturing the tranquility of a sunny afternoon."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/1.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
is_kps = getattr(pipe.transformer.config, "is_kps", False)
kps_cond = face_kps if is_kps else None

video = pipe(image=image, prompt=prompt, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=kps_cond, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```

### Memory optimization

ConsisID requires about 37 GB of GPU memory to decode 49 frames (6 seconds of video at 8 FPS) with an output resolution of 720x480 (W x H), which makes it impossible to run on consumer GPUs or the free-tier T4 Colab. The following memory optimizations can be used to reduce the memory footprint; a sketch showing how to enable them follows the list. For replication, you can refer to [this](https://gist.github.com/a-r-r-o-w/3959a03f15be5c9bd1fe545b09dfcc93) script.

- `pipe.enable_model_cpu_offload()`:
  - Without cpu offloading, memory usage is `33 GB`
  - With cpu offloading, memory usage is `19 GB`
- `pipe.enable_sequential_cpu_offload()`:
  - Similar to `enable_model_cpu_offload` but can significantly reduce memory usage at the cost of slower inference
  - When enabled, memory usage is under `4 GB`
- `pipe.vae.enable_tiling()`:
  - With cpu offloading and tiling enabled, memory usage is `11 GB`
- `pipe.vae.enable_slicing()`
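A minimal sketch of enabling these optimizations before calling the pipeline (use one of the two offloading strategies, not both):

```python
# Enable the optimizations above before running inference.
pipe.enable_model_cpu_offload()         # ~19 GB: offloads whole sub-models to CPU between uses
# pipe.enable_sequential_cpu_offload() # under 4 GB, but much slower inference
pipe.vae.enable_tiling()                # decode the latent video in tiles
pipe.vae.enable_slicing()               # decode the batch one slice at a time
```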

### Quantized inference

[torchao](https://github.com/pytorch/ao) and [optimum-quanto](https://github.com/huggingface/optimum-quanto/) can be used to quantize the text encoder, transformer, and VAE modules to lower the memory requirements. This makes it possible to run the model on a free-tier T4 Colab or on GPUs with less VRAM!

It is also worth noting that torchao quantization is fully compatible with [torch.compile](/optimization/torch2.0#torchcompile), which allows for much faster inference speed. Additionally, models can be serialized and stored in a quantized datatype to save disk space with torchao. Find examples and benchmarks in the gists below; a short sketch follows the list.

- [torchao](https://gist.github.com/a-r-r-o-w/4d9732d17412888c885480c6521a9897)
- [quanto](https://gist.github.com/a-r-r-o-w/31be62828b00a9292821b85c1017effa)

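As a rough sketch of the torchao approach (see the gists above for complete, benchmarked scripts), int8 weight-only quantization of the transformer might look like this:

```python
import torch
from diffusers import ConsisIDPipeline
from torchao.quantization import int8_weight_only, quantize_

pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)

# Quantize the transformer weights to int8, roughly halving their memory footprint.
quantize_(pipe.transformer, int8_weight_only())
pipe.to("cuda")
```
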
## ConsisIDPipeline

[[autodoc]] ConsisIDPipeline
  - all
  - __call__

## ConsisIDPipelineOutput

[[autodoc]] pipelines.consisid.pipeline_output.ConsisIDPipelineOutput
Lines changed: 100 additions & 0 deletions
<!--Copyright 2024 The HuggingFace Team. All rights reserved.

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with
the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on
an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the
specific language governing permissions and limitations under the License.
-->
# ConsisID

[ConsisID](https://github.com/PKU-YuanGroup/ConsisID) is an identity-preserving text-to-video generation model that keeps the face consistent in the generated video through frequency decomposition. This [video](https://www.youtube.com/watch?v=PhlgC-bI5SQ) demonstrates what it can do. It has the following features:

- 🔥 **Frequency Decomposition**: The characteristics of the DiT architecture are analyzed from a frequency-domain perspective, and a control-information injection method is designed based on these characteristics (see the conceptual sketch after this list).
- 🔥 **Consistency Training Strategy**: We propose a coarse-to-fine training strategy, a dynamic masking loss, and a dynamic cross-face loss, which further enhance the model's generalization ability and identity-preservation performance.
- 🔥 **Inference Without Fine-Tuning**: Previous methods required case-by-case fine-tuning on the input identity before inference, leading to significant time and computational costs. In contrast, ConsisID is tuning-free.

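To make the frequency-decomposition idea concrete, here is a purely conceptual sketch, not part of the ConsisID codebase: it splits a face image into low-frequency global structure and high-frequency detail on raw pixels, whereas the actual model decomposes learned features (`face.png` is a placeholder path).

```python
# Conceptual illustration only: ConsisID operates on learned features, but the
# low/high-frequency intuition is the same.
import torch
from torchvision.io import read_image
from torchvision.transforms.functional import gaussian_blur

face = read_image("face.png").float() / 255.0  # (C, H, W); hypothetical local file
low_freq = gaussian_blur(face, kernel_size=[21, 21], sigma=[5.0, 5.0])  # global structure: profile, proportions
high_freq = face - low_freq  # fine-grained identity detail: edges, texture
```
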
For more information, please refer to the [paper](https://arxiv.org/abs/2411.17440). This guide will walk you through using ConsisID for common use cases.

## Load Model Checkpoints

Model weights may be stored in separate subfolders on the Hub or locally, in which case you should use the [`~DiffusionPipeline.from_pretrained`] method.

```python
import torch
from diffusers import ConsisIDPipeline
from diffusers.pipelines.consisid.util_consisid import prepare_face_models, process_face_embeddings_infer
from huggingface_hub import snapshot_download

# Download ckpts
snapshot_download(repo_id="BestWishYsh/ConsisID-preview", local_dir="BestWishYsh/ConsisID-preview")

# Load face helper model to preprocess input face image
face_helper_1, face_helper_2, face_clip_model, face_main_model, eva_transform_mean, eva_transform_std = prepare_face_models("BestWishYsh/ConsisID-preview", device="cuda", dtype=torch.bfloat16)

# Load consisid base model
pipe = ConsisIDPipeline.from_pretrained("BestWishYsh/ConsisID-preview", torch_dtype=torch.bfloat16)
pipe.to("cuda")
```

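If GPU memory is tight, `pipe.to("cuda")` can be replaced with model CPU offloading, described in more detail in the ConsisIDPipeline docs:

```python
# Alternative to pipe.to("cuda") on memory-constrained GPUs: sub-models are
# moved to the GPU only while they are in use.
pipe.enable_model_cpu_offload()
```
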
## Identity-Preserving Text-to-Video

For identity-preserving text-to-video, pass a text prompt and an image containing a clear face (preferably a half-body or full-body shot). By default, ConsisID generates a 720x480 video for the best results.

```python
from diffusers.utils import export_to_video

prompt = "A woman adorned with a delicate flower crown, is standing amidst a field of gently swaying wildflowers. Her eyes sparkle with a serene gaze, and a faint smile graces her lips, suggesting a moment of peaceful contentment. The shot is framed from the waist up, highlighting the gentle breeze lightly tousling her hair. The background reveals an expansive meadow under a bright blue sky, capturing the tranquility of a sunny afternoon."
image = "https://github.com/PKU-YuanGroup/ConsisID/blob/main/asserts/example_images/1.png?raw=true"

id_cond, id_vit_hidden, image, face_kps = process_face_embeddings_infer(face_helper_1, face_clip_model, face_helper_2, eva_transform_mean, eva_transform_std, face_main_model, "cuda", torch.bfloat16, image, is_align_face=True)
is_kps = getattr(pipe.transformer.config, "is_kps", False)
kps_cond = face_kps if is_kps else None

video = pipe(image=image, prompt=prompt, use_dynamic_cfg=False, id_vit_hidden=id_vit_hidden, id_cond=id_cond, kps_cond=kps_cond, generator=torch.Generator("cuda").manual_seed(42))
export_to_video(video.frames[0], "output.mp4", fps=8)
```

<table>
  <tr>
    <th style="text-align: center;">Face Image</th>
    <th style="text-align: center;">Video</th>
    <th style="text-align: center;">Description</th>
  </tr>
  <tr>
    <td><img src="https://github.com/user-attachments/assets/be0257b5-9d90-47ba-93f4-5faf78fd1859" style="height: auto; width: 600px;"></td>
    <td><img src="https://github.com/user-attachments/assets/f0e2803c-7214-4463-afd8-b28c0cd87c64" style="height: auto; width: 2000px;"></td>
    <td>The video features a woman in exquisite hybrid armor adorned with iridescent gemstones, standing amidst gently falling cherry blossoms. Her piercing yet serene gaze hints at quiet determination, as a breeze catches a loose strand of her hair ......</td>
  </tr>
  <tr>
    <td><img src="https://github.com/user-attachments/assets/c1418804-3e5b-4f8b-87f1-25d4ddeee99e" style="height: auto; width: 600px;"></td>
    <td><img src="https://github.com/user-attachments/assets/3491e75c-e01a-41d3-ae01-0c2535b7fa81" style="height: auto; width: 2000px;"></td>
    <td>The video features a baby wearing a bright superhero cape, standing confidently with arms raised in a powerful pose. The baby has a determined look on their face, with eyes wide and lips pursed in concentration, as if ready to take on a challenge ......</td>
  </tr>
  <tr>
    <td><img src="https://github.com/user-attachments/assets/2c4ea113-47cd-4295-b643-a10e2a566823" style="height: auto; width: 600px;"></td>
    <td><img src="https://github.com/user-attachments/assets/2ffb154f-23dc-4314-9976-95c0bd16810b" style="height: auto; width: 2000px;"></td>
    <td>The video captures a boy walking along a city street, filmed in black and white on a classic 35mm camera. His expression is thoughtful, his brow slightly furrowed as if he's lost in contemplation. The film grain adds a textured ......</td>
  </tr>
  <tr>
    <td><img src="https://github.com/user-attachments/assets/d48cb0be-0a64-40fa-8f86-ac406548d592" style="height: auto; width: 600px;"></td>
    <td><img src="https://github.com/user-attachments/assets/9eb298a3-4c2a-407e-b73b-32f88895df22" style="height: auto; width: 2000px;"></td>
    <td>The video features a man standing at an easel, focused intently as his brush dances across the canvas. His expression is one of deep concentration, with a hint of satisfaction as each brushstroke adds color and form ......</td>
  </tr>
</table>

## Citation

If you find ConsisID useful in your research, please consider giving it a star and a citation.

```BibTeX
@article{yuan2024identity,
  title={Identity-Preserving Text-to-Video Generation by Frequency Decomposition},
  author={Yuan, Shenghai and Huang, Jinfa and He, Xianyi and Ge, Yunyang and Shi, Yujun and Chen, Liuhan and Luo, Jiebo and Yuan, Li},
  journal={arXiv preprint arXiv:2411.17440},
  year={2024}
}
```

docs/source/zh/_toctree.yml

Lines changed: 2 additions & 0 deletions

```diff
@@ -5,6 +5,8 @@
     title: 快速入门
   - local: stable_diffusion
     title: 有效和高效的扩散
+  - local: consisid
+    title: 身份保持的文本到视频生成
   - local: installation
     title: 安装
   title: 开始
```

(The added entry's title, 身份保持的文本到视频生成, is Chinese for "identity-preserving text-to-video generation"; the surrounding entries are "Quickstart", "Effective and efficient diffusion", "Installation", and "Get started".)
