
Commit 78b269a

Merge branch 'main' into sd3-xformers

2 parents: 155a846 + fe79489

57 files changed: +375 −200 lines

.github/ISSUE_TEMPLATE/bug-report.yml

Lines changed: 1 addition & 1 deletion

@@ -73,7 +73,7 @@ body:
       - ControlNet @sayakpaul @yiyixuxu @DN6
       - T2I Adapter @sayakpaul @yiyixuxu @DN6
       - IF @DN6
-      - Text-to-Video / Video-to-Video @DN6 @sayakpaul
+      - Text-to-Video / Video-to-Video @DN6 @a-r-r-o-w
       - Wuerstchen @DN6
       - Other: @yiyixuxu @DN6
       - Improving generation quality: @asomoza

.github/PULL_REQUEST_TEMPLATE.md

Lines changed: 1 addition & 0 deletions

@@ -49,6 +49,7 @@ Core library:
 Integrations:

 - deepspeed: HF Trainer/Accelerate: @SunMarc
+- PEFT: @sayakpaul @BenjaminBossan

 HF projects:
Lines changed: 39 additions & 0 deletions

@@ -0,0 +1,39 @@
+name: SSH into PR runners
+
+on:
+  workflow_dispatch:
+    inputs:
+      docker_image:
+        description: 'Name of the Docker image'
+        required: true
+
+env:
+  IS_GITHUB_CI: "1"
+  HF_HUB_READ_TOKEN: ${{ secrets.HF_HUB_READ_TOKEN }}
+  HF_HOME: /mnt/cache
+  DIFFUSERS_IS_CI: yes
+  OMP_NUM_THREADS: 8
+  MKL_NUM_THREADS: 8
+  RUN_SLOW: yes
+
+jobs:
+  ssh_runner:
+    name: "SSH"
+    runs-on: [self-hosted, intel-cpu, 32-cpu, 256-ram, ci]
+    container:
+      image: ${{ github.event.inputs.docker_image }}
+      options: --shm-size "16gb" --ipc host -v /mnt/cache/.cache/huggingface/diffusers:/mnt/cache/ --privileged
+
+    steps:
+      - name: Checkout diffusers
+        uses: actions/checkout@v3
+        with:
+          fetch-depth: 2
+
+      - name: Tailscale  # In order to be able to SSH when a test fails
+        uses: huggingface/tailscale-action@main
+        with:
+          authkey: ${{ secrets.TAILSCALE_SSH_AUTHKEY }}
+          slackChannel: ${{ secrets.SLACK_CIFEEDBACK_CHANNEL }}
+          slackToken: ${{ secrets.SLACK_CIFEEDBACK_BOT_TOKEN }}
+          waitForSSH: true
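This new workflow is started manually through `workflow_dispatch`, with the Docker image as its one required input. As an illustrative sketch only (not part of this commit), it could also be dispatched programmatically through GitHub's REST API; the repository, workflow file name, and image name below are assumptions, since none of them are shown in this diff view.

```python
# Sketch: trigger the "SSH into PR runners" workflow via GitHub's
# workflow_dispatch REST endpoint. Names marked as assumptions below
# are illustrative, not taken from this commit.
import os
import requests

REPO = "huggingface/diffusers"       # assumption: target repository
WORKFLOW_FILE = "ssh-pr-runner.yml"  # assumption: the new workflow's file name

resp = requests.post(
    f"https://api.github.com/repos/{REPO}/actions/workflows/{WORKFLOW_FILE}/dispatches",
    headers={
        "Authorization": f"Bearer {os.environ['GITHUB_TOKEN']}",
        "Accept": "application/vnd.github+json",
    },
    json={
        "ref": "main",
        # `docker_image` is the required input declared by the workflow
        "inputs": {"docker_image": "diffusers/diffusers-pytorch-cpu"},  # example value
    },
    timeout=30,
)
resp.raise_for_status()  # GitHub returns 204 No Content on success
```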

.github/workflows/ssh-runner.yml

Lines changed: 1 addition & 1 deletion

@@ -1,4 +1,4 @@
-name: SSH into runners
+name: SSH into GPU runners

 on:
   workflow_dispatch:

docker/diffusers-pytorch-compile-cuda/Dockerfile

Lines changed: 7 additions & 7 deletions

@@ -16,24 +16,24 @@ RUN apt install -y bash \
     ca-certificates \
     libsndfile1-dev \
     libgl1 \
-    python3.9 \
-    python3.9-dev \
+    python3.10 \
+    python3.10-dev \
     python3-pip \
-    python3.9-venv && \
+    python3.10-venv && \
     rm -rf /var/lib/apt/lists

 # make sure to use venv
-RUN python3.9 -m venv /opt/venv
+RUN python3.10 -m venv /opt/venv
 ENV PATH="/opt/venv/bin:$PATH"

 # pre-install the heavy dependencies (these can later be overridden by the deps from setup.py)
-RUN python3.9 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
-    python3.9 -m uv pip install --no-cache-dir \
+RUN python3.10 -m pip install --no-cache-dir --upgrade pip uv==0.1.11 && \
+    python3.10 -m uv pip install --no-cache-dir \
     torch \
     torchvision \
     torchaudio \
     invisible_watermark && \
-    python3.9 -m pip install --no-cache-dir \
+    python3.10 -m pip install --no-cache-dir \
     accelerate \
     datasets \
     hf-doc-builder \

docker/diffusers-pytorch-cpu/Dockerfile

Lines changed: 1 addition & 0 deletions

@@ -16,6 +16,7 @@ RUN apt install -y bash \
     ca-certificates \
     libsndfile1-dev \
     python3.10 \
+    python3.10-dev \
     python3-pip \
     libgl1 \
     python3.10-venv && \

docker/diffusers-pytorch-cuda/Dockerfile

Lines changed: 1 addition & 0 deletions

@@ -17,6 +17,7 @@ RUN apt install -y bash \
     libsndfile1-dev \
     libgl1 \
     python3.10 \
+    python3.10-dev \
     python3-pip \
     python3.10-venv && \
     rm -rf /var/lib/apt/lists

docker/diffusers-pytorch-xformers-cuda/Dockerfile

Lines changed: 1 addition & 0 deletions

@@ -17,6 +17,7 @@ RUN apt install -y bash \
     libsndfile1-dev \
     libgl1 \
     python3.10 \
+    python3.10-dev \
     python3-pip \
     python3.10-venv && \
     rm -rf /var/lib/apt/lists

docs/source/en/_toctree.yml

Lines changed: 2 additions & 0 deletions

@@ -332,6 +332,8 @@
     title: Latent Consistency Models
   - local: api/pipelines/latent_diffusion
     title: Latent Diffusion
+  - local: api/pipelines/latte
+    title: Latte
   - local: api/pipelines/ledits_pp
     title: LEDITS++
   - local: api/pipelines/lumina
docs/source/en/api/pipelines/latte.md

Lines changed: 75 additions & 0 deletions

@@ -0,0 +1,75 @@
<!-- # Copyright 2024 The HuggingFace Team. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License. -->

# Latte

![latte text-to-video](https://github.com/Vchitect/Latte/blob/52bc0029899babbd6e9250384c83d8ed2670ff7a/visuals/latte.gif?raw=true)

[Latte: Latent Diffusion Transformer for Video Generation](https://arxiv.org/abs/2401.03048) from Monash University, Shanghai AI Lab, Nanjing University, and Nanyang Technological University.

The abstract from the paper is:

*We propose a novel Latent Diffusion Transformer, namely Latte, for video generation. Latte first extracts spatio-temporal tokens from input videos and then adopts a series of Transformer blocks to model video distribution in the latent space. In order to model a substantial number of tokens extracted from videos, four efficient variants are introduced from the perspective of decomposing the spatial and temporal dimensions of input videos. To improve the quality of generated videos, we determine the best practices of Latte through rigorous experimental analysis, including video clip patch embedding, model variants, timestep-class information injection, temporal positional embedding, and learning strategies. Our comprehensive evaluation demonstrates that Latte achieves state-of-the-art performance across four standard video generation datasets, i.e., FaceForensics, SkyTimelapse, UCF101, and Taichi-HD. In addition, we extend Latte to text-to-video generation (T2V) task, where Latte achieves comparable results compared to recent T2V models. We strongly believe that Latte provides valuable insights for future research on incorporating Transformers into diffusion models for video generation.*

**Highlights**: Latte is a latent diffusion transformer proposed as a backbone for modeling different modalities (trained for text-to-video generation here). It achieves state-of-the-art performance across four standard video benchmarks: [FaceForensics](https://arxiv.org/abs/1803.09179), [SkyTimelapse](https://arxiv.org/abs/1709.07592), [UCF101](https://arxiv.org/abs/1212.0402), and [Taichi-HD](https://arxiv.org/abs/2003.00196). To prepare and download the datasets for evaluation, please refer to [these instructions](https://github.com/Vchitect/Latte/blob/main/docs/datasets_evaluation.md).

<Tip>

Make sure to check out the Schedulers [guide](../../using-diffusers/schedulers.md) to learn how to explore the tradeoff between scheduler speed and quality, and see the [reuse components across pipelines](../../using-diffusers/loading.md#reuse-a-pipeline) section to learn how to efficiently load the same components into multiple pipelines.

</Tip>

### Inference

Use [`torch.compile`](https://huggingface.co/docs/diffusers/main/en/tutorials/fast_diffusion#torchcompile) to reduce the inference latency.

First, load the pipeline:

```python
import torch
from diffusers import LattePipeline

pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1", torch_dtype=torch.float16
).to("cuda")
```
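If GPU memory is tight, the standard diffusers offloading API can be used when loading the pipeline instead of moving everything to CUDA up front. A minimal sketch, assuming the generic `DiffusionPipeline.enable_model_cpu_offload` behavior (requires `accelerate`), nothing Latte-specific:

```python
import torch
from diffusers import LattePipeline

# Sketch: load in fp16 and let accelerate move each submodule to the GPU
# only while it is needed, trading some speed for lower VRAM use.
pipeline = LattePipeline.from_pretrained(
    "maxin-cn/Latte-1", torch_dtype=torch.float16
)
pipeline.enable_model_cpu_offload()
```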
Then change the memory layout of the pipeline's `transformer` and `vae` components to `torch.channels_last`:

```python
pipeline.transformer.to(memory_format=torch.channels_last)
pipeline.vae.to(memory_format=torch.channels_last)
```
Finally, compile the components and run inference:

```python
pipeline.transformer = torch.compile(pipeline.transformer)
pipeline.vae.decode = torch.compile(pipeline.vae.decode)

video = pipeline(prompt="A dog wearing sunglasses floating in space, surreal, nebulae in background").frames[0]
```
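The call above returns a list of PIL frames. To write them to disk, the export helper in `diffusers.utils` can be used (a short sketch; the output file name is arbitrary):

```python
from diffusers.utils import export_to_gif

# `video` is the list of PIL frames returned by the pipeline call above
export_to_gif(video, "latte.gif")
```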
The [benchmark](https://gist.github.com/a-r-r-o-w/4e1694ca46374793c0361d740a99ff19) results on an 80GB A100 machine are:

```
Without torch.compile(): Average inference time: 16.246 seconds.
With torch.compile(): Average inference time: 14.573 seconds.
```
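The linked gist contains the full benchmark script; a minimal version of such a timing loop might look like the sketch below. This is illustrative only (it assumes a CUDA-resident pipeline and is not the gist's exact code):

```python
import time
import torch

def average_inference_time(pipeline, prompt, n_runs=5):
    # Warmup run so torch.compile's one-time compilation cost is excluded
    pipeline(prompt=prompt)
    torch.cuda.synchronize()
    start = time.perf_counter()
    for _ in range(n_runs):
        pipeline(prompt=prompt)
    # Wait for all queued GPU work to finish before stopping the clock
    torch.cuda.synchronize()
    return (time.perf_counter() - start) / n_runs
```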
## LattePipeline

[[autodoc]] LattePipeline
  - all
  - __call__
