
Commit 081e68f

[hybrid inference 🍯🐝] Add VAE encode
1 parent 26149c0 commit 081e68f

5 files changed (+662, -22 lines)


docs/source/en/hybrid_inference/overview.md

Lines changed: 3 additions & 2 deletions
```diff
@@ -36,7 +36,7 @@ Hybrid Inference offers a fast and simple way to offload local generation requir
 ## Available Models
 
 * **VAE Decode 🖼️:** Quickly decode latent representations into high-quality images without compromising performance or workflow speed.
-* **VAE Encode 🔢 (coming soon):** Efficiently encode images into latent representations for generation and training.
+* **VAE Encode 🔢:** Efficiently encode images into latent representations for generation and training.
 * **Text Encoders 📃 (coming soon):** Compute text embeddings for your prompts quickly and accurately, ensuring a smooth and high-quality workflow.
 
 ---
@@ -48,7 +48,8 @@ Hybrid Inference offers a fast and simple way to offload local generation requir
 
 ## Contents
 
-The documentation is organized into two sections:
+The documentation is organized into three sections:
 
 * **VAE Decode** Learn the basics of how to use VAE Decode with Hybrid Inference.
+* **VAE Encode** Learn the basics of how to use VAE Encode with Hybrid Inference.
 * **API Reference** Dive into task-specific settings and parameters.
```
Lines changed: 322 additions & 0 deletions
@@ -0,0 +1,322 @@
# Getting Started: VAE Encode with Hybrid Inference

VAE encode is used for training, image-to-image, and image-to-video - turning images or videos into latent representations.

## Memory

These tables demonstrate the VRAM requirements for VAE encode with SD v1.5 and SDXL on different GPUs.

For the majority of these GPUs, the memory usage % dictates that other models (text encoders, UNet/Transformer) must be offloaded, or that tiled encoding has to be used, which increases the time taken and impacts quality.
<details><summary>SD v1.5</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
| --- | --- | --- | --- | --- | --- |
TODO

</details>

<details><summary>SDXL</summary>

| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
| --- | --- | --- | --- | --- | --- |
TODO

</details>
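For reference, the tiled path in these tables can be reproduced locally with the standard `AutoencoderKL` tiling API. The snippet below is a minimal sketch, assuming the SD v1.5 VAE (`stabilityai/sd-vae-ft-mse`) and a random image-shaped tensor as a stand-in input; it is the local fallback being compared against, not part of the Hybrid Inference API.

```python
import torch
from diffusers import AutoencoderKL

# Minimal sketch: local tiled VAE encode (the fallback the tables above compare against).
vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")
vae.enable_tiling()  # lowers peak VRAM at the cost of speed and some quality

# Stand-in for a preprocessed image batch in [-1, 1], shape [B, C, H, W].
image = torch.randn([1, 3, 1024, 1024], dtype=torch.float16, device="cuda")

with torch.no_grad():
    latent = vae.encode(image).latent_dist.sample() * vae.config.scaling_factor
print(latent.shape)  # [1, 4, 128, 128] for a 1024x1024 input
```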
## Available VAEs

| | **Endpoint** | **Model** |
|:-:|:-----------:|:--------:|
| **Stable Diffusion v1** | [https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud](https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) |
| **Stable Diffusion XL** | [https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud](https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) |
| **Flux** | [https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud](https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) |

> [!TIP]
> Model support can be requested [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml).

## Code

> [!TIP]
> Install `diffusers` from `main` to run the code: `pip install git+https://github.com/huggingface/diffusers@main`
Helper methods simplify interacting with Hybrid Inference. The examples below also use `torch` and `remote_decode`, so they are imported here once.

```python
import torch

from diffusers.utils.remote_utils import remote_decode, remote_encode
```
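Since this guide is about VAE encode, here is a minimal sketch of the encode direction. It assumes `remote_encode` mirrors `remote_decode`'s parameters (an `endpoint` plus the model's `scaling_factor`/`shift_factor`) and takes an `image` input, and that the SD v1 endpoint listed above serves encode requests; treat these as assumptions rather than the confirmed API.

```python
from PIL import Image

# Sketch only: a blank image stands in for a real input image.
init_image = Image.new("RGB", (512, 512), color=(255, 255, 255))

latent = remote_encode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",  # SD v1 endpoint from the table above (encode support assumed)
    image=init_image,
    scaling_factor=0.18215,  # SD v1.5 VAE scaling factor, matching the decode examples below
)
print(latent.shape)  # expected: [1, 4, 64, 64] for a 512x512 input
```

The returned latent can then be fed to a pipeline or decoded back with `remote_decode`; the rest of this page walks through the decode direction.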
### Basic example

Here, we show how to use the remote VAE on random tensors.

<details><summary>Code</summary>

```python
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
    scaling_factor=0.18215,
)
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/output.png"/>
</figure>
Usage for Flux is slightly different. Flux latents are packed so we need to send the `height` and `width`.

<details><summary>Code</summary>

```python
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
    height=1024,
    width=1024,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/flux_random_latent.png"/>
</figure>
Finally, an example for HunyuanVideo.

<details><summary>Code</summary>

```python
video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
    output_type="mp4",
)
with open("video.mp4", "wb") as f:
    f.write(video)
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
  <video
    alt="video_1.mp4"
    autoplay loop autobuffer muted playsinline
  >
    <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/video_1.mp4" type="video/mp4">
  </video>
</figure>
### Generation

But we want to use the VAE in an actual pipeline to get an actual image, not random noise. The example below shows how to do it with SD v1.5.

<details><summary>Code</summary>

```python
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
    vae=None,
).to("cuda")

prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"

latent = pipe(
    prompt=prompt,
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.18215,
)
image.save("test.jpg")
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/test.jpg"/>
</figure>
Here’s another example with Flux.

<details><summary>Code</summary>

```python
from diffusers import FluxPipeline

pipe = FluxPipeline.from_pretrained(
    "black-forest-labs/FLUX.1-schnell",
    torch_dtype=torch.bfloat16,
    vae=None,
).to("cuda")

prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"

latent = pipe(
    prompt=prompt,
    guidance_scale=0.0,
    num_inference_steps=4,
    output_type="latent",
).images
image = remote_decode(
    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    height=1024,
    width=1024,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
image.save("test.jpg")
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/test_1.jpg"/>
</figure>
Here’s an example with HunyuanVideo.

<details><summary>Code</summary>

```python
from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel

model_id = "hunyuanvideo-community/HunyuanVideo"
transformer = HunyuanVideoTransformer3DModel.from_pretrained(
    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
)
pipe = HunyuanVideoPipeline.from_pretrained(
    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16
).to("cuda")

latent = pipe(
    prompt="A cat walks on the grass, realistic",
    height=320,
    width=512,
    num_frames=61,
    num_inference_steps=30,
    output_type="latent",
).frames

video = remote_decode(
    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    output_type="mp4",
)

if isinstance(video, bytes):
    with open("video.mp4", "wb") as f:
        f.write(video)
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
  <video
    alt="video.mp4"
    autoplay loop autobuffer muted playsinline
  >
    <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/video.mp4" type="video/mp4">
  </video>
</figure>
### Queueing

One of the great benefits of using a remote VAE is that we can queue multiple generation requests. While the current latent is being processed for decoding, we can already queue another one. This helps improve concurrency.

<details><summary>Code</summary>

```python
import queue
import threading
from IPython.display import display
from diffusers import StableDiffusionPipeline

def decode_worker(q: queue.Queue):
    while True:
        item = q.get()
        if item is None:
            break
        image = remote_decode(
            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
            tensor=item,
            scaling_factor=0.18215,
        )
        display(image)
        q.task_done()

q = queue.Queue()
thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
thread.start()

def decode(latent: torch.Tensor):
    q.put(latent)

prompts = [
    "Blueberry ice cream, in a stylish modern glass , ice cubes, nuts, mint leaves, splashing milk cream, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious",
    "Lemonade in a glass, mint leaves, in an aqua and white background, flowers, ice cubes, halo, fluid motion, dynamic movement, soft lighting, digital painting, rule of thirds composition, Art by Greg rutkowski, Coby whitmore",
    "Comic book art, beautiful, vintage, pastel neon colors, extremely detailed pupils, delicate features, light on face, slight smile, Artgerm, Mary Blair, Edmund Dulac, long dark locks, bangs, glowing, fashionable style, fairytale ambience, hot pink.",
    "Masterpiece, vanilla cone ice cream garnished with chocolate syrup, crushed nuts, choco flakes, in a brown background, gold, cinematic lighting, Art by WLOP",
    "A bowl of milk, falling cornflakes, berries, blueberries, in a white background, soft lighting, intricate details, rule of thirds, octane render, volumetric lighting",
    "Cold Coffee with cream, crushed almonds, in a glass, choco flakes, ice cubes, wet, in a wooden background, cinematic lighting, hyper realistic painting, art by Carne Griffiths, octane render, volumetric lighting, fluid motion, dynamic movement, muted colors,",
]

pipe = StableDiffusionPipeline.from_pretrained(
    "Lykon/dreamshaper-8",
    torch_dtype=torch.float16,
    vae=None,
).to("cuda")

pipe.unet = pipe.unet.to(memory_format=torch.channels_last)
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

_ = pipe(
    prompt=prompts[0],
    output_type="latent",
)

for prompt in prompts:
    latent = pipe(
        prompt=prompt,
        output_type="latent",
    ).images
    decode(latent)

q.put(None)
thread.join()
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
  <video
    alt="queue.mp4"
    autoplay loop autobuffer muted playsinline
  >
    <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/queue.mp4" type="video/mp4">
  </video>
</figure>
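The same queueing pattern can be applied in the encode direction mentioned at the start of this guide, for example to pre-compute latents for training. The sketch below reuses the worker structure above, again assuming that `remote_encode` accepts `image`/`endpoint`/`scaling_factor` arguments and that the SD v1 endpoint serves encode requests.

```python
import queue
import threading

from PIL import Image

from diffusers.utils.remote_utils import remote_encode

latents = []

def encode_worker(q: queue.Queue):
    # Encode images remotely while the main thread keeps queueing more work.
    while True:
        item = q.get()
        if item is None:
            break
        latent = remote_encode(
            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",  # encode support assumed
            image=item,
            scaling_factor=0.18215,
        )
        latents.append(latent)
        q.task_done()

q = queue.Queue()
thread = threading.Thread(target=encode_worker, args=(q,), daemon=True)
thread.start()

# Stand-in images; in practice these would come from a dataset.
for _ in range(4):
    q.put(Image.new("RGB", (512, 512), color=(128, 128, 128)))

q.put(None)
thread.join()
print(f"encoded {len(latents)} latents")
```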
## Integrations

* **[SD.Next](https://github.com/vladmandic/sdnext):** All-in-one UI with built-in support for Hybrid Inference.
* **[ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae):** ComfyUI node for Hybrid Inference.
