Commit e5448f2

vae_encode docs
1 parent e70bdb2 commit e5448f2

1 file changed: +40 -218 lines changed

docs/source/en/hybrid_inference/vae_encode.md

Lines changed: 40 additions & 218 deletions
@@ -6,7 +6,7 @@ VAE encode is used for training, image-to-image and image-to-video - turning int

These tables demonstrate the VRAM requirements for VAE encode with SD v1 and SD XL on different GPUs.

-For the majority of these GPUs the memory usage % dictates other models (text encoders, UNet/Transformer) must be offloaded, or tiled decoding has to be used which increases time taken and impacts quality.
+For the majority of these GPUs the memory usage % dictates that other models (text encoders, UNet/Transformer) must be offloaded, or tiled encoding has to be used, which increases time taken and impacts quality.

<details><summary>SD v1.5</summary>

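For reference (outside this diff): the tiled encoding fallback mentioned in the changed line above is the local alternative when a full-resolution encode does not fit in VRAM. A minimal sketch with a locally loaded VAE, where the model choice and image URL are arbitrary:

```python
# Tiled VAE encoding: the VAE works on overlapping tiles instead of the whole
# image at once, trading speed (and possible seam artifacts) for lower peak VRAM.
import torch
from diffusers import AutoencoderKL
from diffusers.image_processor import VaeImageProcessor
from diffusers.utils import load_image

vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16
).to("cuda")
vae.enable_tiling()  # see also vae.enable_slicing() for batched inputs

image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
pixels = VaeImageProcessor().preprocess(image).to(device="cuda", dtype=torch.float16)

with torch.no_grad():
    latent = vae.encode(pixels).latent_dist.sample() * vae.config.scaling_factor
```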
@@ -28,9 +28,9 @@ TODO

| | **Endpoint** | **Model** |
|:-:|:-----------:|:--------:|
-| **Stable Diffusion v1** | [https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud](https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) |
-| **Stable Diffusion XL** | [https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud](https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) |
-| **Flux** | [https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud](https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) |
+| **Stable Diffusion v1** | [https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud](https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) |
+| **Stable Diffusion XL** | [https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud](https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) |
+| **Flux** | [https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud](https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) |


> [!TIP]
@@ -51,269 +51,91 @@ from diffusers.utils.remote_utils import remote_encode

### Basic example

-Here, we show how to use the remote VAE on random tensors.
-
-<details><summary>Code</summary>
-
-```python
-image = remote_decode(
-    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
-    scaling_factor=0.18215,
-)
-```
-
-</details>
+Let's encode an image, then decode it to demonstrate.

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/output.png"/>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"/>
</figure>

-Usage for Flux is slightly different. Flux latents are packed so we need to send the `height` and `width`.
-
<details><summary>Code</summary>

```python
-image = remote_decode(
-    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
-    height=1024,
-    width=1024,
+from diffusers.utils import load_image
+from diffusers.utils.remote_utils import remote_decode
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg?download=true")
+
+latent = remote_encode(
+    endpoint="https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud/",
+    image=image,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
-```

-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/flux_random_latent.png"/>
-</figure>
-
-Finally, an example for HunyuanVideo.
-
-<details><summary>Code</summary>
-
-```python
-video = remote_decode(
-    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
-    output_type="mp4",
+decoded = remote_decode(
+    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    scaling_factor=0.3611,
+    shift_factor=0.1159,
)
-with open("video.mp4", "wb") as f:
-    f.write(video)
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <video
-    alt="queue.mp4"
-    autoplay loop autobuffer muted playsinline
-  >
-  <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/video_1.mp4" type="video/mp4">
-  </video>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/decoded.png"/>
</figure>
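A note on the magic numbers in the example above (not part of the diff): `scaling_factor=0.3611` and `shift_factor=0.1159` are the Flux VAE's configured latent normalization values, and the `0.18215` used in the Generation example below is SD v1's. If you would rather not hardcode them, a sketch of reading them from the model repos in the endpoint table; `load_config` fetches only the small `config.json`, not the VAE weights, so no local VAE is loaded:

```python
from diffusers import AutoencoderKL

# Flux stores both factors in its VAE config.
flux_cfg = AutoencoderKL.load_config("black-forest-labs/FLUX.1-schnell", subfolder="vae")
print(flux_cfg["scaling_factor"], flux_cfg["shift_factor"])  # 0.3611 0.1159

# Older VAE repos may omit scaling_factor from config.json, so fall back to the
# documented SD v1 value.
sd_cfg = AutoencoderKL.load_config("stabilityai/sd-vae-ft-mse")
print(sd_cfg.get("scaling_factor", 0.18215))
```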


### Generation

-But we want to use the VAE on an actual pipeline to get an actual image, not random noise. The example below shows how to do it with SD v1.5.
+Now let's look at a generation example: we'll encode the image, generate, then remotely decode too!

<details><summary>Code</summary>

```python
-from diffusers import StableDiffusionPipeline
+import torch
+from diffusers import StableDiffusionImg2ImgPipeline
+from diffusers.utils import load_image
+from diffusers.utils.remote_utils import remote_decode, remote_encode

-pipe = StableDiffusionPipeline.from_pretrained(
+pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
    vae=None,
).to("cuda")

-prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"
+init_image = load_image(
+    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+)
+init_image = init_image.resize((768, 512))

-latent = pipe(
-    prompt=prompt,
-    output_type="latent",
-).images
-image = remote_decode(
-    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=latent,
+init_latent = remote_encode(
+    endpoint="https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud/",
+    image=init_image,
    scaling_factor=0.18215,
)
-image.save("test.jpg")
-```
-
-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/test.jpg"/>
-</figure>
-
-Here’s another example with Flux.
-
-<details><summary>Code</summary>
-
-```python
-from diffusers import FluxPipeline
-
-pipe = FluxPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-schnell",
-    torch_dtype=torch.bfloat16,
-    vae=None,
-).to("cuda")
-
-prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"

+prompt = "A fantasy landscape, trending on artstation"
latent = pipe(
    prompt=prompt,
-    guidance_scale=0.0,
-    num_inference_steps=4,
+    image=init_latent,
+    strength=0.75,
    output_type="latent",
).images
-image = remote_decode(
-    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=latent,
-    height=1024,
-    width=1024,
-    scaling_factor=0.3611,
-    shift_factor=0.1159,
-)
-image.save("test.jpg")
-```
-
-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/test_1.jpg"/>
-</figure>

-Here’s an example with HunyuanVideo.
-
-<details><summary>Code</summary>
-
-```python
-from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
-
-model_id = "hunyuanvideo-community/HunyuanVideo"
-transformer = HunyuanVideoTransformer3DModel.from_pretrained(
-    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
-)
-pipe = HunyuanVideoPipeline.from_pretrained(
-    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16
-).to("cuda")
-
-latent = pipe(
-    prompt="A cat walks on the grass, realistic",
-    height=320,
-    width=512,
-    num_frames=61,
-    num_inference_steps=30,
-    output_type="latent",
-).frames
-
-video = remote_decode(
-    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
+image = remote_decode(
+    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
-    output_type="mp4",
-)
-
-if isinstance(video, bytes):
-    with open("video.mp4", "wb") as f:
-        f.write(video)
-```
-
-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <video
-    alt="queue.mp4"
-    autoplay loop autobuffer muted playsinline
-  >
-  <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/video.mp4" type="video/mp4">
-  </video>
-</figure>
-
-
-### Queueing
-
-One of the great benefits of using a remote VAE is that we can queue multiple generation requests. While the current latent is being processed for decoding, we can already queue another one. This helps improve concurrency.
-
-
-<details><summary>Code</summary>
-
-```python
-import queue
-import threading
-from IPython.display import display
-from diffusers import StableDiffusionPipeline
-
-def decode_worker(q: queue.Queue):
-    while True:
-        item = q.get()
-        if item is None:
-            break
-        image = remote_decode(
-            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
-            tensor=item,
-            scaling_factor=0.18215,
-        )
-        display(image)
-        q.task_done()
-
-q = queue.Queue()
-thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
-thread.start()
-
-def decode(latent: torch.Tensor):
-    q.put(latent)
-
-prompts = [
-    "Blueberry ice cream, in a stylish modern glass , ice cubes, nuts, mint leaves, splashing milk cream, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious",
-    "Lemonade in a glass, mint leaves, in an aqua and white background, flowers, ice cubes, halo, fluid motion, dynamic movement, soft lighting, digital painting, rule of thirds composition, Art by Greg rutkowski, Coby whitmore",
-    "Comic book art, beautiful, vintage, pastel neon colors, extremely detailed pupils, delicate features, light on face, slight smile, Artgerm, Mary Blair, Edmund Dulac, long dark locks, bangs, glowing, fashionable style, fairytale ambience, hot pink.",
-    "Masterpiece, vanilla cone ice cream garnished with chocolate syrup, crushed nuts, choco flakes, in a brown background, gold, cinematic lighting, Art by WLOP",
-    "A bowl of milk, falling cornflakes, berries, blueberries, in a white background, soft lighting, intricate details, rule of thirds, octane render, volumetric lighting",
-    "Cold Coffee with cream, crushed almonds, in a glass, choco flakes, ice cubes, wet, in a wooden background, cinematic lighting, hyper realistic painting, art by Carne Griffiths, octane render, volumetric lighting, fluid motion, dynamic movement, muted colors,",
-]
-
-pipe = StableDiffusionPipeline.from_pretrained(
-    "Lykon/dreamshaper-8",
-    torch_dtype=torch.float16,
-    vae=None,
-).to("cuda")
-
-pipe.unet = pipe.unet.to(memory_format=torch.channels_last)
-pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-
-_ = pipe(
-    prompt=prompts[0],
-    output_type="latent",
+    scaling_factor=0.18215,
)
-
-for prompt in prompts:
-    latent = pipe(
-        prompt=prompt,
-        output_type="latent",
-    ).images
-    decode(latent)
-
-q.put(None)
-thread.join()
+image.save("fantasy_landscape.jpg")
```

</details>

-
<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <video
-    alt="queue.mp4"
-    autoplay loop autobuffer muted playsinline
-  >
-  <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/queue.mp4" type="video/mp4">
-  </video>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/fantasy_landscape.png"/>
</figure>

## Integrations
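For reference (outside this diff): the removed Queueing section's idea still applies to the new encode/decode flow, since remote decode requests can overlap with local denoising of the next prompt. A minimal sketch, assuming the `pipe` and `init_latent` objects from the Generation example above and arbitrary prompts:

```python
# Overlap remote decoding with local generation: each latent is shipped off for
# decoding while the GPU starts denoising the next prompt.
from concurrent.futures import ThreadPoolExecutor

from diffusers.utils.remote_utils import remote_decode

prompts = [
    "A fantasy landscape, trending on artstation",
    "A watercolor harbor at dusk, trending on artstation",
]

with ThreadPoolExecutor(max_workers=2) as pool:
    futures = []
    for prompt in prompts:
        latent = pipe(
            prompt=prompt,
            image=init_latent,
            strength=0.75,
            output_type="latent",
        ).images
        futures.append(
            pool.submit(
                remote_decode,
                endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
                tensor=latent,
                scaling_factor=0.18215,
            )
        )
    images = [future.result() for future in futures]

for i, image in enumerate(images):
    image.save(f"fantasy_landscape_{i}.jpg")
```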

0 commit comments