@@ -6,7 +6,7 @@ VAE encode is used for training, image-to-image and image-to-video - turning int

These tables demonstrate the VRAM requirements for VAE encode with SD v1 and SD XL on different GPUs.

-For the majority of these GPUs the memory usage % dictates other models (text encoders, UNet/Transformer) must be offloaded, or tiled decoding has to be used which increases time taken and impacts quality.
+For the majority of these GPUs, the memory usage % dictates that other models (text encoders, UNet/Transformer) must be offloaded, or that tiled encoding has to be used, which increases the time taken and impacts quality.

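For context, here is a minimal sketch (not part of this commit) of what those two local workarounds look like in diffusers, assuming an SD v1.5 pipeline; remote VAE encode avoids the need for either:

```python
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
)

# Workaround 1: keep only the active component on the GPU, offloading the rest to CPU.
pipe.enable_model_cpu_offload()

# Workaround 2: run the VAE on tiles of large images instead of the full image,
# lowering peak VRAM at the cost of extra time and possible tile seams.
pipe.enable_vae_tiling()
```
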
<details><summary>SD v1.5</summary>


| | **Endpoint** | **Model** |
| :-: | :-----------: | :--------: |
-| **Stable Diffusion v1** | [https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud](https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) |
-| **Stable Diffusion XL** | [https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud](https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) |
-| **Flux** | [https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud](https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) |
+| **Stable Diffusion v1** | [https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud](https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud) | [`stabilityai/sd-vae-ft-mse`](https://hf.co/stabilityai/sd-vae-ft-mse) |
+| **Stable Diffusion XL** | [https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud](https://xjqqhmyn62rog84g.us-east-1.aws.endpoints.huggingface.cloud) | [`madebyollin/sdxl-vae-fp16-fix`](https://hf.co/madebyollin/sdxl-vae-fp16-fix) |
+| **Flux** | [https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud](https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud) | [`black-forest-labs/FLUX.1-schnell`](https://hf.co/black-forest-labs/FLUX.1-schnell) |


> [!TIP]
@@ -51,269 +51,91 @@ from diffusers.utils.remote_utils import remote_encode

### Basic example

-Here, we show how to use the remote VAE on random tensors.
-
-<details><summary>Code</summary>
-
-```python
-image = remote_decode(
-    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=torch.randn([1, 4, 64, 64], dtype=torch.float16),
-    scaling_factor=0.18215,
-)
-```
-
-</details>
+Let's encode an image and then decode it to demonstrate the round trip.

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/output.png"/>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"/>
</figure>

-Usage for Flux is slightly different. Flux latents are packed so we need to send the `height` and `width`.
-
<details><summary>Code</summary>

```python
-image = remote_decode(
-    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=torch.randn([1, 4096, 64], dtype=torch.float16),
-    height=1024,
-    width=1024,
+from diffusers.utils import load_image
+from diffusers.utils.remote_utils import remote_decode, remote_encode
+
+image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg?download=true")
+
+latent = remote_encode(
+    endpoint="https://ptccx55jz97f9zgo.us-east-1.aws.endpoints.huggingface.cloud/",
+    image=image,
    scaling_factor=0.3611,
    shift_factor=0.1159,
)
-```
-
-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/flux_random_latent.png"/>
-</figure>
-
-Finally, an example for HunyuanVideo.
-
-<details><summary>Code</summary>
-
-```python
-video = remote_decode(
-    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=torch.randn([1, 16, 3, 40, 64], dtype=torch.float16),
-    output_type="mp4",
+decoded = remote_decode(
+    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    scaling_factor=0.3611,
+    shift_factor=0.1159,
)
-with open("video.mp4", "wb") as f:
-    f.write(video)
```

</details>

<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <video
-    alt="queue.mp4"
-    autoplay loop autobuffer muted playsinline
-  >
-    <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/video_1.mp4" type="video/mp4">
-  </video>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/decoded.png"/>
</figure>
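
The example above uses the Flux VAE. As a sketch (not part of this commit), the same round trip with the SD v1 endpoints from the table above only needs `scaling_factor`; the URLs and factor below are reused from this page:

```python
from diffusers.utils import load_image
from diffusers.utils.remote_utils import remote_decode, remote_encode

image = load_image("https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg?download=true")

# Encode with the SD v1 VAE encode endpoint (4-channel latents, scaling_factor only).
latent = remote_encode(
    endpoint="https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud/",
    image=image,
    scaling_factor=0.18215,
)

# Decode back to an image with the SD v1 decode endpoint.
decoded = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.18215,
)
decoded.save("decoded_sd.png")
```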


### Generation

-But we want to use the VAE on an actual pipeline to get an actual image, not random noise. The example below shows how to do it with SD v1.5.
+Now let's look at a generation example: we'll remotely encode the init image, generate with the pipeline, and then remotely decode the result!

<details><summary>Code</summary>

```python
-from diffusers import StableDiffusionPipeline
+import torch
+from diffusers import StableDiffusionImg2ImgPipeline
+from diffusers.utils import load_image
+from diffusers.utils.remote_utils import remote_decode, remote_encode

-pipe = StableDiffusionPipeline.from_pretrained(
+pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "stable-diffusion-v1-5/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
    variant="fp16",
    vae=None,
).to("cuda")

-prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"
+init_image = load_image(
+    "https://raw.githubusercontent.com/CompVis/stable-diffusion/main/assets/stable-samples/img2img/sketch-mountains-input.jpg"
+)
+init_image = init_image.resize((768, 512))

-latent = pipe(
-    prompt=prompt,
-    output_type="latent",
-).images
-image = remote_decode(
-    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=latent,
+init_latent = remote_encode(
+    endpoint="https://qc6479g0aac6qwy9.us-east-1.aws.endpoints.huggingface.cloud/",
+    image=init_image,
    scaling_factor=0.18215,
)
-image.save("test.jpg")
-```
-
-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/test.jpg"/>
-</figure>
-
-Here’s another example with Flux.
-
-<details><summary>Code</summary>
-
-```python
-from diffusers import FluxPipeline
-
-pipe = FluxPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-schnell",
-    torch_dtype=torch.bfloat16,
-    vae=None,
-).to("cuda")
-
-prompt = "Strawberry ice cream, in a stylish modern glass, coconut, splashing milk cream and honey, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious"

+prompt = "A fantasy landscape, trending on artstation"
latent = pipe(
    prompt=prompt,
-    guidance_scale=0.0,
-    num_inference_steps=4,
+    image=init_latent,
+    strength=0.75,
    output_type="latent",
).images
-image = remote_decode(
-    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
-    tensor=latent,
-    height=1024,
-    width=1024,
-    scaling_factor=0.3611,
-    shift_factor=0.1159,
-)
-image.save("test.jpg")
-```
-
-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/test_1.jpg"/>
-</figure>

-Here’s an example with HunyuanVideo.
-
-<details><summary>Code</summary>
-
-```python
-from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
-
-model_id = "hunyuanvideo-community/HunyuanVideo"
-transformer = HunyuanVideoTransformer3DModel.from_pretrained(
-    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
-)
-pipe = HunyuanVideoPipeline.from_pretrained(
-    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16
-).to("cuda")
-
-latent = pipe(
-    prompt="A cat walks on the grass, realistic",
-    height=320,
-    width=512,
-    num_frames=61,
-    num_inference_steps=30,
-    output_type="latent",
-).frames
-
-video = remote_decode(
-    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
+image = remote_decode(
+    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
-    output_type="mp4",
-)
-
-if isinstance(video, bytes):
-    with open("video.mp4", "wb") as f:
-        f.write(video)
-```
-
-</details>
-
-<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <video
-    alt="queue.mp4"
-    autoplay loop autobuffer muted playsinline
-  >
-    <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/video.mp4" type="video/mp4">
-  </video>
-</figure>
-
-
-### Queueing
-
-One of the great benefits of using a remote VAE is that we can queue multiple generation requests. While the current latent is being processed for decoding, we can already queue another one. This helps improve concurrency.
-
-
-<details><summary>Code</summary>
-
-```python
-import queue
-import threading
-from IPython.display import display
-from diffusers import StableDiffusionPipeline
-
-def decode_worker(q: queue.Queue):
-    while True:
-        item = q.get()
-        if item is None:
-            break
-        image = remote_decode(
-            endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
-            tensor=item,
-            scaling_factor=0.18215,
-        )
-        display(image)
-        q.task_done()
-
-q = queue.Queue()
-thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
-thread.start()
-
-def decode(latent: torch.Tensor):
-    q.put(latent)
-
-prompts = [
-    "Blueberry ice cream, in a stylish modern glass, ice cubes, nuts, mint leaves, splashing milk cream, in a gradient purple background, fluid motion, dynamic movement, cinematic lighting, Mysterious",
-    "Lemonade in a glass, mint leaves, in an aqua and white background, flowers, ice cubes, halo, fluid motion, dynamic movement, soft lighting, digital painting, rule of thirds composition, Art by Greg rutkowski, Coby whitmore",
-    "Comic book art, beautiful, vintage, pastel neon colors, extremely detailed pupils, delicate features, light on face, slight smile, Artgerm, Mary Blair, Edmund Dulac, long dark locks, bangs, glowing, fashionable style, fairytale ambience, hot pink.",
-    "Masterpiece, vanilla cone ice cream garnished with chocolate syrup, crushed nuts, choco flakes, in a brown background, gold, cinematic lighting, Art by WLOP",
-    "A bowl of milk, falling cornflakes, berries, blueberries, in a white background, soft lighting, intricate details, rule of thirds, octane render, volumetric lighting",
-    "Cold Coffee with cream, crushed almonds, in a glass, choco flakes, ice cubes, wet, in a wooden background, cinematic lighting, hyper realistic painting, art by Carne Griffiths, octane render, volumetric lighting, fluid motion, dynamic movement, muted colors,",
-]
-
-pipe = StableDiffusionPipeline.from_pretrained(
-    "Lykon/dreamshaper-8",
-    torch_dtype=torch.float16,
-    vae=None,
-).to("cuda")
-
-pipe.unet = pipe.unet.to(memory_format=torch.channels_last)
-pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)
-
-_ = pipe(
-    prompt=prompts[0],
-    output_type="latent",
+    scaling_factor=0.18215,
)
-
-for prompt in prompts:
-    latent = pipe(
-        prompt=prompt,
-        output_type="latent",
-    ).images
-    decode(latent)
-
-q.put(None)
-thread.join()
+image.save("fantasy_landscape.jpg")
```

</details>

-
<figure class="image flex flex-col items-center justify-center text-center m-0 w-full">
-  <video
-    alt="queue.mp4"
-    autoplay loop autobuffer muted playsinline
-  >
-    <source src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/queue.mp4" type="video/mp4">
-  </video>
+  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/blog/remote_vae/fantasy_landscape.png"/>
</figure>

## Integrations