
Commit 53d42e7

init
1 parent 1448b03 commit 53d42e7

5 files changed: +287 -572 lines changed

5 files changed

+287
-572
lines changed

docs/source/en/_toctree.yml

Lines changed: 4 additions & 12 deletions
@@ -60,6 +60,8 @@
   title: Batch inference
 - local: training/distributed_inference
   title: Distributed inference
+- local: hybrid_inference/overview
+  title: Remote inference

 - title: Inference optimization
   isExpanded: false
@@ -91,18 +93,6 @@
 - local: using-diffusers/image_quality
   title: FreeU

-- title: Hybrid Inference
-  isExpanded: false
-  sections:
-  - local: hybrid_inference/overview
-    title: Overview
-  - local: hybrid_inference/vae_decode
-    title: VAE Decode
-  - local: hybrid_inference/vae_encode
-    title: VAE Encode
-  - local: hybrid_inference/api_reference
-    title: API Reference
-
 - title: Modular Diffusers
   isExpanded: false
   sections:
@@ -280,6 +270,8 @@
   title: Outputs
 - local: api/quantization
   title: Quantization
+- local: hybrid_inference/api_reference
+  title: Remote inference
 - title: Modular
   sections:
   - local: api/modular_diffusers/pipeline
docs/source/en/hybrid_inference/api_reference.md

Lines changed: 5 additions & 3 deletions
@@ -1,9 +1,11 @@
-# Hybrid Inference API Reference
+# Remote inference

-## Remote Decode
+Remote inference provides access to an [Inference Endpoint](https://huggingface.co/docs/inference-endpoints/index) to offload local generation requirements for decoding and encoding.
+
+## remote_decode

 [[autodoc]] utils.remote_utils.remote_decode

-## Remote Encode
+## remote_encode

 [[autodoc]] utils.remote_utils.remote_encode
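
Both helpers take an Inference Endpoint URL plus the data to process. A minimal, illustrative sketch (assuming the Stable Diffusion v1 endpoint listed in the overview and the usual 0.18215 SD v1 VAE scaling factor; the random latent only demonstrates the expected tensor shape):

```py
import torch
from diffusers.utils.remote_utils import remote_decode

# A random SD v1-shaped latent (batch, channels, height/8, width/8); in practice this
# comes from a pipeline run with output_type="latent".
latent = torch.randn(1, 4, 64, 64, dtype=torch.float16)

image = remote_decode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.18215,  # assumed SD v1 value; check the reference above
)
image.save("decoded.jpg")
```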

docs/source/en/hybrid_inference/overview.md

Lines changed: 278 additions & 29 deletions
@@ -10,51 +10,300 @@ an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express o
 specific language governing permissions and limitations under the License.
 -->

-# Hybrid Inference
+# Remote inference

-**Empowering local AI builders with Hybrid Inference**
+> [!TIP]
+> This is currently an experimental feature, and if you have any feedback, please feel free to leave it [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml).

+Remote inference offloads the decoding and encoding process to a remote endpoint to relax the memory requirements for local inference with large models. This feature is powered by [Inference Endpoints](https://huggingface.co/docs/inference-endpoints/index).

-> [!TIP]
-> Hybrid Inference is an [experimental feature](https://huggingface.co/blog/remote_vae).
-> Feedback can be provided [here](https://github.com/huggingface/diffusers/issues/new?template=remote-vae-pilot-feedback.yml).
+This guide will show you how to encode and decode latents with remote inference.
+
+## Encoding
+
+Encoding converts images and videos into latent representations. Refer to the table below for the supported VAEs.
+
+| Model | Endpoint | VAE |
+|---|---|---|
+| Stable Diffusion v1 | https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
+| Stable Diffusion XL | https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud | [madebyollin/sdxl-vae-fp16-fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) |
+| Flux | https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud | [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) |
+
+```py
+import torch
+from diffusers import FluxPipeline
+from diffusers.utils import load_image
+from diffusers.utils.remote_utils import remote_decode, remote_encode
+
+pipeline = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-schnell",
+    torch_dtype=torch.float16,
+    vae=None,
+    device_map="cuda"
+)
+
+init_image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
+)
+init_image = init_image.resize((768, 512))
+
+init_latent = remote_encode(
+    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud",
+    image=init_image,
+    scaling_factor=0.3611,
+    shift_factor=0.1159
+)
+```
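
The same call pattern works for the other endpoints in the table; only the endpoint URL and the scaling parameters change. A minimal sketch for the Stable Diffusion v1 VAE, assuming the usual 0.18215 scaling factor and no `shift_factor` (check the API reference for the exact values):

```py
from diffusers.utils import load_image
from diffusers.utils.remote_utils import remote_encode

init_image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
init_image = init_image.resize((512, 512))

# Stable Diffusion v1 endpoint from the table above
init_latent = remote_encode(
    endpoint="https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud/",
    image=init_image,
    scaling_factor=0.18215,  # assumed SD v1 VAE scaling factor
)
```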
+
+## Decoding
+
+Decoding converts latent representations back into images or videos. Refer to the table below for the supported VAEs.
+
+| Model | Endpoint | VAE |
+|---|---|---|
+| Stable Diffusion v1 | https://q1bj3bpq6kzilnsu.us-east-1.aws.endpoints.huggingface.cloud | [stabilityai/sd-vae-ft-mse](https://huggingface.co/stabilityai/sd-vae-ft-mse) |
+| Stable Diffusion XL | https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud | [madebyollin/sdxl-vae-fp16-fix](https://huggingface.co/madebyollin/sdxl-vae-fp16-fix) |
+| Flux | https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud | [black-forest-labs/FLUX.1-schnell](https://huggingface.co/black-forest-labs/FLUX.1-schnell) |
+| HunyuanVideo | https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud | [hunyuanvideo-community/HunyuanVideo](https://huggingface.co/hunyuanvideo-community/HunyuanVideo) |
+
+Set the output type to `"latent"` in the pipeline and set the `vae` to `None`. Pass the latents to the [`~utils.remote_decode`] function. For Flux, the latents are packed, so the `height` and `width` also need to be passed. The specific `scaling_factor` and `shift_factor` values for each model can be found in the API reference.
+
+<hfoptions id="decode">
+<hfoption id="Flux">
+
+```py
+import torch
+from diffusers import FluxPipeline
+from diffusers.utils.remote_utils import remote_decode
+
+pipeline = FluxPipeline.from_pretrained(
+    "black-forest-labs/FLUX.1-schnell",
+    torch_dtype=torch.bfloat16,
+    vae=None,
+    device_map="cuda"
+)
+
+prompt = """
+A photorealistic Apollo-era photograph of a cat in a small astronaut suit with a bubble helmet, standing on the Moon and holding a flagpole planted in the dusty lunar soil. The flag shows a colorful paw-print emblem. Earth glows in the black sky above the stark gray surface, with sharp shadows and high-contrast lighting like vintage NASA photos.
+"""
+
+latent = pipeline(
+    prompt=prompt,
+    guidance_scale=0.0,
+    num_inference_steps=4,
+    output_type="latent",
+).images
+image = remote_decode(
+    endpoint="https://whhx50ex1aryqvw6.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    height=1024,
+    width=1024,
+    scaling_factor=0.3611,
+    shift_factor=0.1159,
+)
+image.save("image.jpg")
+```
+
+</hfoption>
+<hfoption id="HunyuanVideo">
+
+```py
+import torch
+from diffusers import HunyuanVideoPipeline, HunyuanVideoTransformer3DModel
+from diffusers.utils.remote_utils import remote_decode
+
+model_id = "hunyuanvideo-community/HunyuanVideo"
+transformer = HunyuanVideoTransformer3DModel.from_pretrained(
+    model_id, subfolder="transformer", torch_dtype=torch.bfloat16
+)
+pipeline = HunyuanVideoPipeline.from_pretrained(
+    model_id, transformer=transformer, vae=None, torch_dtype=torch.float16, device_map="cuda"
+)
+
+latent = pipeline(
+    prompt="A cat walks on the grass, realistic",
+    height=320,
+    width=512,
+    num_frames=61,
+    num_inference_steps=30,
+    output_type="latent",
+).frames
+
+video = remote_decode(
+    endpoint="https://o7ywnmrahorts457.us-east-1.aws.endpoints.huggingface.cloud/",
+    tensor=latent,
+    output_type="mp4",
+)
+
+if isinstance(video, bytes):
+    with open("video.mp4", "wb") as f:
+        f.write(video)
+```
+
+</hfoption>
+</hfoptions>
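
For the UNet-based models (Stable Diffusion v1 and SDXL), the latents are not packed, so `height` and `width` are not required. A minimal SDXL sketch, assuming the SDXL endpoint from the table above and the 0.13025 SDXL scaling factor:

```py
import torch
from diffusers import StableDiffusionXLPipeline
from diffusers.utils.remote_utils import remote_decode

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0",
    torch_dtype=torch.float16,
    vae=None,
    device_map="cuda"
)

latent = pipeline(
    prompt="an astronaut riding a horse on the Moon, photorealistic",
    output_type="latent",
).images

# SDXL endpoint from the table above
image = remote_decode(
    endpoint="https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud/",
    tensor=latent,
    scaling_factor=0.13025,
)
image.save("image.jpg")
```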
+
+## Queuing
+
+Remote inference supports queuing to process multiple generation requests. While the current latent is being decoded, you can queue the next prompt.
+
+```py
+import queue
+import threading
+import torch
+from IPython.display import display
+from diffusers import StableDiffusionXLPipeline
+from diffusers.utils.remote_utils import remote_decode
+
+def decode_worker(q: queue.Queue):
+    while True:
+        item = q.get()
+        if item is None:
+            break
+        image = remote_decode(
+            endpoint="https://x2dmsqunjd6k9prw.us-east-1.aws.endpoints.huggingface.cloud/",
+            tensor=item,
+            scaling_factor=0.13025,
+        )
+        display(image)
+        q.task_done()
+
+q = queue.Queue()
+thread = threading.Thread(target=decode_worker, args=(q,), daemon=True)
+thread.start()
+
+def decode(latent: torch.Tensor):
+    q.put(latent)
+
+prompts = [
+    "A grainy Apollo-era style photograph of a cat in a snug astronaut suit with a bubble helmet, standing on the lunar surface and gripping a flag with a paw-print emblem. The gray Moon landscape stretches behind it, Earth glowing vividly in the black sky, shadows crisp and high-contrast.",
+    "A vintage 1960s sci-fi pulp magazine cover illustration of a heroic cat astronaut planting a flag on the Moon. Bold, saturated colors, exaggerated space gear, playful typography floating in the background, Earth painted in bright blues and greens.",
+    "A hyper-detailed cinematic shot of a cat astronaut on the Moon holding a fluttering flag, fur visible through the helmet glass, lunar dust scattering under its feet. The vastness of space and Earth in the distance create an epic, awe-inspiring tone.",
+    "A colorful cartoon drawing of a happy cat wearing a chunky, oversized spacesuit, proudly holding a flag with a big paw print on it. The Moon’s surface is simplified with craters drawn like doodles, and Earth in the sky has a smiling face.",
+    "A monochrome 1969-style press photo of a “first cat on the Moon” moment. The cat, in a tiny astronaut suit, stands by a planted flag, with grainy textures, scratches, and a blurred Earth in the background, mimicking old archival space photos."
+]
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0",
+    torch_dtype=torch.float16,
+    vae=None,
+    device_map="cuda"
+)
+
+pipeline.unet = pipeline.unet.to(memory_format=torch.channels_last)
+pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+
+_ = pipeline(
+    prompt=prompts[0],
+    output_type="latent",
+)
+
+for prompt in prompts:
+    latent = pipeline(
+        prompt=prompt,
+        output_type="latent",
+    ).images
+    decode(latent)
+
+q.put(None)
+thread.join()
+```
+
+## Benchmarks
+
+The tables below demonstrate the memory requirements for encoding and decoding with Stable Diffusion v1.5 and SDXL on different GPUs.

+For the majority of these GPUs, the memory usage dictates whether the other models (text encoders, UNet/transformer) need to be offloaded, or whether tiled encoding and decoding has to be used. Both techniques increase inference time and can impact quality.
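
For comparison, the local alternatives the benchmarks refer to look roughly like the sketch below: CPU offloading of the other models plus tiled VAE decoding/encoding, shown here with SDXL as an assumed example.

```py
import torch
from diffusers import StableDiffusionXLPipeline

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
)
# Keep components on the CPU and only move them to the GPU while they run
pipeline.enable_model_cpu_offload()
# Decode/encode the latent in tiles to cap peak VAE memory
pipeline.vae.enable_tiling()

image = pipeline("A cat walks on the grass, realistic").images[0]
image.save("image.jpg")
```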

+<details><summary>Encoding - Stable Diffusion v1.5</summary>

-## Why use Hybrid Inference?
+| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
+|:------------------------------|:-------------|-----------------:|-------------:|--------------------:|-------------------:|
+| NVIDIA GeForce RTX 4090 | 512x512 | 0.015 | 3.51901 | 0.015 | 3.51901 |
+| NVIDIA GeForce RTX 4090 | 256x256 | 0.004 | 1.3154 | 0.005 | 1.3154 |
+| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.402 | 47.1852 | 0.496 | 3.51901 |
+| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.078 | 12.2658 | 0.094 | 3.51901 |
+| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.023 | 5.30105 | 0.023 | 5.30105 |
+| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.006 | 1.98152 | 0.006 | 1.98152 |
+| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 0.574 | 71.08 | 0.656 | 5.30105 |
+| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.111 | 18.4772 | 0.14 | 5.30105 |
+| NVIDIA GeForce RTX 3090 | 512x512 | 0.032 | 3.52782 | 0.032 | 3.52782 |
+| NVIDIA GeForce RTX 3090 | 256x256 | 0.01 | 1.31869 | 0.009 | 1.31869 |
+| NVIDIA GeForce RTX 3090 | 2048x2048 | 0.742 | 47.3033 | 0.954 | 3.52782 |
+| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.136 | 12.2965 | 0.207 | 3.52782 |
+| NVIDIA GeForce RTX 3080 | 512x512 | 0.036 | 8.51761 | 0.036 | 8.51761 |
+| NVIDIA GeForce RTX 3080 | 256x256 | 0.01 | 3.18387 | 0.01 | 3.18387 |
+| NVIDIA GeForce RTX 3080 | 2048x2048 | 0.863 | 86.7424 | 1.191 | 8.51761 |
+| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.157 | 29.6888 | 0.227 | 8.51761 |
+| NVIDIA GeForce RTX 3070 | 512x512 | 0.051 | 10.6941 | 0.051 | 10.6941 |
+| NVIDIA GeForce RTX 3070 | 256x256 | 0.015 | 3.99743 | 0.015 | 3.99743 |
+| NVIDIA GeForce RTX 3070 | 2048x2048 | 1.217 | 96.054 | 1.482 | 10.6941 |
+| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.223 | 37.2751 | 0.327 | 10.6941 |

-Hybrid Inference offers a fast and simple way to offload local generation requirements.
+</details>

-- 🚀 **Reduced Requirements:** Access powerful models without expensive hardware.
-- 💎 **Without Compromise:** Achieve the highest quality without sacrificing performance.
-- 💰 **Cost Effective:** It's free! 🤑
-- 🎯 **Diverse Use Cases:** Fully compatible with Diffusers 🧨 and the wider community.
-- 🔧 **Developer-Friendly:** Simple requests, fast responses.
+<details><summary>Encoding - SDXL</summary>

----
+| GPU | Resolution | Time (seconds) | Memory Consumed (%) | Tiled Time (seconds) | Tiled Memory (%) |
+|:------------------------------|:-------------|-----------------:|----------------------:|-----------------------:|-------------------:|
+| NVIDIA GeForce RTX 4090 | 512x512 | 0.029 | 4.95707 | 0.029 | 4.95707 |
+| NVIDIA GeForce RTX 4090 | 256x256 | 0.007 | 2.29666 | 0.007 | 2.29666 |
+| NVIDIA GeForce RTX 4090 | 2048x2048 | 0.873 | 66.3452 | 0.863 | 15.5649 |
+| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.142 | 15.5479 | 0.143 | 15.5479 |
+| NVIDIA GeForce RTX 4080 SUPER | 512x512 | 0.044 | 7.46735 | 0.044 | 7.46735 |
+| NVIDIA GeForce RTX 4080 SUPER | 256x256 | 0.01 | 3.4597 | 0.01 | 3.4597 |
+| NVIDIA GeForce RTX 4080 SUPER | 2048x2048 | 1.317 | 87.1615 | 1.291 | 23.447 |
+| NVIDIA GeForce RTX 4080 SUPER | 1024x1024 | 0.213 | 23.4215 | 0.214 | 23.4215 |
+| NVIDIA GeForce RTX 3090 | 512x512 | 0.058 | 5.65638 | 0.058 | 5.65638 |
+| NVIDIA GeForce RTX 3090 | 256x256 | 0.016 | 2.45081 | 0.016 | 2.45081 |
+| NVIDIA GeForce RTX 3090 | 2048x2048 | 1.755 | 77.8239 | 1.614 | 18.4193 |
+| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.265 | 18.4023 | 0.265 | 18.4023 |
+| NVIDIA GeForce RTX 3080 | 512x512 | 0.064 | 13.6568 | 0.064 | 13.6568 |
+| NVIDIA GeForce RTX 3080 | 256x256 | 0.018 | 5.91728 | 0.018 | 5.91728 |
+| NVIDIA GeForce RTX 3080 | 2048x2048 | OOM | OOM | 1.866 | 44.4717 |
+| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.302 | 44.4308 | 0.302 | 44.4308 |
+| NVIDIA GeForce RTX 3070 | 512x512 | 0.093 | 17.1465 | 0.093 | 17.1465 |
+| NVIDIA GeForce RTX 3070 | 256x256 | 0.025 | 7.42931 | 0.026 | 7.42931 |
+| NVIDIA GeForce RTX 3070 | 2048x2048 | OOM | OOM | 2.674 | 55.8355 |
+| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.443 | 55.7841 | 0.443 | 55.7841 |

-## Available Models
+</details>

-* **VAE Decode 🖼️:** Quickly decode latent representations into high-quality images without compromising performance or workflow speed.
-* **VAE Encode 🔢:** Efficiently encode images into latent representations for generation and training.
-* **Text Encoders 📃 (coming soon):** Compute text embeddings for your prompts quickly and accurately, ensuring a smooth and high-quality workflow.
+<details><summary>Decoding - Stable Diffusion v1.5</summary>

----
+| GPU | Resolution | Time (seconds) | Memory (%) | Tiled Time (seconds) | Tiled Memory (%) |
+| --- | --- | --- | --- | --- | --- |
+| NVIDIA GeForce RTX 4090 | 512x512 | 0.031 | 5.60% | 0.031 (0%) | 5.60% |
+| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.148 | 20.00% | 0.301 (+103%) | 5.60% |
+| NVIDIA GeForce RTX 4080 | 512x512 | 0.05 | 8.40% | 0.050 (0%) | 8.40% |
+| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.224 | 30.00% | 0.356 (+59%) | 8.40% |
+| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.066 | 11.30% | 0.066 (0%) | 11.30% |
+| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.284 | 40.50% | 0.454 (+60%) | 11.40% |
+| NVIDIA GeForce RTX 3090 | 512x512 | 0.062 | 5.20% | 0.062 (0%) | 5.20% |
+| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.253 | 18.50% | 0.464 (+83%) | 5.20% |
+| NVIDIA GeForce RTX 3080 | 512x512 | 0.07 | 12.80% | 0.070 (0%) | 12.80% |
+| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.286 | 45.30% | 0.466 (+63%) | 12.90% |
+| NVIDIA GeForce RTX 3070 | 512x512 | 0.102 | 15.90% | 0.102 (0%) | 15.90% |
+| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.421 | 56.30% | 0.746 (+77%) | 16.00% |

-## Integrations
+</details>

-* **[SD.Next](https://github.com/vladmandic/sdnext):** All-in-one UI with direct supports Hybrid Inference.
-* **[ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae):** ComfyUI node for Hybrid Inference.
+<details><summary>Decoding - SDXL</summary>

-## Changelog
+| GPU | Resolution | Time (seconds) | Memory Consumed (%) | Tiled Time (seconds) | Tiled Memory (%) |
+| --- | --- | --- | --- | --- | --- |
+| NVIDIA GeForce RTX 4090 | 512x512 | 0.057 | 10.00% | 0.057 (0%) | 10.00% |
+| NVIDIA GeForce RTX 4090 | 1024x1024 | 0.256 | 35.50% | 0.257 (+0.4%) | 35.50% |
+| NVIDIA GeForce RTX 4080 | 512x512 | 0.092 | 15.00% | 0.092 (0%) | 15.00% |
+| NVIDIA GeForce RTX 4080 | 1024x1024 | 0.406 | 53.30% | 0.406 (0%) | 53.30% |
+| NVIDIA GeForce RTX 4070 Ti | 512x512 | 0.121 | 20.20% | 0.120 (-0.8%) | 20.20% |
+| NVIDIA GeForce RTX 4070 Ti | 1024x1024 | 0.519 | 72.00% | 0.519 (0%) | 72.00% |
+| NVIDIA GeForce RTX 3090 | 512x512 | 0.107 | 10.50% | 0.107 (0%) | 10.50% |
+| NVIDIA GeForce RTX 3090 | 1024x1024 | 0.459 | 38.00% | 0.460 (+0.2%) | 38.00% |
+| NVIDIA GeForce RTX 3080 | 512x512 | 0.121 | 25.60% | 0.121 (0%) | 25.60% |
+| NVIDIA GeForce RTX 3080 | 1024x1024 | 0.524 | 93.00% | 0.524 (0%) | 93.00% |
+| NVIDIA GeForce RTX 3070 | 512x512 | 0.183 | 31.80% | 0.183 (0%) | 31.80% |
+| NVIDIA GeForce RTX 3070 | 1024x1024 | 0.794 | 96.40% | 0.794 (0%) | 96.40% |

-- March 10 2025: Added VAE encode
-- March 2 2025: Initial release with VAE decoding
+</details>

-## Contents

-The documentation is organized into three sections:
+## Resources

-* **VAE Decode** Learn the basics of how to use VAE Decode with Hybrid Inference.
-* **VAE Encode** Learn the basics of how to use VAE Encode with Hybrid Inference.
-* **API Reference** Dive into task-specific settings and parameters.
+- Remote inference is also supported in [SD.Next](https://github.com/vladmandic/sdnext) and [ComfyUI-HFRemoteVae](https://github.com/kijai/ComfyUI-HFRemoteVae).
+- Refer to the [Remote VAEs for decoding with Inference Endpoints](https://huggingface.co/blog/remote_vae) blog post to learn more.
