Commit e3e8b20

Merge branch 'main' into metadata-lora
2 parents a4f78c8 + 3a31b29 commit e3e8b20

File tree: 61 files changed, +1652 -583 lines changed


.github/workflows/pr_tests.yml

Lines changed: 2 additions & 2 deletions
@@ -291,8 +291,8 @@ jobs:
       - name: Failure short reports
         if: ${{ failure() }}
         run: |
-          cat reports/tests_lora_failures_short.txt
-          cat reports/tests_models_lora_failures_short.txt
+          cat reports/tests_peft_main_failures_short.txt
+          cat reports/tests_models_lora_peft_main_failures_short.txt
 
       - name: Test suite reports artifacts
         if: ${{ always() }}

docs/source/en/_toctree.yml

Lines changed: 0 additions & 2 deletions
@@ -180,8 +180,6 @@
     title: Accelerate inference
   - local: optimization/memory
     title: Reduce memory usage
-  - local: optimization/torch2.0
-    title: PyTorch 2.0
   - local: optimization/xformers
     title: xFormers
   - local: optimization/tome

docs/source/en/api/pipelines/deepfloyd_if.md

Lines changed: 1 addition & 1 deletion
@@ -347,7 +347,7 @@ pipe.to("cuda")
 image = pipe(image=image, prompt="<prompt>", strength=0.3).images
 ```
 
-You can also use [`torch.compile`](../../optimization/torch2.0). Note that we have not exhaustively tested `torch.compile`
+You can also use [`torch.compile`](../../optimization/fp16#torchcompile). Note that we have not exhaustively tested `torch.compile`
 with IF and it might not give expected results.
 
 ```py

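For context on the `torch.compile` usage mentioned in the hunk above, here is a minimal sketch, not part of this commit, assuming the public `DeepFloyd/IF-I-XL-v1.0` stage-1 checkpoint (a gated model that requires accepting its license) and a CUDA device.

```py
# Illustrative sketch (not part of this commit): compiling the IF stage-1 UNet.
# Assumes access to the gated DeepFloyd/IF-I-XL-v1.0 checkpoint.
import torch
from diffusers import DiffusionPipeline

pipe = DiffusionPipeline.from_pretrained(
    "DeepFloyd/IF-I-XL-v1.0", variant="fp16", torch_dtype=torch.float16
)
pipe.to("cuda")

# Compile the UNet; as the doc notes, torch.compile with IF is not exhaustively tested.
pipe.unet = torch.compile(pipe.unet, mode="reduce-overhead", fullgraph=True)

image = pipe(prompt="<prompt>").images
```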
docs/source/en/api/pipelines/sana_sprint.md

Lines changed: 34 additions & 0 deletions
@@ -88,12 +88,46 @@ image.save("sana.png")
 
 Users can tweak the `max_timesteps` value for experimenting with the visual quality of the generated outputs. The default `max_timesteps` value was obtained with an inference-time search process. For more details about it, check out the paper.
 
+## Image to Image
+
+The [`SanaSprintImg2ImgPipeline`] is a pipeline for image-to-image generation. It takes an input image and a prompt, and generates a new image based on the input image and the prompt.
+
+```py
+import torch
+from diffusers import SanaSprintImg2ImgPipeline
+from diffusers.utils.loading_utils import load_image
+
+image = load_image(
+    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/penguin.png"
+)
+
+pipe = SanaSprintImg2ImgPipeline.from_pretrained(
+    "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers",
+    torch_dtype=torch.bfloat16)
+pipe.to("cuda")
+
+image = pipe(
+    prompt="a cute pink bear",
+    image=image,
+    strength=0.5,
+    height=832,
+    width=480
+).images[0]
+image.save("output.png")
+```
+
 ## SanaSprintPipeline
 
 [[autodoc]] SanaSprintPipeline
   - all
   - __call__
 
+## SanaSprintImg2ImgPipeline
+
+[[autodoc]] SanaSprintImg2ImgPipeline
+  - all
+  - __call__
+
 
 ## SanaPipelineOutput

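The hunk above also mentions tweaking `max_timesteps` to trade off visual quality. A minimal sketch of that experiment follows; it is not part of this commit and assumes that `SanaSprintPipeline.__call__` accepts a `max_timesteps` argument and that the checkpoint name from the new image-to-image example is available.

```py
# Illustrative sketch (not part of this commit): sweeping max_timesteps to compare
# visual quality. Assumes SanaSprintPipeline.__call__ exposes a max_timesteps argument.
import torch
from diffusers import SanaSprintPipeline

pipe = SanaSprintPipeline.from_pretrained(
    "Efficient-Large-Model/Sana_Sprint_1.6B_1024px_diffusers",
    torch_dtype=torch.bfloat16,
)
pipe.to("cuda")

# Candidate values are placeholders; the doc says the default was found by search.
for max_timesteps in (1.2, 1.4, 1.57080):
    image = pipe(
        prompt="a cute pink bear",
        num_inference_steps=2,
        max_timesteps=max_timesteps,
    ).images[0]
    image.save(f"sana_sprint_{max_timesteps}.png")
```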
docs/source/en/optimization/fp16.md

Lines changed: 24 additions & 0 deletions
@@ -150,6 +150,24 @@ pipeline(prompt, num_inference_steps=30).images[0]
 
 Compilation is slow the first time, but once compiled, it is significantly faster. Try to only use the compiled pipeline on the same type of inference operations. Calling the compiled pipeline on a different image size retriggers compilation which is slow and inefficient.
 
+### Regional compilation
+
+[Regional compilation](https://docs.pytorch.org/tutorials/recipes/regional_compilation.html) reduces the cold start compilation time by only compiling a specific repeated region (or block) of the model instead of the entire model. The compiler reuses the cached and compiled code for the other blocks.
+
+[Accelerate](https://huggingface.co/docs/accelerate/index) provides the [compile_regions](https://github.com/huggingface/accelerate/blob/273799c85d849a1954a4f2e65767216eb37fa089/src/accelerate/utils/other.py#L78) method for automatically compiling the repeated blocks of a `nn.Module` sequentially. The rest of the model is compiled separately.
+
+```py
+# pip install -U accelerate
+import torch
+from diffusers import StableDiffusionXLPipeline
+from accelerate.utils import compile_regions
+
+pipeline = StableDiffusionXLPipeline.from_pretrained(
+    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
+).to("cuda")
+pipeline.unet = compile_regions(pipeline.unet, mode="reduce-overhead", fullgraph=True)
+```
+
 ### Graph breaks
 
 It is important to specify `fullgraph=True` in torch.compile to ensure there are no graph breaks in the underlying model. This allows you to take advantage of torch.compile without any performance degradation. For the UNet and VAE, this changes how you access the return variables.
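As a hedged illustration of the return-variable change described in that last context line (not part of this commit's diff): with `fullgraph=True`, request tuple outputs instead of the output dataclass so the model call does not trigger a graph break. The variable names below are placeholders from a typical denoising loop.

```py
# Illustrative sketch (not part of this commit): avoid a graph break when the UNet is
# compiled with fullgraph=True. `pipeline`, `latents`, `timestep`, and `prompt_embeds`
# are placeholders assumed to come from an existing denoising loop.
compiled_unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

# Before: the call returns an output dataclass and is accessed via `.sample`.
# noise_pred = compiled_unet(latents, timestep, encoder_hidden_states=prompt_embeds).sample

# After: request a plain tuple with return_dict=False and index it instead.
noise_pred = compiled_unet(
    latents, timestep, encoder_hidden_states=prompt_embeds, return_dict=False
)[0]
```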
@@ -170,6 +188,12 @@ The `step()` function is [called](https://github.com/huggingface/diffusers/blob/
 
 In general, the `sigmas` should [stay on the CPU](https://github.com/huggingface/diffusers/blob/35a969d297cba69110d175ee79c59312b9f49e1e/src/diffusers/schedulers/scheduling_euler_discrete.py#L240) to avoid the communication sync and latency.
 
+### Benchmarks
+
+Refer to the [diffusers/benchmarks](https://huggingface.co/datasets/diffusers/benchmarks) dataset to see inference latency and memory usage data for compiled pipelines.
+
+The [diffusers-torchao](https://github.com/sayakpaul/diffusers-torchao#benchmarking-results) repository also contains benchmarking results for compiled versions of Flux and CogVideoX.
+
 ## Dynamic quantization
 
 [Dynamic quantization](https://pytorch.org/tutorials/recipes/recipes/dynamic_quantization.html) improves inference speed by reducing precision to enable faster math operations. This particular type of quantization determines how to scale the activations based on the data at runtime rather than using a fixed scaling factor. As a result, the scaling factor is more accurately aligned with the data.

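The dynamic-quantization paragraph in the hunk above describes the technique without showing code. A minimal sketch follows, not part of this commit; it assumes the `torchao` package is installed and uses its `quantize_` and `int8_dynamic_activation_int8_weight` APIs, with SDXL as an example checkpoint.

```py
# Illustrative sketch (not part of this commit): dynamic int8 quantization of the UNet
# with torchao. Assumes `pip install torchao` and a CUDA device.
import torch
from diffusers import StableDiffusionXLPipeline
from torchao.quantization import quantize_, int8_dynamic_activation_int8_weight

pipeline = StableDiffusionXLPipeline.from_pretrained(
    "stabilityai/stable-diffusion-xl-base-1.0", torch_dtype=torch.float16
).to("cuda")

# Activation scales are computed at runtime from the observed data; weights are int8.
quantize_(pipeline.unet, int8_dynamic_activation_int8_weight())

image = pipeline("a photo of an astronaut riding a horse", num_inference_steps=30).images[0]
```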
docs/source/en/optimization/tome.md

Lines changed: 1 addition & 1 deletion
@@ -93,4 +93,4 @@ To reproduce this benchmark, feel free to use this [script](https://gist.github.
 | | | 2 | OOM | 13 | 10.78 |
 | | | 1 | OOM | 6.66 | 5.54 |
 
-As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](torch2.0).
+As seen in the tables above, the speed-up from `tomesd` becomes more pronounced for larger image resolutions. It is also interesting to note that with `tomesd`, it is possible to run the pipeline on a higher resolution like 1024x1024. You may be able to speed-up inference even more with [`torch.compile`](fp16#torchcompile).

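A minimal sketch of the `tomesd` plus `torch.compile` combination referenced in the hunk above; it is not part of this commit and assumes the `tomesd` package with its `apply_patch` helper, a CUDA device, and `runwayml/stable-diffusion-v1-5` purely as an example checkpoint.

```py
# Illustrative sketch (not part of this commit): token merging via tomesd, optionally
# combined with torch.compile. Assumes `pip install tomesd` and a CUDA device.
import torch
import tomesd
from diffusers import StableDiffusionPipeline

pipeline = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

# Merge roughly 50% of redundant tokens in the attention layers to speed up inference.
tomesd.apply_patch(pipeline, ratio=0.5)

# Optionally compile the UNet for an additional speed-up, as the doc above suggests.
pipeline.unet = torch.compile(pipeline.unet, mode="reduce-overhead", fullgraph=True)

image = pipeline("a photo of an astronaut riding a horse on mars").images[0]
```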