
Commit 05325b0

Commit message: feedback
1 parent 6aec1fa commit 05325b0

File tree: 1 file changed (+32 -48 lines)

docs/source/en/quicktour.md

Lines changed: 32 additions & 48 deletions
@@ -12,24 +12,24 @@ specific language governing permissions and limitations under the License.
# Quickstart

-Diffusers is a library for developers and researchers that provides an easy inference API for generating images and videos as well as the building blocks for implementing new workflows.
+Diffusers is a library for developers and researchers that provides an easy inference API for generating images, videos, and audio, as well as the building blocks for implementing new workflows.

Diffusers provides many optimizations out-of-the-box that makes it possible to load and run large models on setups with limited memory or to accelerate inference.

This Quickstart will give you an overview of Diffusers and get you up and generating quickly.

> [!TIP]
-> Before you begin, make sure you have a Hugging Face [account](https://huggingface.co/join) in order to use models like [Flux](https://huggingface.co/black-forest-labs/FLUX.1-dev).
+> Before you begin, make sure you have a Hugging Face [account](https://huggingface.co/join) in order to use gated models like [Flux](https://huggingface.co/black-forest-labs/FLUX.1-dev).

Follow the [Installation](./installation) guide to install Diffusers if it's not already installed.

## DiffusionPipeline

-A diffusion model combines multiple components to generate outputs in any modality based on an input, such as a text description or image.
+A diffusion model combines multiple components to generate outputs in any modality based on an input, such as a text description, an image, or both.

For a standard text-to-image model:

-1. A text encoder turns a prompt into embeddings that guide the denoising process.
+1. A text encoder turns a prompt into embeddings that guide the denoising process. Some models have more than one text encoder.
2. A scheduler contains the algorithmic specifics for gradually denoising initial random noise into clean outputs. Different schedulers affect generation speed and quality.
3. A UNet or diffusion transformer (DiT) is the workhorse of a diffusion model.

@@ -39,7 +39,7 @@ For a standard text-to-image model:
4. A variational autoencoder (VAE) encodes and decodes pixels to a spatially compressed latent-space. *Latents* are compressed representations of an image and are more efficient to work with. The UNet or DiT operates on latents, and the clean latents at the end are decoded back into images.

-The [`DiffusionPipeline`] packages all these components into a single class for inference. There are several arguments in [`DiffusionPipeline`] you can change, such as `num_inference_steps`, that affect the diffusion process. Try different values and arguments to see how they change generation quality or speed.
+The [`DiffusionPipeline`] packages all these components into a single class for inference. There are several arguments in [`~DiffusionPipeline.__call__`] you can change, such as `num_inference_steps`, that affect the diffusion process. Try different values and arguments to see how they change generation quality or speed.

Load a model with [`~DiffusionPipeline.from_pretrained`] and describe what you'd like to generate. The example below uses the default argument values.

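As a minimal sketch of what this looks like in practice (illustrative values, and it repeats the loading step shown in the snippet below), `__call__` arguments such as `num_inference_steps` are passed directly when calling the pipeline:

```py
import torch
from diffusers import DiffusionPipeline

# Illustrative sketch: pass __call__ arguments explicitly instead of relying on defaults.
pipeline = DiffusionPipeline.from_pretrained(
    "Qwen/Qwen-Image", torch_dtype=torch.bfloat16
).to("cuda")

# Fewer steps run faster; more steps usually improve quality.
image = pipeline(
    "cinematic film still of a cat sipping a margarita in a pool in Palm Springs",
    num_inference_steps=30,
).images[0]
image.save("output.png")
```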
@@ -53,7 +53,7 @@ import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16
).to("cuda")

@@ -64,83 +64,67 @@ highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous
pipeline(prompt).images[0]
```

-<div class="flex justify-center">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quicktour-diffusion-pipeline.png">
-</div>
-
</hfoption>
<hfoption id="text-to-video">

Use `.frames[0]` to access the generated video output and [`~utils.export_to_video`] to save the video.

```py
import torch
-from diffusers import AutoModel, DiffusionPipeline
+from diffusers import AutoencoderKLWan, DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig
from diffusers.utils import export_to_video

-pipeline_quant_config = PipelineQuantizationConfig(
-    quant_backend="bitsandbytes_4bit",
-    quant_kwargs={
-        "load_in_4bit": True,
-        "bnb_4bit_quant_type": "nf4",
-        "bnb_4bit_compute_dtype": torch.bfloat16
-    },
-    components_to_quantize=["transformer"]
+vae = AutoencoderKLWan.from_pretrained(
+    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+    subfolder="vae",
+    torch_dtype=torch.float32
)
pipeline = DiffusionPipeline.from_pretrained(
-    "hunyuanvideo-community/HunyuanVideo",
-    quantization_config=pipeline_quant_config,
-    torch_dtype=torch.bfloat16,
+    "Wan-AI/Wan2.2-T2V-A14B-Diffusers",
+    vae=vae,
+    torch_dtype=torch.bfloat16,
).to("cuda")
-pipeline.enable_model_cpu_offload()
-pipeline.vae.enable_tiling()

prompt = """
Cinematic video of a sleek cat lounging on a colorful inflatable in a crystal-clear turquoise pool in Palm Springs,
sipping a salt-rimmed margarita through a straw. Golden-hour sunlight glows over mid-century modern homes and swaying palms.
Shot in rich Sony a7S III: with moody, glamorous color grading, subtle lens flares, and soft vintage film grain.
Ripples shimmer as a warm desert breeze stirs the water, blending luxury and playful charm in an epic, gorgeously composed frame.
"""
-video = pipeline(prompt=prompt, num_frames=61, num_inference_steps=30).frames[0]
-export_to_video(video, "output.mp4", fps=15)
+video = pipeline(prompt=prompt, num_frames=81, num_inference_steps=40).frames[0]
+export_to_video(video, "output.mp4", fps=16)
```

-<div class="flex justify-center">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quicktour-diffusion-pipeline-video.gif">
-</div>
-
</hfoption>
</hfoptions>

## LoRA

Adapters insert a small number of trainable parameters to the original base model. Only the inserted parameters are fine-tuned while the rest of the model weights remain frozen. This makes it fast and cheap to fine-tune a model on a new style. Among adapters, [LoRA's](./tutorials/using_peft_for_inference) are the most popular.

-Add a LoRA to a pipeline with the [`~loaders.FluxLoraLoaderMixin.load_lora_weights`] method. Some LoRA's require a special word to trigger it, such as `GHIBSKY style`, in the example below. Check a LoRA's model card to see if it requires a trigger word.
+Add a LoRA to a pipeline with the [`~loaders.QwenImageLoraLoaderMixin.load_lora_weights`] method. Some LoRAs require a special word to trigger them, such as `Realism` in the example below. Check a LoRA's model card to see if it requires a trigger word.

```py
import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16
)
-pipeline.load_lora_weights("aleksa-codes/flux-ghibsky-illustration")
+pipeline.load_lora_weights(
+    "flymy-ai/qwen-image-realism-lora",
+)
pipeline.to("cuda")

prompt = """
-GHIBSKY style cinematic film still of a cat sipping a margarita in a pool in Palm Springs in the style of umempart, California
+super Realism cinematic film still of a cat sipping a margarita in a pool in Palm Springs in the style of umempart, California
highly detailed, high budget hollywood movie, cinemascope, moody, epic, gorgeous, film grain
"""
pipeline(prompt).images[0]
```

-<div class="flex justify-center">
-  <img src="https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/quicktour-diffusion-pipeline-lora.png">
-</div>
-
Check out the [LoRA](./tutorials/using_peft_for_inference) docs or Adapters section to learn more.

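To compare against the base model afterwards, the adapter can be removed again; a minimal sketch, assuming the `pipeline` and `prompt` from the example above:

```py
# Minimal sketch: drop the loaded LoRA and rerun the same prompt on the base model.
pipeline.unload_lora_weights()
image = pipeline(prompt).images[0]
```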
## Quantization
@@ -157,12 +141,12 @@ from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

quant_config = PipelineQuantizationConfig(
-    quant_backend="bitsandbytes_4bit",
-    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
-    components_to_quantize=["transformer", "text_encoder_2"],
+    quant_backend="bitsandbytes_4bit",
+    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
+    components_to_quantize=["transformer", "text_encoder"],
)
pipeline = DiffusionPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
).to("cuda")
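To sanity-check what quantization saves on a given setup, peak GPU memory can be measured with standard PyTorch utilities; a minimal sketch (illustrative, assumes the quantized `pipeline` above):

```py
import torch

# Illustrative sketch: measure peak GPU memory for one generation with the quantized pipeline.
torch.cuda.reset_peak_memory_stats()
image = pipeline("cinematic film still of a cat sipping a margarita in a pool").images[0]
print(f"Peak GPU memory: {torch.cuda.max_memory_allocated() / 1024**3:.2f} GB")
```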
@@ -195,12 +179,12 @@ from diffusers import DiffusionPipeline
from diffusers.quantizers import PipelineQuantizationConfig

quant_config = PipelineQuantizationConfig(
-    quant_backend="bitsandbytes_4bit",
-    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
-    components_to_quantize=["transformer", "text_encoder_2"],
+    quant_backend="bitsandbytes_4bit",
+    quant_kwargs={"load_in_4bit": True, "bnb_4bit_quant_type": "nf4", "bnb_4bit_compute_dtype": torch.bfloat16},
+    components_to_quantize=["transformer", "text_encoder"],
)
pipeline = DiffusionPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16,
    quantization_config=quant_config,
).to("cuda")
@@ -229,12 +213,12 @@ import torch
from diffusers import DiffusionPipeline

pipeline = DiffusionPipeline.from_pretrained(
-    "black-forest-labs/FLUX.1-dev",
+    "Qwen/Qwen-Image",
    torch_dtype=torch.bfloat16
).to("cuda")

pipeline.transformer.compile_repeated_blocks(
-    fullgraph=True, dynamic=True
+    fullgraph=True,
)
prompt = """
cinematic film still of a cat sipping a margarita in a pool in Palm Springs, California

0 commit comments
