Releases: huggingface/diffusers
v0.30.3: CogVideoX Image-to-Video and Video-to-Video
This patch release adds Diffusers support for the upcoming CogVideoX-5B-I2V release (an Image-to-Video generation model)! The model weights will be available by the end of the week on the HF Hub at THUDM/CogVideoX-5b-I2V. Stay tuned for the release!
This release features two new pipelines:
- CogVideoXImageToVideoPipeline
- CogVideoXVideoToVideoPipeline
Additionally, we now have support for tiled encoding in the CogVideoX VAE. This can be enabled by calling the vae.enable_tiling() method, and it is used in the new Video-to-Video pipeline to encode sample videos to latents in a memory-efficient manner.
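For a rough intuition of why tiling saves memory, here is a minimal sketch in plain Python. It is not the CogVideoX VAE implementation: `encode_tile` is a hypothetical stand-in for the encoder (it only doubles each value), and the tile size is arbitrary. Because the stand-in works element by element, the tiled result matches the full result exactly. A real convolutional encoder would need overlapping tiles and blending at the seams, which this sketch leaves out.

```python
# Illustrative only: process a 2D "frame" tile by tile instead of all at once.
# `encode_tile` is a hypothetical stand-in for an expensive encoder call.

def encode_tile(tile):
    """Pretend encoder: works on a small tile (a list of rows)."""
    return [[2 * value for value in row] for row in tile]


def encode_tiled(frame, tile_size):
    """Encode `frame` in tile_size x tile_size pieces and stitch the results."""
    height, width = len(frame), len(frame[0])
    output = [[0] * width for _ in range(height)]
    for top in range(0, height, tile_size):
        for left in range(0, width, tile_size):
            # Only this tile is handed to the encoder at any one time.
            tile = [row[left:left + tile_size] for row in frame[top:top + tile_size]]
            encoded = encode_tile(tile)
            for dy, encoded_row in enumerate(encoded):
                output[top + dy][left:left + len(encoded_row)] = encoded_row
    return output


frame = [[x + 10 * y for x in range(6)] for y in range(5)]
tiled_result = encode_tiled(frame, tile_size=4)
full_result = encode_tile(frame)
```

The peak working set is one tile rather than the whole frame, which is the trade the tiled VAE makes: more encoder calls in exchange for a smaller memory footprint.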
CogVideoXImageToVideoPipeline
The code below demonstrates how to use the new image-to-video pipeline:
import torch
from diffusers import CogVideoXImageToVideoPipeline
from diffusers.utils import export_to_video, load_image
pipe = CogVideoXImageToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b-I2V", torch_dtype=torch.bfloat16)
pipe.to("cuda")
# Optionally, enable memory optimizations.
# If enabling CPU offloading, remember to remove `pipe.to("cuda")` above
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
prompt = "An astronaut hatching from an egg, on the surface of the moon, the darkness and depth of space realised in the background. High quality, ultrarealistic detail and breath-taking movie-like camera shot."
image = load_image(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/astronaut.jpg"
)
video = pipe(image, prompt, use_dynamic_cfg=True)
export_to_video(video.frames[0], "output.mp4", fps=8)
(Video: CogVideoXImageToVideoExample.mp4)
CogVideoXVideoToVideoPipeline
The code below demonstrates how to use the new video-to-video pipeline:
import torch
from diffusers import CogVideoXDPMScheduler, CogVideoXVideoToVideoPipeline
from diffusers.utils import export_to_video, load_video
# Models: "THUDM/CogVideoX-2b" or "THUDM/CogVideoX-5b"
pipe = CogVideoXVideoToVideoPipeline.from_pretrained("THUDM/CogVideoX-5b", torch_dtype=torch.bfloat16)
pipe.scheduler = CogVideoXDPMScheduler.from_config(pipe.scheduler.config)
pipe.to("cuda")
input_video = load_video(
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/hiker.mp4"
)
prompt = (
    "An astronaut stands triumphantly at the peak of a towering mountain. Panorama of rugged peaks and "
    "valleys. Very futuristic vibe and animated aesthetic. Highlights of purple and golden colors in "
    "the scene. The sky looks like an animated/cartoonish dream of galaxies, nebulae, stars, planets, "
    "moons, but the remainder of the scene is mostly realistic."
)
video = pipe(
    video=input_video, prompt=prompt, strength=0.8, guidance_scale=6, num_inference_steps=50
).frames[0]
export_to_video(video, "output.mp4", fps=8)
(Video: CogVideoXVideoToVideoExample.mp4)
Shoutout to @tin2tin for the awesome demonstration!
Refer to our documentation to learn more about it.
All commits
- [core] Support VideoToVideo with CogVideoX by @a-r-r-o-w in #9333
- [core] CogVideoX memory optimizations in VAE encode by @a-r-r-o-w in #9340
- [CI] Quick fix for Cog Video Test by @DN6 in #9373
- [refactor] move positional embeddings to patch embed layer for CogVideoX by @a-r-r-o-w in #9263
- CogVideoX-5b-I2V support by @zRzRzRzRzRzRzR in #9418
v0.30.2: Update from single file default repository
v0.30.1: CogVideoX-5B & Bug fixes
CogVideoX-5B
This patch release adds Diffusers support for the upcoming CogVideoX-5B release! The model weights will be available next week on the Hugging Face Hub at THUDM/CogVideoX-5b. Stay tuned for the release!
Additionally, we have implemented a VAE tiling feature, which reduces the memory requirement for CogVideoX models. With this update, the total memory requirement is now 12GB for CogVideoX-2B and 21GB for CogVideoX-5B (with CPU offloading). To enable this feature, simply call enable_tiling() on the VAE.
The code below shows how to generate a video with CogVideoX-5B:
import torch
from diffusers import CogVideoXPipeline
from diffusers.utils import export_to_video
prompt = "Tracking shot,late afternoon light casting long shadows,a cyclist in athletic gear pedaling down a scenic mountain road,winding path with trees and a lake in the background,invigorating and adventurous atmosphere."
pipe = CogVideoXPipeline.from_pretrained(
    "THUDM/CogVideoX-5b",
    torch_dtype=torch.bfloat16
)
pipe.enable_model_cpu_offload()
pipe.vae.enable_tiling()
video = pipe(
    prompt=prompt,
    num_videos_per_prompt=1,
    num_inference_steps=50,
    num_frames=49,
    guidance_scale=6,
).frames[0]
export_to_video(video, "output.mp4", fps=8)
(Video: 000000.mp4)
Refer to our documentation to learn more about it.
All commits
- Update Video Loading/Export to use imageio by @DN6 in #9094
- [refactor] CogVideoX followups + tiled decoding support by @a-r-r-o-w in #9150
- Add Learned PE selection for Auraflow by @cloneofsimo in #9182
- [Single File] Fix configuring scheduler via legacy kwargs by @DN6 in #9229
- [Flux LoRA] support parsing alpha from a flux lora state dict. by @sayakpaul in #9236
- [tests] fix broken xformers tests by @a-r-r-o-w in #9206
- Cogvideox-5B Model adapter change by @zRzRzRzRzRzRzR in #9203
- [Single File] Support loading Comfy UI Flux checkpoints by @DN6 in #9243
v0.30.0: New Pipelines (Flux, Stable Audio, Kolors, CogVideoX, Latte, and more), New Methods (FreeNoise, SparseCtrl), and New Refactors
New pipelines
Image taken from Lumina’s GitHub.
This release features many new pipelines. Below, we provide a list:
Audio pipelines 🎼
Video pipelines 📹
- Latte (thanks to @maxin-cn for the contribution through #8404)
- CogVideoX (thanks to @zRzRzRzRzRzRzR for the contribution through #9082)
Image pipelines 🎇
Be sure to check out the respective docs to know more about these pipelines. Some additional pointers are below for curious minds:
- Lumina introduces a new DiT architecture that is multilingual in nature.
- Kolors is inspired by SDXL and is also multilingual in nature.
- Flux introduces the largest (more than 12B parameters!) open-sourced DiT variant available to date. For efficient DreamBooth + LoRA training, we recommend @bghira’s guide here.
- We have worked on a guide that shows how to quantize these large pipelines for memory efficiency with optimum.quanto. Check it out here.
- CogVideoX introduces a novel and truly 3D VAE into Diffusers.
Perturbed Attention Guidance (PAG)
(Image comparison: without PAG vs. with PAG)
We already had community pipelines for PAG, but given its usefulness, we decided to make it a first-class citizen of the library. We have a central usage guide for PAG here, which should be the entry point for a user interested in understanding and using PAG for their use cases. We currently support the following pipelines with PAG:
- StableDiffusionPAGPipeline
- StableDiffusion3PAGPipeline
- StableDiffusionControlNetPAGPipeline
- StableDiffusionXLPAGPipeline
- StableDiffusionXLPAGImg2ImgPipeline
- StableDiffusionXLPAGInpaintPipeline
- StableDiffusionXLControlNetPAGPipeline
- PixArtSigmaPAGPipeline
- HunyuanDiTPAGPipeline
- AnimateDiffPAGPipeline
- KolorsPAGPipeline
If you’re interested in helping us extend our PAG support for other pipelines, please check out this thread.
Special thanks to Ahn Donghoon (@sunovivid), the author of PAG, for helping us with the integration and adding PAG support to SD3.
AnimateDiff with SparseCtrl
SparseCtrl introduces methods of controllability into text-to-video diffusion models leveraging signals such as line/edge sketches, depth maps, and RGB images by incorporating an additional condition encoder, inspired by ControlNet, to process these signals in the AnimateDiff framework. It can be applied to a diverse set of applications such as interpolation or video prediction (filling in the gaps between sequence of images for animation), personalized image animation, sketch-to-video, depth-to-video, and more. It was introduced in SparseCtrl: Adding Sparse Controls to Text-to-Video Diffusion Models.
There are two SparseCtrl-specific checkpoints and a Motion LoRA made available by the authors namely:
Scribble Interpolation Example:
import torch
from diffusers import AnimateDiffSparseControlNetPipeline, AutoencoderKL, MotionAdapter, SparseControlNetModel
from diffusers.schedulers import DPMSolverMultistepScheduler
from diffusers.utils import export_to_gif, load_image
device = "cuda"
motion_adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-3", torch_dtype=torch.float16).to(device)
controlnet = SparseControlNetModel.from_pretrained("guoyww/animatediff-sparsectrl-scribble", torch_dtype=torch.float16).to(device)
vae = AutoencoderKL.from_pretrained("stabilityai/sd-vae-ft-mse", torch_dtype=torch.float16).to(device)
pipe = AnimateDiffSparseControlNetPipeline.from_pretrained(
    "SG161222/Realistic_Vision_V5.1_noVAE",
    motion_adapter=motion_adapter,
    controlnet=controlnet,
    vae=vae,
    torch_dtype=torch.float16,
).to(device)
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config, beta_schedule="linear", algorithm_type="dpmsolver++", use_karras_sigmas=True)
pipe.load_lora_weights("guoyww/animatediff-motion-lora-v1-5-3", adapter_name="motion_lora")
pipe.fuse_lora(lora_scale=1.0)
prompt = "an aerial view of a cyberpunk city, night time, neon lights, masterpiece, high quality"
negative_prompt = "low quality, worst quality, letterboxed"
image_files = [
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-1.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-2.png",
    "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/diffusers/animatediff-scribble-3.png"
]
condition_frame_indices = [0, 8, 15]
conditioning_frames = [load_image(img_file) for img_file in image_files]
video = pipe(
    prompt=prompt,
    negative_prompt=negative_prompt,
    num_inference_steps=25,
    conditioning_frames=conditioning_frames,
    controlnet_conditioning_scale=1.0,
    controlnet_frame_indices=condition_frame_indices,
    generator=torch.Generator().manual_seed(1337),
).frames[0]
export_to_gif(video, "output.gif")
📜 Check out the docs here.
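To make the "sparse" part of SparseCtrl concrete, here is a small sketch of how three conditioning frames at indices 0, 8, and 15 could be spread over a clip. It is only an illustration: the function name, the 16-frame clip length, and the use of strings in place of images are assumptions, and the pipeline does this work internally in its own way.

```python
# Illustrative only: spread a few conditioning frames over a longer clip.

def build_sparse_conditioning(num_frames, conditioning_frames, frame_indices):
    """Return (per_frame_condition, mask) for a clip of `num_frames` frames.

    per_frame_condition[i] holds the conditioning frame for frame i, or None.
    mask[i] is 1 where a condition is present and 0 elsewhere.
    """
    if len(conditioning_frames) != len(frame_indices):
        raise ValueError("need exactly one index per conditioning frame")
    per_frame_condition = [None] * num_frames
    mask = [0] * num_frames
    for frame, index in zip(conditioning_frames, frame_indices):
        if not 0 <= index < num_frames:
            raise ValueError(f"index {index} is outside the clip")
        per_frame_condition[index] = frame
        mask[index] = 1
    return per_frame_condition, mask


# Strings stand in for the three scribble images used in the example above.
conditions, mask = build_sparse_conditioning(
    num_frames=16,  # assumed clip length
    conditioning_frames=["scribble-1", "scribble-2", "scribble-3"],
    frame_indices=[0, 8, 15],
)
```

Frames without a condition are left for the model to fill in, which is what makes interpolation between a handful of sketches possible.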
FreeNoise for AnimateDiff
FreeNoise is a training-free method that allows extending the generative capabilities of pretrained video diffusion models beyond their existing context/frame limits.
Instead of initializing noises for all frames, FreeNoise reschedules a sequence of noises for long-range correlation and performs temporal attention over them using a window-based function. We have added FreeNoise to the AnimateDiff family of models in Diffusers, allowing them to generate videos beyond their default 32 frame limit.
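The noise rescheduling idea can be sketched roughly as follows. This is a simplified illustration, not the Diffusers implementation: the function name, the 16-frame window, and the stride of 4 are assumptions, and strings stand in for per-frame noise tensors. The window-based temporal attention that FreeNoise also relies on is not shown.

```python
import random


def reschedule_noise(base_noise, num_frames, window_stride, seed=0):
    """Extend `base_noise` to `num_frames` by reusing locally shuffled chunks of it."""
    rng = random.Random(seed)
    window_length = len(base_noise)
    if not 0 < window_stride <= window_length:
        raise ValueError("window_stride must be between 1 and len(base_noise)")
    extended = list(base_noise)
    while len(extended) < num_frames:
        # Reuse the stride-sized chunk that sits one full window back ...
        source_start = len(extended) - window_length
        chunk = extended[source_start:source_start + window_stride]
        # ... but shuffle it, so new frames are correlated with old ones without repeating them in order.
        rng.shuffle(chunk)
        extended.extend(chunk)
    return extended[:num_frames]


base_noise = [f"noise-{i}" for i in range(16)]  # stand-ins for noise tensors
long_noise = reschedule_noise(base_noise, num_frames=64, window_stride=4)
```

Every frame past the first window reuses noise that already exists, so no fresh noise is sampled for the extra frames. That reuse is where the long-range correlation comes from.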
 
import torch
from diffusers import AnimateDiffPipeline, MotionAdapter, EulerAncestralDiscreteScheduler
from diffusers.utils import export_to_gif
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-v1-5-2", torch_dtype=torch.float16)
pipe = AnimateDiffPipeline.from_pretrained("SG161222/Realistic_Vision_V6.0_B1_noVAE", motion_adapter=adapter, torch_dtype=torch.float16)
pipe.scheduler = EulerAncestralDiscreteScheduler(
    beta_schedule="linear",
    beta_start=0.00085,
    beta_end=0.012,
)
pipe.enable_free_noise()
pipe.vae.enable_slicing()
pipe.enable_model_cpu_offload()
frames = pipe(
    "An astronaut riding a horse on Mars.",
    num_frames=64,
    num_inference_steps=20,
    guidance_scale=7.0,
    decode_chunk_size=2,
).frames[0]
export_to_gif(frames, "freenoise-64.gif")
LoRA refactor
We have significantly refactored the loader classes associated with LoRA. Going forward, this will help in adding LoRA support for new pipelines and models. We now have a LoraBaseMixin class which is subclassed by the different pipeline-level LoRA loading classes such as StableDiffusionXLLoraLoaderMixin. This document provides an overview of the available classes.
Additionally, we have increased the coverage of methods within the PeftAdapterMixin class. This refactoring allows all the supported models to share common LoRA functionalities such as set_adapter(), add_adapter(), and so on.
To learn more details, please follow this PR. If you see any LoRA-related iss...
v0.29.2: fix deprecation and LoRA bugs 🐞
All commits
- [SD3] Fix mis-matched shape when num_images_per_prompt > 1 using without T5 (text_encoder_3=None) by @Dalanke in #8558
- [LoRA] refactor lora conversion utility. by @sayakpaul in #8295
- [LoRA] fix conversion utility so that lora dora loads correctly by @sayakpaul in #8688
- [Chore] remove deprecation from transformer2d regarding the output class. by @sayakpaul in #8698
- [LoRA] fix vanilla fine-tuned lora loading. by @sayakpaul in #8691
- Release: v0.29.2 by @sayakpaul (direct commit on v0.29.2-patch)
v0.29.1: SD3 ControlNet, Expanded SD3 `from_single_file` support, Using long Prompts with T5 Text Encoder & Bug fixes
SD3 ControlNet
 
import torch
from diffusers import StableDiffusion3ControlNetPipeline
from diffusers.models import SD3ControlNetModel, SD3MultiControlNetModel
from diffusers.utils import load_image
controlnet = SD3ControlNetModel.from_pretrained("InstantX/SD3-Controlnet-Canny", torch_dtype=torch.float16)
pipe = StableDiffusion3ControlNetPipeline.from_pretrained(
    "stabilityai/stable-diffusion-3-medium-diffusers", controlnet=controlnet, torch_dtype=torch.float16
)
pipe.to("cuda")
control_image = load_image("https://huggingface.co/InstantX/SD3-Controlnet-Canny/resolve/main/canny.jpg")
prompt = "A girl holding a sign that says InstantX"
image = pipe(prompt, control_image=control_image, controlnet_conditioning_scale=0.7).images[0]
image.save("sd3.png")
📜 Refer to the official docs here to learn more about it.
Thanks to @haofanwang @wangqixun from the @ResearcherXman team for contributing this pipeline!
Expanded single file support
We now support all available single-file checkpoints for SD3 in diffusers! To load the single-file checkpoint with T5:
import torch
from diffusers import StableDiffusion3Pipeline
pipe = StableDiffusion3Pipeline.from_single_file(
    "https://huggingface.co/stabilityai/stable-diffusion-3-medium/blob/main/sd3_medium_incl_clips_t5xxlfp8.safetensors",
    torch_dtype=torch.float16,
)
pipe.enable_model_cpu_offload()
image = pipe("a picture of a cat holding a sign that says hello world").images[0]
image.save('sd3-single-file-t5-fp8.png')
Using Long Prompts with the T5 Text Encoder
We increased the default sequence length for the T5 Text Encoder from a maximum of 77 to 256!  It can be adjusted to accept fewer or more tokens by setting the max_sequence_length to a maximum of 512. Keep in mind that longer sequences require additional resources and will result in longer generation times. This effect is particularly noticeable during batch inference.
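As a rough illustration of what the sequence length limit means for a long prompt, the sketch below truncates or pads a token list to a fixed length. It is an assumption-heavy stand-in: `fit_to_sequence_length` and the `<pad>` token are hypothetical, and the real T5 tokenizer works on subword pieces and special tokens rather than the simple placeholder tokens used here.

```python
# Illustrative only: fixed-length truncation and padding of a token list.

def fit_to_sequence_length(tokens, max_sequence_length, pad_token="<pad>"):
    """Truncate or right-pad `tokens` to exactly `max_sequence_length` entries."""
    kept = tokens[:max_sequence_length]
    padding = [pad_token] * (max_sequence_length - len(kept))
    return kept + padding


prompt_tokens = [f"tok{i}" for i in range(120)]  # a 120-token prompt
short = fit_to_sequence_length(prompt_tokens, 77)
default = fit_to_sequence_length(prompt_tokens, 256)

# Tokens past the limit never reach the text encoder.
dropped_at_77 = len(prompt_tokens) - sum(token != "<pad>" for token in short)
```

With a limit of 77, the last 43 tokens of this prompt are cut off. With 256, the whole prompt fits, at the cost of the encoder processing a longer sequence.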
prompt = "A whimsical and creative image depicting a hybrid creature that is a mix of a waffle and a hippopotamus. This imaginative creature features the distinctive, bulky body of a hippo, but with a texture and appearance resembling a golden-brown, crispy waffle. The creature might have elements like waffle squares across its skin and a syrup-like sheen. It’s set in a surreal environment that playfully combines a natural water habitat of a hippo with elements of a breakfast table setting, possibly including oversized utensils or plates in the background. The image should evoke a sense of playful absurdity and culinary fantasy."
image = pipe(
    prompt=prompt,
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=4.5,
    max_sequence_length=512,
).images[0]
(Image comparison: before | max_sequence_length=256 | max_sequence_length=512)
All commits
- Release: v0.29.0 by @sayakpaul (direct commit on v0.29.1-patch)
- prepare for patch release by @yiyixuxu (direct commit on v0.29.1-patch)
- fix warning log for Transformer SD3 by @sayakpaul in #8496
- Add SD3 AutoPipeline mappings by @Beinsezii in #8489
- Add Hunyuan AutoPipe mapping by @Beinsezii in #8505
- Expand Single File support in SD3 Pipeline by @DN6 in #8517
- [Single File Loading] Handle unexpected keys in CLIP models when accelerate isn't installed. by @DN6 in #8462
- Fix sharding when no device_map is passed by @SunMarc in #8531
- [SD3 Inference] T5 Token limit by @asomoza in #8506
- Fix gradient checkpointing issue for Stable Diffusion 3 by @Carolinabanana in #8542
- Support SD3 ControlNet and Multi-ControlNet. by @wangqixun in #8566
- fix from_single_file for checkpoints with t5 by @yiyixuxu in #8631
- [SD3] Fix mis-matched shape when num_images_per_prompt > 1 using without T5 (text_encoder_3=None) by @Dalanke in #8558
Significant community contributions
The following contributors have made significant changes to the library over the last release:
- @wangqixun
  - Support SD3 ControlNet and Multi-ControlNet. (#8566)
 
v0.29.0: Stable Diffusion 3
This release emphasizes Stable Diffusion 3, Stability AI’s latest iteration of the Stable Diffusion family of models. It was introduced in Scaling Rectified Flow Transformers for High-Resolution Image Synthesis by Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, Dustin Podell, Tim Dockhorn, Zion English, Kyle Lacey, Alex Goodwin, Yannik Marek, and Robin Rombach.
As the model is gated, before using it with diffusers, you first need to go to the Stable Diffusion 3 Medium Hugging Face page, fill in the form and accept the gate. Once you are in, you need to log in so that your system knows you’ve accepted the gate.
huggingface-cli login
The code below shows how to perform text-to-image generation with SD3:
import torch
from diffusers import StableDiffusion3Pipeline
pipe = StableDiffusion3Pipeline.from_pretrained("stabilityai/stable-diffusion-3-medium-diffusers", torch_dtype=torch.float16)
pipe = pipe.to("cuda")
image = pipe(
    "A cat holding a sign that says hello world",
    negative_prompt="",
    num_inference_steps=28,
    guidance_scale=7.0,
).images[0]
image
Refer to our documentation to learn about all the optimizations you can apply to SD3, as well as the image-to-image pipeline.
Additionally, we support DreamBooth + LoRA fine-tuning of Stable Diffusion 3 through rectified flow. Check out this directory for more details.
v0.28.2: fix `from_single_file` clip model checkpoint key error 🐞
v0.28.1: HunyuanDiT and Transformer2D model class variants
This patch release primarily introduces the Hunyuan DiT pipeline from the Tencent team.
Hunyuan DiT
Hunyuan DiT is a transformer-based diffusion pipeline, introduced in the Hunyuan-DiT: A Powerful Multi-Resolution Diffusion Transformer with Fine-Grained Chinese Understanding paper by the Tencent Hunyuan team.
import torch
from diffusers import HunyuanDiTPipeline
pipe = HunyuanDiTPipeline.from_pretrained(
    "Tencent-Hunyuan/HunyuanDiT-Diffusers", torch_dtype=torch.float16
)
pipe.to("cuda")
# You may also use English prompt as HunyuanDiT supports both English and Chinese
# prompt = "An astronaut riding a horse"
prompt = "一个宇航员在骑马"
image = pipe(prompt).images[0]
🧠 This pipeline has support for multi-linguality.
📜 Refer to the official docs here to learn more about it.
Thanks to @gnobitab, for contributing Hunyuan DiT in #8240.
All commits
- Release: v0.28.0 by @sayakpaul (direct commit on v0.28.1-patch)
- [Core] Introduce class variants for Transformer2DModel by @sayakpaul in #7647
- resolve comflicts by @toshas (direct commit on v0.28.1-patch)
- Tencent Hunyuan Team: add HunyuanDiT related updates by @gnobitab in #8240
- Tencent Hunyuan Team - Updated Doc for HunyuanDiT by @gnobitab in #8383
- [Transformer2DModel] Handle norm_type safely while remapping by @sayakpaul in #8370
- Release: v0.28.1 by @sayakpaul (direct commit on v0.28.1-patch)
Significant community contributions
The following contributors have made significant changes to the library over the last release:
v0.28.0: Marigold, PixArt Sigma, AnimateDiff SDXL, InstantStyle, VQGAN Training Script, and more
Diffusion models are known for their abilities in the space of generative modeling. This release of diffusers introduces the first official pipeline (Marigold) for discriminative tasks such as depth estimation and surface normals’ estimation!
Starting this release, we will also highlight the changes and features from the library that make it easy to integrate community checkpoints, features, and so on. Read on!
Marigold
Proposed in Marigold: Repurposing Diffusion-Based Image Generators for Monocular Depth Estimation, Marigold introduces a diffusion model and associated fine-tuning protocol for monocular depth estimation. It can also be extended to perform surface normals’ estimation.
(Image taken from the official repository)
The code snippet below shows how to use this pipeline for depth estimation:
import diffusers
import torch
pipe = diffusers.MarigoldDepthPipeline.from_pretrained(
    "prs-eth/marigold-depth-lcm-v1-0", variant="fp16", torch_dtype=torch.float16
).to("cuda")
image = diffusers.utils.load_image("https://marigoldmonodepth.github.io/images/einstein.jpg")
depth = pipe(image)
vis = pipe.image_processor.visualize_depth(depth.prediction)
vis[0].save("einstein_depth.png")
depth_16bit = pipe.image_processor.export_depth_to_16bit_png(depth.prediction)
depth_16bit[0].save("einstein_depth_16bit.png")
Check out the API documentation here. We also have a detailed guide about the pipeline here.
Thanks to @toshas, one of the authors of Marigold, who contributed this in #7847.
🌀 Massive Refactor of from_single_file 🌀
We have further refactored from_single_file to align its logic more closely to the from_pretrained method. The biggest benefit of doing this is that it allows us to expand single file loading support beyond Stable Diffusion-like pipelines and models. It also makes it easier to load models that are saved and shared in their original format.
Some of the changes introduced in this refactor:
- When loading a single file checkpoint, we will attempt to use the keys present in the checkpoint to infer a model repository on the Hugging Face Hub that we can use to configure the pipeline. For example, if you are using a single file checkpoint based on SD 1.5, we would use the configuration files in the runwayml/stable-diffusion-v1-5 repository to configure the model components and pipeline.
- Suppose this inferred configuration isn’t appropriate for your checkpoint. In that case, you can override it using the config argument and pass in either a path to a local model repo or a repo id on the Hugging Face Hub.
pipe = StableDiffusionPipeline.from_single_file("...", config=<model repo id or local repo path>)
- Deprecation of model configuration arguments for the from_single_file method in Pipelines, such as num_in_channels, scheduler_type, image_size and upcast_attention. This is an anti-pattern that we have supported in previous versions of the library when we assumed that it would only be relevant to Stable Diffusion based models. However, given that there is a demand to support other model types, we feel it is necessary for single-file loading behavior to adhere to the conventions set in our other loading methods. Configuring individual model components through a pipeline loading method is not something we support in from_pretrained, and therefore, we will be deprecating support for this behavior in from_single_file as well.
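The precedence described above (an explicit config wins, otherwise the repository is inferred from checkpoint keys) can be sketched as follows. Everything in this block is hypothetical: the function, the key prefixes, and the second repo id are made up for illustration, and the actual inference logic in the library is more involved.

```python
# Illustrative only. The key prefixes below are placeholders, not real checkpoint keys.
KEY_PREFIX_TO_REPO = {
    "example_sd15_prefix.": "runwayml/stable-diffusion-v1-5",
    "example_other_prefix.": "example-org/other-model",  # made-up repo id
}


def resolve_config_source(checkpoint_keys, config=None):
    """Pick where the pipeline configuration should come from."""
    # An explicit `config` (repo id or local path) always takes precedence.
    if config is not None:
        return config
    # Otherwise, guess a repository from the keys found in the checkpoint.
    for prefix, repo_id in KEY_PREFIX_TO_REPO.items():
        if any(key.startswith(prefix) for key in checkpoint_keys):
            return repo_id
    raise ValueError("could not infer a config; pass `config` explicitly")


inferred = resolve_config_source(["example_sd15_prefix.layer.weight"])
overridden = resolve_config_source(
    ["example_sd15_prefix.layer.weight"], config="./my-local-model-repo"
)
```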
PixArt Sigma
PixArt Sigma is the successor to PixArt Alpha. PixArt Sigma is capable of directly generating images at 4K resolution. It can also produce images of markedly higher fidelity and improved alignment with text prompts. It comes with a massive sequence length of 300 (for reference, PixArt Alpha has a maximum sequence length of 120)!
import torch
from diffusers import PixArtSigmaPipeline
# You can replace the checkpoint id with "PixArt-alpha/PixArt-Sigma-XL-2-512-MS" too.
pipe = PixArtSigmaPipeline.from_pretrained(
    "PixArt-alpha/PixArt-Sigma-XL-2-1024-MS", torch_dtype=torch.float16
)
# Enable memory optimizations.
pipe.enable_model_cpu_offload()
prompt = "A small cactus with a happy face in the Sahara desert."
image = pipe(prompt).images[0]
📃 Refer to the documentation here to learn more about PixArt Sigma.
Thanks to @lawrence-cj, one of the authors of PixArt Sigma, who contributed this in #7857.
AnimateDiff SDXL
@a-r-r-o-w contributed the Stable Diffusion XL (SDXL) version of AnimateDiff in #6721. However, note that this is currently an experimental feature, as only a beta release of the motion adapter checkpoint is available.
import torch
from diffusers.models import MotionAdapter
from diffusers import AnimateDiffSDXLPipeline, DDIMScheduler
from diffusers.utils import export_to_gif
adapter = MotionAdapter.from_pretrained("guoyww/animatediff-motion-adapter-sdxl-beta", torch_dtype=torch.float16)
model_id = "stabilityai/stable-diffusion-xl-base-1.0"
scheduler = DDIMScheduler.from_pretrained(
    model_id,
    subfolder="scheduler",
    clip_sample=False,
    beta_schedule="linear",
    steps_offset=1,
)
pipe = AnimateDiffSDXLPipeline.from_pretrained(
    model_id,
    motion_adapter=adapter,
    scheduler=scheduler,
    torch_dtype=torch.float16,
    variant="fp16",
)
# enable_model_cpu_offload() returns None, so call it on its own line
pipe.enable_model_cpu_offload()
# enable memory savings
pipe.enable_vae_slicing()
pipe.enable_vae_tiling()
output = pipe(
    prompt="a panda surfing in the ocean, realistic, high quality",
    negative_prompt="low quality, worst quality",
    num_inference_steps=20,
    guidance_scale=8,
    width=1024,
    height=1024,
    num_frames=16,
)
frames = output.frames[0]
export_to_gif(frames, "animation.gif")
📜 Refer to the documentation to learn more.
Block-wise LoRA
@UmerHA contributed support for controlling the scales of different LoRA blocks in a granular manner in #7352. Depending on the LoRA checkpoint one is using, this granular control can significantly impact the quality of the generated outputs. The following code block shows how this feature can be used while performing inference:
...
adapter_weight_scales = { "unet": { "down": 0, "mid": 1, "up": 0} }
pipe.set_adapters("pixel", adapter_weight_scales)
image = pipe(
    prompt, num_inference_steps=30, generator=torch.manual_seed(0)
).images[0]
✍️ Refer to our documentation for more details and a full-fledged example.
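To show what a block-level scale dictionary means for individual layers, here is a small sketch that expands `{"down": 0, "mid": 1, "up": 0}` into per-layer scales. It is an illustration under assumptions: `expand_block_scales` is a hypothetical helper, the layer names only imitate UNet-style naming, and the default of 1.0 for unlisted blocks is an assumption rather than a documented guarantee.

```python
# Illustrative only: map block-level LoRA scales onto individual layer names.

def expand_block_scales(block_scales, layer_names, default_scale=1.0):
    """Give each layer the scale of the block ("down", "mid", "up") it belongs to."""
    per_layer = {}
    for name in layer_names:
        block = name.split("_", 1)[0]  # "down_blocks.1..." -> "down"
        per_layer[name] = block_scales.get(block, default_scale)
    return per_layer


layer_names = [
    "down_blocks.1.attentions.0",
    "mid_block.attentions.0",
    "up_blocks.0.attentions.1",
]
per_layer = expand_block_scales({"down": 0, "mid": 1, "up": 0}, layer_names)
```

With these scales, the LoRA only affects the middle block, which is the effect the `adapter_weight_scales` example above asks for.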
InstantStyle
More granular control of scale could be extended to IP-Adapters too. @DannHuang contributed to the support of InstantStyle, aka granular control of IP-Adapter scales, in #7668. The following code block shows how this feature could be used when performing inference with IP-Adapters:
...
scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}
pipeline.set_ip_adapter_scale(scale)
This way, one can generate images following only the style or layout from the image prompt, with significantly improved diversity. This is achieved by only activating IP-Adapters to specific parts of the model.
Check out the documentation here.
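One way to read the nested scale dictionary is as a lookup from (part, block, layer index) to a scale. The sketch below is a hypothetical helper, not library code. It assumes that any layer not listed gets a scale of 0.0, in line with the description that IP-Adapters are only activated in specific parts of the model.

```python
# Illustrative only: look up the IP-Adapter scale for one attention layer.

def ip_adapter_scale_for(scale, part, block, layer_index, default=0.0):
    """Return the scale for e.g. ("down", "block_2", 1), or `default` if unlisted."""
    layer_scales = scale.get(part, {}).get(block)
    if layer_scales is None or layer_index >= len(layer_scales):
        return default
    return layer_scales[layer_index]


scale = {
    "down": {"block_2": [0.0, 1.0]},
    "up": {"block_0": [0.0, 1.0, 0.0]},
}

active_down = ip_adapter_scale_for(scale, "down", "block_2", 1)
inactive_up = ip_adapter_scale_for(scale, "up", "block_0", 0)
unlisted = ip_adapter_scale_for(scale, "mid", "block_0", 0)
```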
ControlNetXS
ControlNet-XS was introduced in ControlNet-XS by Denis Zavadski and Carsten Rother. It is based on the observation that the control model in the original ControlNet can be made much smaller and still produce good results. ControlNet-XS generates images comparable to a regular ControlNet, but it is 20-25% faster (see benchmark with StableDiffusion-XL) and uses ~45% less memory.
ControlNet-XS is supported for both Stable Diffusion and Stable Diffusion XL.
Thanks to @UmerHA for contributing ControlNet-XS in #5827 and #6772.
Custom Timesteps
We introduced custom timesteps support for some of our pipelines and schedulers. You can now set your scheduler with a list of arbitrary timesteps. For example, you can use the AYS timesteps schedule to achieve very nice results with only 10 denoising steps.
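A custom schedule is just a short list of timesteps, visited from high noise to low noise. The sketch below checks the two properties such a list would typically need: every value lies inside the training range, and the values strictly decrease. The helper name, the range of 1000 training timesteps, and the ten example values are assumptions for illustration. The values are placeholders and not the actual AYS schedule.

```python
# Illustrative only: sanity-check a hand-written timestep schedule.

def is_valid_schedule(timesteps, num_train_timesteps=1000):
    """True if `timesteps` is non-empty, in range, and strictly descending."""
    if len(timesteps) == 0:
        return False
    in_range = all(0 <= t < num_train_timesteps for t in timesteps)
    descending = all(a > b for a, b in zip(timesteps, timesteps[1:]))
    return in_range and descending


# Placeholder values for a 10-step schedule (not the real AYS numbers).
example_schedule = [999, 800, 600, 450, 300, 200, 120, 60, 25, 5]
schedule_ok = is_valid_schedule(example_schedule)
```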
from diffusers.schedulers import AysSchedules
sampling_schedule = AysSchedules["StableDiffusionXLTimesteps"]
pipe = StableDiffusionXLPipeline.from_pretrained(
    "SG16...



