docs: introduce cache-dit to diffusers #12351
Conversation
Thanks for the very comprehensive PR description. Based on the results, I think it might be cool to support this caching technique more natively in Diffusers. We already have a few, documented here: https://huggingface.co/docs/diffusers/main/en/optimization/cache. Would you be willing to contribute that?
@sayakpaul yes, my pleasure~ However, I don't think this is something that can be accomplished in the short term. Many designs in cache-dit (such as BlockAdapter) may not be directly integrable into Diffusers. Meanwhile, cache-dit incorporates various hybrid strategies for cache algorithms and will continuously integrate new ones such as TaylorSeer and FoCa, which may also make it difficult to stay fully in sync with Diffusers. In the short term, I believe it would be more appropriate to add cache-dit to the Community Optimization section.
Sure, up to you. But having the integration natively supported also helps with adoption, testing, and maintenance.
I will submit a new PR to support DBCache + TaylorSeer instead of the entire cache-dit, which I think will be easier to implement. I can draw on some references, such as https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/first_block_cache.py and https://github.com/huggingface/diffusers/blob/main/src/diffusers/hooks/faster_cache.py. Another way would be to integrate the cache-dit codebase into diffusers, but that may be quite inelegant and would also involve a great deal of modification.
Let's hold off for now. The cache-dit team is simplifying the API, and we'll resubmit the PR once that's done. |
No problem but please give us a heads up once you feel it's ready :) More than happy to revive that. Also, I am happy to help with reviewing cache-dit if you think that would be beneficial :) |
Hi~, I'm the maintainer of cache-dit. I'd like to introduce cache-dit: a Unified, Flexible, and Training-free Cache Acceleration Framework for 🤗Diffusers. 🎉cache-dit now covers almost all of Diffusers' DiT pipelines🎉. I believe this is the first cache acceleration system in the community that fully supports 🤗 Diffusers. For more information, please refer to the details below.
@a-r-r-o-w @sayakpaul
A Unified, Flexible and Training-free Cache Acceleration Framework for 🤗Diffusers
♥️ Cache Acceleration with One-line Code ~ ♥️
📚Unified Cache APIs | 📚Forward Pattern Matching | 📚Automatic Block Adapter
📚Hybrid Forward Pattern | 📚DBCache | 📚Hybrid TaylorSeer | 📚Cache CFG
📚Text2Image DrawBench | 📚Text2Image Distillation DrawBench
🎉Now, cache-dit covers almost All Diffusers' DiT Pipelines🎉
🔥Qwen-Image | FLUX.1 | Qwen-Image-Lightning | Wan 2.1 | Wan 2.2 🔥
🔥HunyuanImage-2.1 | HunyuanVideo | HunyuanDiT | HiDream | AuraFlow🔥
🔥CogView3Plus | CogView4 | LTXVideo | CogVideoX | CogVideoX 1.5 | ConsisID🔥
🔥Cosmos | SkyReelsV2 | VisualCloze | OmniGen 1/2 | Lumina 1/2 | PixArt🔥
🔥Chroma | Sana | Allegro | Mochi | SD 3/3.5 | Amused | ... | DiT-XL🔥
🔥Wan2.2 MoE | +cache-dit:2.0x↑🎉 | HunyuanVideo | +cache-dit:2.1x↑🎉
🔥Qwen-Image | +cache-dit:1.8x↑🎉 | FLUX.1-dev | +cache-dit:2.1x↑🎉
🔥FLUX-Kontext-dev | Baseline | +cache-dit:1.3x↑🎉 | 1.7x↑🎉 | 2.0x↑ 🎉
🔥Qwen...Lightning | +cache-dit:1.14x↑🎉 | HunyuanImage | +cache-dit:1.7x↑🎉
🔥Qwen-Image-Edit | Input w/o Edit | Baseline | +cache-dit:1.6x↑🎉 | 1.9x↑🎉
🔥HiDream-I1 | +cache-dit:1.9x↑🎉 | CogView4 | +cache-dit:1.4x↑🎉 | 1.7x↑🎉
🔥CogView3 | +cache-dit:1.5x↑🎉 | 2.0x↑🎉| Chroma1-HD | +cache-dit:1.9x↑🎉
🔥Mochi-1-preview | +cache-dit:1.8x↑🎉 | SkyReelsV2 | +cache-dit:1.6x↑🎉
🔥VisualCloze-512 | Model | Cloth | Baseline | +cache-dit:1.4x↑🎉 | 1.7x↑🎉
🔥LTX-Video-0.9.7 | +cache-dit:1.7x↑🎉 | CogVideoX1.5 | +cache-dit:2.0x↑🎉
🔥OmniGen-v1 | +cache-dit:1.5x↑🎉 | 3.3x↑🎉 | Lumina2 | +cache-dit:1.9x↑🎉
🔥Allegro | +cache-dit:1.36x↑🎉 | AuraFlow-v0.3 | +cache-dit:2.27x↑🎉
🔥Sana | +cache-dit:1.3x↑🎉 | 1.6x↑🎉| PixArt-Sigma | +cache-dit:2.3x↑🎉
🔥PixArt-Alpha | +cache-dit:1.6x↑🎉 | 1.8x↑🎉| SD 3.5 | +cache-dit:2.5x↑🎉
🔥Amused | +cache-dit:1.1x↑🎉 | 1.2x↑🎉 | DiT-XL-256 | +cache-dit:1.8x↑🎉
📖Contents
⚙️Installation
You can install the stable release of `cache-dit` from PyPI, or the latest development version from GitHub:
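Both install paths can be sketched as follows (the PyPI package name is assumed to match the project name, and the GitHub repo URL is an assumption; adjust to the actual upstream):

```shell
# Stable release from PyPI (package name assumed to match the project name)
pip install -U cache-dit

# Or the latest development version straight from GitHub
# (repo URL is assumed; adjust to the actual upstream)
pip install "git+https://github.com/vipshop/cache-dit.git"
```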
🔥Supported Pipelines
Currently, the cache-dit library supports almost any Diffusion Transformer (with transformer blocks that match the specific input and output patterns). Please check 🎉Examples for more details. Listed here are just some of the tested models.
🔥Benchmarks
cache-dit will support more mainstream cache acceleration algorithms in the future. More benchmarks will be released; please stay tuned for updates. Only some precision and performance benchmark results are presented here. The test dataset is DrawBench. For the complete benchmark, please refer to 📚Benchmarks.
📚Text2Image DrawBench: FLUX.1-dev
Comparisons between different FnBn compute block configurations show that more compute blocks result in higher precision. For example, the F8B0_W8MC0 configuration achieves the best Clip Score (33.007) and ImageReward (1.0333). Device: NVIDIA L20. F: Fn_compute_blocks, B: Bn_compute_blocks, 50 steps.
The comparison between cache-dit: DBCache and algorithms such as Δ-DiT, Chipmunk, FORA, DuCa, TaylorSeer, and FoCa is shown below. Among methods with a speedup ratio below 3x, cache-dit achieves the best accuracy. Please check 📚How to Reproduce? for more details.
NOTE: Except for DBCache, all performance data are taken from the FoCa paper, arXiv:2508.16211.
📚Text2Image Distillation DrawBench: Qwen-Image-Lightning
Surprisingly, cache-dit: DBCache still works for extremely few-step distilled models. For example, on Qwen-Image-Lightning with 4 steps and the F16B16 configuration, the PSNR is 34.8163, the Clip Score is 35.6109, and the ImageReward is 1.2614: it maintains relatively high precision.
🎉Unified Cache APIs
📚Forward Pattern Matching
Currently, for any Diffusion model with transformer blocks that match the specific input/output patterns, you can use the Unified Cache APIs from cache-dit, namely the `cache_dit.enable_cache(...)` API. The Unified Cache APIs are currently in an experimental phase; please stay tuned for updates. The supported patterns are listed as follows:
In most cases, you only need to call one line of code, that is, `cache_dit.enable_cache(...)`. After this API is called, you just call the pipe as normal. The `pipe` param can be any Diffusion Pipeline. Please refer to Qwen-Image as an example.
🔥Automatic Block Adapter
However, in some cases you may have a modified Diffusion Pipeline or Transformer that is not in the diffusers library or not yet officially supported by cache-dit. The BlockAdapter can help you solve this problem. Please refer to 🔥Qwen-Image w/ BlockAdapter as an example.
For such situations, BlockAdapter can help you quickly apply various cache acceleration features to your own Diffusion Pipelines and Transformers. Please check the 📚BlockAdapter.md for more details.
📚Hybrid Forward Pattern
Sometimes, a Transformer class contains more than one set of transformer blocks. For example, FLUX.1 (and HiDream, Chroma, etc.) contains `transformer_blocks` and `single_transformer_blocks` (with different forward patterns). The BlockAdapter can also help you solve this problem. Please refer to 📚FLUX.1 as an example.
You may even have more complex cases, such as Wan 2.2 MoE, which has more than one Transformer (namely `transformer` and `transformer_2`) in its structure. Fortunately, cache-dit can handle this situation very well. Please refer to 📚Wan 2.2 MoE as an example.
📚Implement Patch Functor
For any PATTERN not in {0...5}, we introduce the simple abstract concept of a Patch Functor. Users can implement a subclass of Patch Functor to convert an unknown pattern into a known PATTERN; for some models, users may also need to fuse the operations inside the blocks for-loop into the block's forward.
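The Patch Functor idea can be sketched abstractly: a small object that rewrites a model whose blocks do not match any known pattern into one that does. Everything below is a hypothetical shape for illustration only, not the cache-dit API:

```python
# Hypothetical sketch of the Patch Functor concept; all names are illustrative.
class PatchFunctor:
    """Convert a model with an unknown block pattern into a known one."""
    def apply(self, transformer):
        raise NotImplementedError

class FuseLoopPatchFunctor(PatchFunctor):
    """Fuse an operation previously applied in the outer for-loop
    (here a hypothetical `post_op`) into each block's forward, so the
    blocks match a known input/output pattern."""
    def apply(self, transformer):
        for block in transformer.blocks:
            original_forward = block.forward
            def fused_forward(x, _orig=original_forward, _blk=block):
                hidden = _orig(x)
                # Operation previously applied outside the block:
                return _blk.post_op(hidden)
            block.forward = fused_forward
        return transformer
```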
Some Patch Functors have already been provided in cache-dit: 📚HiDreamPatchFunctor, 📚ChromaPatchFunctor, etc. After implementing a Patch Functor, users need to set the `patch_functor` property of the BlockAdapter.
🤖Cache Acceleration Stats Summary
After finishing each inference of `pipe(...)`, you can call the `cache_dit.summary()` API on the pipe to get the details of the cache acceleration stats for that inference. You can set the `details` param to `True` to show more details of the cache stats (in markdown table format). Sometimes this may help you analyze which values of the residual diff threshold would work better.
⚡️DBCache: Dual Block Cache
DBCache: Dual Block Caching for Diffusion Transformers. Different configurations of compute blocks (F8B12, etc.) can be customized in DBCache, enabling a balanced trade-off between performance and precision. Moreover, it can be entirely training-free. Please check DBCache.md docs for more design details.
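The core DBCache decision can be sketched in plain Python: always run the first Fn blocks exactly, compare their output against the previous step via a relative residual diff, and either reuse the cached middle-block residual or recompute it, with the last Bn blocks always computed for calibration. This is an illustrative re-implementation of the idea, not cache-dit's code; all names and the L1-based diff are assumptions:

```python
def dbcache_step(blocks, x, state, Fn=8, Bn=0, threshold=0.08):
    """One denoising step with a DBCache-style skip decision.
    blocks: list of callables mapping a feature list to a feature list;
    state: dict carrying caches between steps."""
    # 1) Always compute the first Fn blocks exactly.
    for blk in blocks[:Fn]:
        x = blk(x)
    # 2) Relative L1 diff of the Fn output vs. the previous step.
    prev = state.get("fn_out")
    if prev is None:
        diff = float("inf")
    else:
        denom = max(sum(abs(b) for b in prev), 1e-8)
        diff = sum(abs(a - b) for a, b in zip(x, prev)) / denom
    state["fn_out"] = list(x)
    end = len(blocks) - Bn
    if diff < threshold and "mid_residual" in state:
        # 3a) Cache hit: reuse the stored residual of the middle blocks.
        x = [a + r for a, r in zip(x, state["mid_residual"])]
        state["hits"] = state.get("hits", 0) + 1
    else:
        # 3b) Cache miss: recompute the middle blocks, store their residual.
        x_in = list(x)
        for blk in blocks[Fn:end]:
            x = blk(x)
        state["mid_residual"] = [a - b for a, b in zip(x, x_in)]
    # 4) Always compute the last Bn blocks exactly (calibration).
    for blk in blocks[end:]:
        x = blk(x)
    return x
```

Raising `threshold` trades precision for more cache hits, which mirrors the Fn/Bn trade-off the table above illustrates.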
DBCache, L20x1 , Steps: 28, "A cat holding a sign that says hello world with complex background"
🔥Hybrid TaylorSeer
We support the TaylorSeer algorithm (From Reusing to Forecasting: Accelerating Diffusion Models with TaylorSeers) to further improve the precision of DBCache when the number of cached steps is large, namely Hybrid TaylorSeer + DBCache. At timesteps with significant intervals, the feature similarity in diffusion models decreases substantially, significantly harming generation quality.
TaylorSeer employs a differential method to approximate the higher-order derivatives of features and predicts features at future timesteps via a Taylor series expansion. The TaylorSeer implemented in cache-dit supports both hidden-state and residual cache types; that is, $\mathcal{F}_{\text{pred}, m}\left(x_{t-k}^l\right)$ can be a residual cache or a hidden-state cache.
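The prediction step can be sketched with a first-order Taylor expansion: estimate the feature's derivative with respect to the step index by finite differences over the last two computed steps, then extrapolate ahead. This is a minimal first-order illustration only (the actual method supports higher orders):

```python
def taylor_predict(feat_prev, feat_curr, k_elapsed, k_ahead):
    """First-order TaylorSeer-style feature forecast.
    feat_prev / feat_curr: features computed k_elapsed steps apart;
    returns the predicted feature k_ahead steps after feat_curr."""
    # Finite-difference estimate of the first derivative d(feat)/d(step).
    deriv = [(c - p) / k_elapsed for p, c in zip(feat_prev, feat_curr)]
    # First-order Taylor expansion around the current step.
    return [c + d * k_ahead for c, d in zip(feat_curr, deriv)]
```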
Important
Please note that if you use TaylorSeer as the calibrator for approximating hidden states, the Bn param of DBCache can be set to 0. In essence, DBCache's Bn also acts as a calibrator, so you can choose either Bn > 0 or TaylorSeer. We recommend the TaylorSeer + DBCache FnB0 configuration.
DBCache F1B0 + TaylorSeer, L20x1, Steps: 28,
"A cat holding a sign that says hello world with complex background"
⚡️Hybrid Cache CFG
cache-dit supports caching for CFG (classifier-free guidance). For models that fuse CFG and non-CFG into a single forward step, or models that do not include CFG in the forward step, please set the `enable_separate_cfg` param to `False` (default: `None`). Otherwise, set it to `True`. For example:
⚙️Torch Compile
By the way, cache-dit is designed to work compatibly with torch.compile. You can easily combine cache-dit with torch.compile to achieve even better performance. For example:
However, users intending to use cache-dit for DiT models with dynamic input shapes should consider increasing the recompile limit of `torch._dynamo`. Otherwise, a recompile-limit error may be triggered, causing the module to fall back to eager mode. Please check perf.py for more details.
🛠Metrics CLI
You can utilize the APIs provided by cache-dit to quickly evaluate the accuracy losses caused by different cache configurations. For example:
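As an illustration of what such a metric evaluates (this is the standard PSNR formula in plain Python, not the cache-dit API), the accuracy loss between a baseline image and a cached run can be quantified like this:

```python
import math

def psnr(ref, test, max_val=255.0):
    """Peak signal-to-noise ratio between two equal-size images,
    given as flat lists of pixel values in [0, max_val]."""
    mse = sum((r - t) ** 2 for r, t in zip(ref, test)) / len(ref)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * math.log10(max_val ** 2 / mse)
```

A higher PSNR against the non-cached baseline indicates a cache configuration that loses less precision.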
Please check test_metrics.py for more details. Alternatively, you can use the `cache-dit-metrics-cli` tool. For example:
©️Citations