Fixes for Dynamic LoRA and Triton autotune #20
EricBCoding wants to merge 1 commit into BobJohnson24:main
Conversation
Add robust dynamic-LoRA synchronization and Triton kernel/runtime improvements.

- `int8_dynamic_lora.py`: add a model-patcher wrapper to sync dynamic LoRA settings from `transformer_options`, ensure global hook registration, and optimize the LoRA stack loader to batch-load multiple LoRAs into a cloned ModelPatcher in a single pass.
- `int8_fused_kernel.py`: introduce environment-driven Triton runtime configuration and an autotune toggle, provide a fixed-config fallback, factor out a kernel-strategy decorator, and make minor formatting/launch improvements to the Triton fused quantize/GEMM kernels and the Python wrapper.
- `int8_quant.py`: add a `torch.compile` disable fallback, a small-batch fallback threshold, and a dynamic-LoRA debug flag; implement `apply_dynamic_lora_delta` for offset-aware LoRA application; refactor `DynamicLoRAHook` to compute stable IDs, sync from `transformer_options`, normalize patch/module keys, and populate per-module `dynamic_lora_entries`; update the manual-cast Linear wrapper to use the new logic and efficient device transfers.
- `int8_unet_loader.py`: fix a missing comma in the `excluded_names` list.

Also includes small compatibility and logging/debug improvements and safer device/stride handling. This change improves correctness and performance when applying dynamic LoRAs and when running the INT8 fused kernels under different Triton/autotune settings.
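The kernel-strategy toggle described above for `int8_fused_kernel.py` might be factored roughly like this. A sketch only: the `INT8_TRITON_AUTOTUNE` variable comes from the PR, but the helper name `kernel_strategy`, the `configs` attribute, and the fixed config values are hypothetical stand-ins for the real Triton wiring.

```python
import os

# Hypothetical fixed launch parameters; the real PR picks concrete
# Triton configs (block sizes, num_warps, num_stages).
FIXED_CONFIG = {"BLOCK_M": 64, "BLOCK_N": 64, "num_warps": 4, "num_stages": 2}

def kernel_strategy(autotune_configs):
    """Return a decorator that either keeps the full autotune search
    space (INT8_TRITON_AUTOTUNE=1) or bakes in one fixed config so the
    kernel never re-benchmarks when input shapes change."""
    use_autotune = os.environ.get("INT8_TRITON_AUTOTUNE", "0") == "1"

    def decorate(kernel_fn):
        # In the real code this would apply triton.autotune(...) or a
        # fixed launch; here we just record which configs were chosen.
        kernel_fn.configs = list(autotune_configs) if use_autotune else [FIXED_CONFIG]
        return kernel_fn

    return decorate

@kernel_strategy([FIXED_CONFIG, {**FIXED_CONFIG, "num_stages": 3}])
def int8_gemm_kernel():
    pass
```

With the variable unset, the decorated kernel carries only the single fixed config, which is what removes the per-shape retuning cost.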
What does this do for speed? Compile seems kinda essential, although I think the previous version compiled by itself even with a post-compile node. There are a few configs that should be added, e.g. using 2 stages gave a slight speedup. My issue was that it would thrash autotune as soon as you sent it another prompt. Guess I'll try it and see after adding in my "good" configs.
This is kinda obtuse AF. People use a lot of different workflows in Comfy, and turning settings into env vars is a huge step back. Who will even remember these things across all the different nodes you use?
Review comment on the `@triton.autotune(configs=[...])` decorator in `int8_fused_kernel.py`:
How much would it help here to add `cache_results=True`? Perhaps that would stop the thrashing even without picking a fixed config.
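For context, here is a minimal model of why shape-keyed autotuning thrashes. This is illustrative Python, not Triton's API: the point is that an autotuner memoizes its best config per `key`, so every unseen shape triggers a fresh benchmark sweep; persisting results (as `cache_results=True` does in Triton) only skips the sweep when the same shapes recur.

```python
class TinyAutotuner:
    """Toy model of a shape-keyed autotune cache (not Triton's API)."""

    def __init__(self, configs, bench):
        self.configs = configs      # candidate launch configs
        self.bench = bench          # bench(config, key) -> simulated runtime
        self.cache = {}             # key (e.g. (M, N, K)) -> best config
        self.sweeps = 0             # benchmark sweeps performed

    def best(self, key):
        if key not in self.cache:   # unseen shape => full sweep ("thrash")
            self.sweeps += 1
            self.cache[key] = min(self.configs, key=lambda c: self.bench(c, key))
        return self.cache[key]

# Pretend fewer pipeline stages is always fastest.
tuner = TinyAutotuner(
    configs=[{"num_stages": 2}, {"num_stages": 3}],
    bench=lambda c, key: c["num_stages"],
)
tuner.best((64, 64, 64))   # first shape: sweep 1
tuner.best((64, 64, 64))   # cached, no new sweep
tuner.best((128, 64, 64))  # new shape: sweep 2
```

A new prompt that changes sequence lengths produces new keys, hence the re-tuning hit the commenters describe.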
@Ph0rk0z Thank you for the feedback! In response to your questions:
We can definitely turn these into inputs - I just added them as env vars for now to avoid changing nodes in a PR that already changes a lot. They could either be added to the model loader or as a separate node. Thoughts?
This is also currently parsed as an environment variable.
Speed-wise, it's similar to the main repo on my device, just without the autotune thrash. Quick test making an 8-step, 1MP image using Z-Turbo on my GeForce 3090:
I haven't tried setting
I repeated your results for caching the config. It still recompiled like you said. I never really tried dynamic LoRA and have been using stochastic. Bear in mind, the GPU I use for this is Turing, so it's slower than Ampere, and the config could be better/worse on other arch. My Chroma with no caching ends up at like 1.15-1.20 it/s. For ZIT I use Nunchaku.

IMO it would be wiser to separate some of these things: the loader typo fix (5) can be a PR on its own and an easy win. Same for the dynamic LoRA changes. Then again I'm not the repo owner so I can't really make the call. A ton of this PR just rearranges things and changes comments too, which seems counter-productive.

Kinda stinks that there seems to be no way to get around the fixed configs. Someone doing 40s deliverable-style outputs probably doesn't notice, but I do, spamming one-time-use images in my chat client. Changing LoRAs is also a big hit and takes like 30s; maybe dynamic helps with that?
Thanks - let's see what @BobJohnson24's preferences are re: pull requests and I can try to disentangle some of these changes if it's easier for him. 🙂
That's with the PR? I, too, was getting crazy load times for LoRAs on the main repo, but with this PR it's about as fast as using LoRAs without INT8 (both stochastic and dynamic). Maybe 4-5 seconds with ZIT. Could be a Nunchaku-related thing? I'm not using Nunchaku at the moment.
No, so far I've used this loader for Chroma only. Nunchaku LoRAs are very fast, along with inference. I have the Klein model too, but I haven't tried it yet. I think changing LoRAs the old way caused a recompile.
Summary
This PR fixes dynamic LoRA + torch.compile interoperability, removes shape-driven Triton autotune overhead by default (should resolve #13), and resolves fallback dtype/correctness bugs in the INT8 inference paths. It also prevents lengthy model recompilation on first runs in Comfy.
Disclosure: some patches made with assistance of Codex 5.3, so please test carefully before merging. It's working well on my device (specifically with Z-Image Turbo), but there are a lot of code changes here. Feedback appreciated.
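One of the changes in this PR normalizes module keys across the `diffusion_model.*`, `model.*`, and `_orig_mod.*` naming variants. A hedged sketch of what such normalization might look like; the helper name is hypothetical and the PR's actual logic may differ in detail, but the three prefixes come from the PR description:

```python
def normalize_module_key(key: str) -> str:
    """Map 'diffusion_model.*', 'model.*' and torch.compile's
    '_orig_mod.*' spellings of a module path onto one canonical form.
    (Illustrative sketch; the real normalization may differ.)"""
    prefixes = ("_orig_mod.", "model.", "diffusion_model.")
    stripped = True
    while stripped:             # prefixes can stack, e.g. model.diffusion_model.*
        stripped = False
        for prefix in prefixes:
            if key.startswith(prefix):
                key = key[len(prefix):]
                stripped = True
    return key
```

With something like this, LoRA patch keys recorded before compilation can still be matched against the `_orig_mod.`-prefixed names that `torch.compile` gives the wrapped modules.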
Changes
1) Triton compile/autotune stability

`int8_fused_kernel.py`

- Disable Triton autotune by default and launch kernels with a fixed-config fallback; autotune can be re-enabled via an environment variable (`INT8_TRITON_AUTOTUNE=1`).

2) Dynamic LoRA sync under torch.compile

`int8_dynamic_lora.py`, `int8_quant.py`

- Sync dynamic LoRA settings from `transformer_options["dynamic_loras"]` before the compiled forward runs.
- Normalize module keys across the `diffusion_model.*`, `model.*`, and `_orig_mod.*` naming variants.

3) Dynamic LoRA correctness for packed/sliced layers

`int8_quant.py`

- Apply LoRA deltas offset-aware via `apply_dynamic_lora_delta`, so packed/sliced layers receive the correct slices.

4) Fallback dtype bug fix

`int8_quant.py`

- Use `F.linear` in the fallback path, avoiding dtype mismatch/precision issues.
- Make the small-batch fallback threshold configurable (`INT8_SMALL_BATCH_FALLBACK_MAX_ROWS`).

5) Loader typo fix

`int8_unet_loader.py`

- Restore the separate `'time_projection', 'head'` entries in the excluded-names list (previously merged by a missing comma).
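The typo in (5) is a consequence of Python's implicit string-literal concatenation: adjacent string literals fuse into one when a comma between them is missing, so the exclusion list silently contained a single bogus name instead of two valid ones. A minimal illustration:

```python
# A missing comma makes Python concatenate adjacent string literals,
# silently turning two exclusion entries into one name that matches nothing.
buggy_excluded = ['time_projection' 'head']       # missing comma: one entry
fixed_excluded = ['time_projection', 'head']      # two separate entries

assert buggy_excluded == ['time_projectionhead']
assert len(fixed_excluded) == 2
```

Because the fused string matches no real module name, the bug would simply cause `'time_projection'` and `'head'` to not be excluded, which is why it is an easy standalone fix.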