
Fixes for Dynamic LoRA and Triton autotune #20

Open
EricBCoding wants to merge 1 commit into BobJohnson24:main from EricBCoding:main

Conversation

@EricBCoding

@EricBCoding EricBCoding commented Feb 7, 2026

Summary

This PR fixes dynamic LoRA + torch.compile interoperability, removes shape-driven Triton autotune overhead by default (should resolve #13), and resolves fallback dtype/correctness bugs in INT8 inference paths. Prevents lengthy model recompilation times on Comfy first runs.

Disclosure: some patches made with assistance of Codex 5.3, so please test carefully before merging. It's working well on my device (specifically with Z-Image Turbo), but there are a lot of code changes here. Feedback appreciated.

Changes

1) Triton compile/autotune stability

  • File: int8_fused_kernel.py
  • Added fixed kernel config path (default) to avoid per-shape Triton autotune churn.
  • Added env controls for kernel params and optional autotune re-enable (INT8_TRITON_AUTOTUNE=1).
  • Ensured fixed config globals are defined early so TorchDynamo guards do not fail on missing module attributes.
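As a rough illustration of the fixed-config path, the selection logic might look like the sketch below. The constant names, defaults, and env-var spellings other than INT8_TRITON_AUTOTUNE and INT8_TRITON_NUM_STAGES are illustrative, not the exact code in int8_fused_kernel.py:

```python
import os

# Illustrative defaults -- the real kernel params in int8_fused_kernel.py may differ.
FIXED_BLOCK_M = int(os.environ.get("INT8_TRITON_BLOCK_M", "64"))
FIXED_BLOCK_N = int(os.environ.get("INT8_TRITON_BLOCK_N", "64"))
FIXED_NUM_STAGES = int(os.environ.get("INT8_TRITON_NUM_STAGES", "3"))

def use_autotune() -> bool:
    # Autotune is off by default; INT8_TRITON_AUTOTUNE=1 re-enables it.
    return os.environ.get("INT8_TRITON_AUTOTUNE", "0") == "1"

def kernel_config():
    # Defining these module-level constants up front (rather than lazily)
    # is what keeps TorchDynamo guards from tripping on missing attributes.
    if use_autotune():
        return None  # defer to the @triton.autotune per-shape search
    return {
        "BLOCK_M": FIXED_BLOCK_M,
        "BLOCK_N": FIXED_BLOCK_N,
        "num_stages": FIXED_NUM_STAGES,
    }
```

With autotune disabled, every shape launches with the same config, so no per-shape tuning pass can be triggered by new prompt/conditioning shapes.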

2) Dynamic LoRA sync under Torch Compile

  • Files: int8_dynamic_lora.py, int8_quant.py
  • Added an APPLY_MODEL wrapper to force sync of transformer_options["dynamic_loras"] before compiled forward runs.
  • Added stable dynamic-LoRA identity handling to prevent unnecessary recomposition/recompile behavior.
  • Added robust module key normalization for diffusion_model.*, model.*, and _orig_mod.* naming variants.
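The key normalization described above can be sketched roughly as follows; the prefix set matches the naming variants listed, but the function name and exact stripping order are illustrative, not the PR's actual implementation:

```python
def normalize_module_key(key: str) -> str:
    """Strip wrapper prefixes so 'model.diffusion_model.x', 'diffusion_model.x',
    and '_orig_mod.x' (added by torch.compile) all map to one canonical key.
    Illustrative sketch only."""
    prefixes = ("model.", "diffusion_model.", "_orig_mod.")
    stripped = True
    while stripped:
        stripped = False
        for p in prefixes:
            if key.startswith(p):
                key = key[len(p):]
                stripped = True
    return key
```

Normalizing both the patch keys and the live module names to the same canonical form is what lets dynamic-LoRA entries match modules regardless of whether the model is wrapped by ComfyUI's patcher or by torch.compile.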

3) Dynamic LoRA correctness for packed/sliced layers

  • File: int8_quant.py
  • Added offset-aware dynamic LoRA application for tuple patch keys carrying slice metadata.
  • Handles both input-sliced and output-sliced adapters safely.
  • Prevents shape-mismatch crashes (e.g. packed qkv dimensions) and ensures LoRA actually affects outputs.
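The offset-aware application can be illustrated with a NumPy sketch (the real code operates on torch tensors; the function signature and the names `offset`/`length` stand in for the slice metadata carried by tuple patch keys):

```python
import numpy as np

def apply_dynamic_lora_delta(weight, lora_down, lora_up, scale, offset, length):
    # The adapter's delta only covers rows [offset : offset+length] of a
    # packed weight (e.g. one of q/k/v inside a fused qkv projection).
    # Applying it full-width would either crash on shape mismatch or
    # silently corrupt the other slices.
    delta = scale * (lora_up @ lora_down)      # shape: (length, in_features)
    out = weight.copy()
    out[offset:offset + length] += delta
    return out
```

Output-sliced adapters offset along the row axis as shown; input-sliced adapters would offset along the column axis instead.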

4) Fallback dtype bug fix

  • File: int8_quant.py
  • Fixed small-batch fallback path to cast bias to input dtype before F.linear, avoiding dtype mismatch/precision issues.
  • Added configurable small-batch fallback threshold (INT8_SMALL_BATCH_FALLBACK_MAX_ROWS).
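The dtype fix amounts to casting the bias to the input's dtype before the linear op. A NumPy sketch of the idea (the real path uses `torch.nn.functional.linear`; names here are illustrative):

```python
import numpy as np

def linear_fallback(x, weight, bias):
    # A float32 bias added to a float16 activation silently promotes the
    # result (or errors under torch.compile); cast the bias to the input
    # dtype first so the fallback matches the fused kernel's behavior.
    if bias is not None and bias.dtype != x.dtype:
        bias = bias.astype(x.dtype)
    return x @ weight.T + bias
```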

5) Loader typo fix

  • File: int8_unet_loader.py
  • Fixed WAN exclusion list typo ('time_projection', 'head') caused by a missing comma.
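This typo is easy to miss because Python silently concatenates adjacent string literals, so the missing comma merges two exclusion entries into one bogus name rather than raising an error:

```python
# Missing comma: adjacent string literals are concatenated at parse time.
broken = ['time_projection' 'head']    # one entry: 'time_projectionhead'
fixed = ['time_projection', 'head']    # two entries, as intended
```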

Add robust dynamic-LoRA synchronization and Triton kernel/runtime improvements.

- int8_dynamic_lora.py: add a model-patcher wrapper to sync dynamic LoRA settings from transformer_options, ensure global hook registration, and optimize the LoRA stack loader to batch-load multiple LoRAs into a cloned ModelPatcher in a single pass.
- int8_fused_kernel.py: introduce environment-driven Triton runtime configuration and autotune toggle, provide fixed-config fallback, factor kernel strategy decorator, and minor formatting/launch improvements for the Triton fused quantize/GEMM kernels and Python wrapper.
- int8_quant.py: add torch.compile disable fallback, small-batch fallback threshold, dynamic-LoRA debug flag, implement apply_dynamic_lora_delta for offset-aware LoRA application, refactor DynamicLoRAHook to compute stable IDs, sync from transformer_options, normalize patch/module keys, and populate per-module dynamic_lora_entries; update manual-cast Linear wrapper to use the new logic and efficient device transfers.
- int8_unet_loader.py: fix a missing comma in excluded_names list.

Also includes small compatibility and logging/debug improvements and safer device/stride handling. This change improves correctness and performance when applying dynamic LoRAs and when running the INT8 fused kernels under different Triton/autotune settings.
@Ph0rk0z
Contributor

Ph0rk0z commented Feb 9, 2026

What's this do for speed? Compile seems kinda essential.. although I think the previous version compiled by itself even with a post compile node. There's a few configs there that should be added.. i.e. using 2 stages gave a slight speedup.

My issue was that it would thrash autotune as soon as you sent it another prompt. Guess I will try it and see after adding in my "good" configs.

Added env controls for kernel params and optional autotune re-enable (INT8_TRITON_AUTOTUNE=1).

This is kinda obtuse AF. People use a lot of different WF in comfy and turning settings into env vars is a huge step back. Who will even remember these things for all the different nodes you use.

# =============================================================================

@triton.autotune(
configs=[

how much would it help here to add cache_results=True? Perhaps that would stop the thrashing even without picking a fixed config.

@EricBCoding
Author

EricBCoding commented Feb 9, 2026

@Ph0rk0z Thank you for the feedback!

In response to your questions:

What's this do for speed? Compile seems kinda essential..

torch.compile is still active for the model path. The only places I disabled Dynamo are the small helper paths that were causing guard/recompile churn (int8_forward_dynamic + dynamic LoRA sync/apply helpers).

People use a lot of different WF in comfy and turning settings into env vars is a huge step back.

We can definitely turn these into inputs - I just added them as env vars for now to avoid changing nodes in a PR that already changes a lot. They could either be added to the model loader or as a separate node. Thoughts?

how much would it help here to add cache_results=True? Perhaps that would stop the thrashing even without picking a fixed config.

cache_results=True is enabled on the autotune path in this PR. In my testing, this alone isn't enough to fully solve #13 because prompt/conditioning changes can generate new shape keys and trigger fresh autotune passes. I think the need for a fixed config might be unavoidable. (though happy to be proven wrong on this)

There's a few configs there that should be added.. i.e. using 2 stages gave a slight speedup.

This is also currently parsed as an environment variable, INT8_TRITON_NUM_STAGES here

What's this do for speed?

Speed-wise, it's similar to the main repo on my device, just without the autotune thrash. Quick test making an 8-step, 1MP image using Z-Turbo on my GeForce RTX 3090:

  • ~1.95it/s using a stochastic LoRA
  • ~1.65it/s using a dynamic LoRA

I haven't tried setting num_stages to 2 yet, though!

@Ph0rk0z
Contributor

Ph0rk0z commented Feb 9, 2026

I repeated your results for caching the config. It still recompiled like you said. I never really tried dynamic LoRA and have been using stochastic. Bear in mind, the GPU I use for this is Turing, so it's slower than Ampere, and the config could be better/worse on other arch. My Chroma with no caching ends up at like 1.15-1.20 it/s. For ZIT I use Nunchaku.

IMO it would be wiser to separate some of these things: 5) Loader typo fix could be a PR on its own and an easy win. Same for the dynamic LoRA changes. Then again, I'm not the repo owner so I can't really make the call. A ton of this PR just rearranges things and changes comments too, which seems counter-productive.

Kinda stinks that there seems to be no way to get around the fixed configs. Someone doing 40s deliverable style outputs probably doesn't notice but I do spamming one-time use images in my chat client.

Changing loras is also a big hit and takes like 30s, maybe dynamic helps with that?

@EricBCoding
Author

EricBCoding commented Feb 9, 2026

Thanks - let's see what @BobJohnson24's preferences are re: pull requests and I can try to disentangle some of these changes if it's easier for him. 🙂

Changing loras is also a big hit and takes like 30s, maybe dynamic helps with that?

That's with the PR? I, too, was getting crazy load times for LoRAs on the main repo, but with this PR it's about as fast as using LoRAs without INT8 (both stochastic and dynamic). Maybe 4-5 seconds with ZIT.

Could be a Nunchaku-related thing? I'm not using Nunchaku at the moment.

@Ph0rk0z
Contributor

Ph0rk0z commented Feb 10, 2026

No, so far I've used this loader for Chroma only. Nunchaku LoRAs are very fast, along with inference. I have the Klein model too but I haven't tried it yet. I think changing LoRAs the old way caused a recompile.



Development

Successfully merging this pull request may close these issues.

Conditioning is slower?
