
Add Windows/clang-cl support for AMD HIP backend#179

Merged
woct0rdho merged 1 commit into woct0rdho:release/3.5.x-windows from jammm:jam/windows_amd
Dec 30, 2025

Conversation

@jammm
Collaborator

@jammm jammm commented Dec 30, 2025

This allows Triton to run on AMD GPUs on Windows via TheRock wheels - https://github.com/ROCm/TheRock/blob/main/RELEASES.md

It should build as-is with @woct0rdho's existing build process, since it only modifies .py files and a .c file that's compiled at runtime.

Whenever you run a program that requires Triton, make sure to set the following environment variables (a minimal Python sketch follows the list):

  • CC and CXX to clang-cl
  • ROCM_HOME to the output of rocm-sdk path --root (and prepend it to $PATH)
  • DISTUTILS_USE_SDK=1
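A minimal Python sketch of that setup, assuming the rocm-sdk CLI from TheRock wheels is available on PATH (illustrative only; set these before anything imports Triton):

```python
import os
import subprocess

# Resolve the ROCm SDK root via the rocm-sdk CLI that ships with TheRock wheels.
rocm_root = subprocess.check_output(["rocm-sdk", "path", "--root"], text=True).strip()

os.environ["CC"] = "clang-cl"
os.environ["CXX"] = "clang-cl"
os.environ["DISTUTILS_USE_SDK"] = "1"
os.environ["ROCM_HOME"] = rocm_root
os.environ["PATH"] = rocm_root + os.pathsep + os.environ["PATH"]
```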

Summary of changes (a rough Python-level sketch of the loader split follows the list):

  • Use LoadLibrary/GetProcAddress on Windows instead of dlopen/dlsym
  • Use rocm_sdk.find_libraries() to locate amdhip64
  • Add platform-specific macros for dynamic library loading
  • Escape Windows paths for C string embedding
  • Treat clang-cl as MSVC-compatible compiler in build.py
  • Fix NamedTemporaryFile handling on Windows in compiler.py

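Not the actual C launcher code, just a rough Python-level sketch of the loader split described above; the DLL/.so names are taken from elsewhere in this thread and may differ per install:

```python
import ctypes
import sys

if sys.platform == "win32":
    # Windows path: LoadLibrary/GetProcAddress (ctypes.WinDLL wraps LoadLibrary).
    # Requires the rocm-sdk bin directory to be on PATH (e.g. after `rocm-sdk init`).
    hip = ctypes.WinDLL("amdhip64_7.dll")
else:
    # Linux path: dlopen/dlsym (ctypes.CDLL wraps dlopen).
    hip = ctypes.CDLL("libamdhip64.so")

# dlsym/GetProcAddress equivalent: look up a HIP runtime symbol by name.
count = ctypes.c_int(0)
hip.hipGetDeviceCount(ctypes.byref(count))
print("HIP devices:", count.value)

# The real launcher also doubles backslashes when embedding Windows paths
# into generated C source, e.g. "C:\rocm" becomes "C:\\rocm" in the string literal.
```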
@woct0rdho
Owner

woct0rdho commented Dec 30, 2025

Looks good to me! I haven't followed the modern AMD toolchain for a while, but if this is enough to make it work, then it will not add much maintenance cost.

Maybe you can also tell people at https://github.com/patientx/ComfyUI-Zluda and https://github.com/lshqqytiger/triton about this.

@woct0rdho woct0rdho merged commit 9d87bfc into woct0rdho:release/3.5.x-windows Dec 30, 2025
@woct0rdho
Owner

woct0rdho commented Dec 30, 2025

I've cherry-picked this onto the upcoming release/3.6.x-windows branch and you can test it. The wheel is at https://github.com/Comfy-Org/wheels/actions/runs/20599020739

@woct0rdho
Owner

Also, can you test the Triton 3.5 wheel at https://github.com/Comfy-Org/wheels/actions/runs/20599014618 , in the way that users would install it? If it works, I'll publish it to PyPI.

@jammm
Collaborator Author

jammm commented Dec 30, 2025

Also, can you test the Triton 3.5 wheel at https://github.com/Comfy-Org/wheels/actions/runs/20599014618 , in the way that users would install it? If it works, I'll publish it to PyPI.

Just tried triton_windows-3.5.1.post23-cp312-cp312-win_amd64.whl and ran it on TurboDiffusion, which uses the flash-attn Triton backend via FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE", and it seems to work fine (although it didn't use the Triton FA kernels, since the SpargeAttn HIP kernels were used; it did use the layernorm, QK-quantize, etc. Triton kernels that ship with flash-attn).

generated_video.mp4

@alexsarmiento

The Python test examples from Triton finally work with my gfx1100. Also, sageattention 1.0.6 works in ComfyUI and I am getting faster generations with some workflows.

But when I try torch.compile via inductor in comfyui, it fails:

Assertion failed: llvm::isUIntN(BitWidth, val) && "Value is not an N-bit unsigned value", file D:\a\triton\triton\llvm-project\llvm\include\llvm/ADT/APInt.h, line 128

But this seems to be a bug with llvm.

@IxMxAMAR

IxMxAMAR commented Jan 2, 2026

The python test examples from Triton finally work with my gfx1100 . Also, sageattention1.0.6 works in comfyui and i am getting faster generations with some workflows

But when I try torch.compile via inductor in comfyui, it fails:

Assertion failed: llvm::isUIntN(BitWidth, val) && "Value is not an N-bit unsigned value", file D:\a\triton\triton\llvm-project\llvm\include\llvm/ADT/APInt.h, line 128

But this seems to be a bug with llvm.

How did you achieve this? I have an RX 7600, can I do this too? Can you share your ComfyUI run.bat, or the args and env variables you are using? And is it official ComfyUI or ZLUDA?

@patientx

patientx commented Jan 3, 2026

I've cherry-picked this onto the upcoming release/3.6.x-windows branch and you can test it. The wheel is at https://github.com/Comfy-Org/wheels/actions/runs/20599020739

First of all, thanks for working on this, much appreciated.
Is installing it like this OK, since it seems you've added it: "pip install triton-windows", which installs triton-windows 3.5.1.post23? I then installed sage-attention with "pip install sageattention==1.0.6" and flash-attention as well, but in the end both gave errors. I set up the parameters like this in the starter batch for Comfy:

set CC=clang-cl
set CXX=clang-cl
set DISTUTILS_USE_SDK=1
for /f "delims=" %%i in ('python -c "import rocm; print(rocm.path[0])"') do set ROCM_HOME=%%i

Since rocm is installed as a package, that last one should work, right?

@jammm
Collaborator Author

jammm commented Jan 3, 2026 via email

@0xDELUXA

0xDELUXA commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

I’m curious about Sage too...

@patientx

patientx commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

It turns out I only needed to run "rocm-sdk init" after activating the venv. It works when done on the command line, and with batch files too. torch.compile now works but generates black output; also, sage-attention, as deluxa says, doesn't work.

Edit: sage-attention works with the patches on my ComfyUI-Zluda fork.

@IxMxAMAR

IxMxAMAR commented Jan 3, 2026

Haven’t tried building sage attention but you could follow the environment variable setup here https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk

It turns out I only needed to run rocm-sdk init" after activating venv. It works if this is done in commandline only with batch files too. Torch.compile now works but generates black output also sageattention as deluxa says doesn't work.

edit : sage-attention works with the patches on my comfyui-zluda fork.

I also compiled SpargeAttn, but how exactly am I supposed to use it with ComfyUI? Any idea?

@0xDELUXA

0xDELUXA commented Jan 3, 2026

It turns out I only needed to run rocm-sdk init" after activating venv. It works if this is done in commandline only with batch files too. Torch.compile now works but generates black output also sageattention as deluxa says doesn't work.

edit : sage-attention works with the patches on my comfyui-zluda fork.

I don't know what changes your fork has, but why don't you make a PR here so everyone can use SageAttention?

I'll try Sage myself soon too. What about the performance vs SDPA flash?

@patientx

patientx commented Jan 3, 2026

It turns out I only needed to run rocm-sdk init" after activating venv. It works if this is done in commandline only with batch files too. Torch.compile now works but generates black output also sageattention as deluxa says doesn't work.
edit : sage-attention works with the patches on my comfyui-zluda fork.

I don't know what changes your fork has, but why don't you make a PR here so everyone can use SageAttention?

I'll try Sage myself soon too. What about the performance vs SDPA flash?

On my RX 6800, SDPA is the slowest, slower than quad-cross which I was using by default. The sage-attention "patches" were made by someone in the SD.Next Discord, and I was applying them on every install with the ZLUDA setup; interestingly, I didn't need them with lee's torch builds from May 2024 (they were using HIP 6.5, though). Now it seems the same patches also work here. Just replace these three files; here are curl commands to apply them directly when the venv is activated and you're inside the ComfyUI directory:

  • install sage-attention (v1) with this: `pip install sageattention==1.0.6`
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block.py
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block_causal.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block_causal.py
curl -s -o venv\Lib\site-packages\sageattention\quant_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/quant_per_block.py
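If curl isn't available, here is a rough Python equivalent of those three downloads (same URLs and destinations; assumes you run it from the ComfyUI directory with the venv layout above):

```python
import os
import urllib.request

BASE = ("https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/"
        "refs/heads/master/comfy/customzluda/sa/")
DEST = os.path.join("venv", "Lib", "site-packages", "sageattention")

for name in ("attn_qk_int8_per_block.py",
             "attn_qk_int8_per_block_causal.py",
             "quant_per_block.py"):
    # Overwrites the stock sageattention file with the patched version.
    urllib.request.urlretrieve(BASE + name, os.path.join(DEST, name))
```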

@0xDELUXA

0xDELUXA commented Jan 3, 2026

On my rx6800 sdpa is the slowest , slower than quad-cross which I was using by default. Sage attention "patches" were made by someone in sdnext discord and I was applying them every install with zluda setup , interestingly I didn't need them with lee's torch builds from may 2024. (they were using hip 6.5 though) Now it seems the same patches also work here. Just replace these three files actually here are curl commands to apply them directly when venv is activated and inside comfyui directory.

  • install sage-attention (v1) with this : `pip install sageattention==1.0.6'
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block.py,
curl -s -o venv\Lib\site-packages\sageattention\attn_qk_int8_per_block_causal.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/attn_qk_int8_per_block_causal.py 
curl -s -o venv\Lib\site-packages\sageattention\quant_per_block.py https://raw.githubusercontent.com/patientx/ComfyUI-Zluda/refs/heads/master/comfy/customzluda/sa/quant_per_block.py venv\Lib\site-packages\sageattention\quant_per_block.py

Int8? I think these are for RDNA2 or 3, don't think RDNA4 needs them. Will try soon tho

Edit: Yes, it does. I’m getting a lot of errors coming from venv\Lib\site-packages\sageattention\attn_qk_int8_per_block.py.
I think we could consider patching those three files in this repo for AMD only, by default (my suggestion, feel free to ignore).

@sfinktah

sfinktah commented Jan 4, 2026

Yes, that works. It did do a nasty crash the first time, which I am saving here not as a complaint, but as a reference for a possible side project: "Write a program to force an AMD driver crash in order to free up all that VRAM that dwm never gives back."

Sampling 81 frames at 640x640 with 2 steps
  0%| | 0/2 [00:00<?, ?it/s]
Generated new RoPE frequencies
Exception Code: 0xC0000005
 #0 0x00007ff8c47c06eb (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x9206eb)
 #1 0x00007ff8c42f4315 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x454315)
 #2 0x00007ff8c432ef47 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48ef47)
 #3 0x00007ff8c432dec6 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48dec6)
 #4 0x00007ff8c432e1b4 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x48e1b4)
 #5 0x00007ff8c431b105 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x47b105)
 #6 0x00007ff8c429010f (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3f010f)
 #7 0x00007ff8c4290231 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3f0231)
 #8 0x00007ff8c42b3a86 (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x413a86)
 #9 0x00007ff8c424b0ff (C:\zluda\comfy-rock\venv\Lib\site-packages\_rocm_sdk_core\bin\amdhip64_7.dll+0x3ab0ff)
#10 0x00007ff999b5259d (C:\WINDOWS\System32\KERNEL32.DLL+0x1259d)
#11 0x00007ff99bdeaf78 (C:\WINDOWS\SYSTEM32\ntdll.dll+0x5af78)

FYI the ZLUDA sageattn is basically just a patch to change the parameters to

BLOCK_M: 32, BLOCK_N: 16, STAGE: 1, waves_per_eu: 3 or 4, num_warps: 2, num_ctas: 1, num_stages: 1

Otherwise it uses too much "shared memory" and produces black screens. See also https://raw.githubusercontent.com/sfinktah/amd-torch/refs/heads/main/patches/sageattention-1.0.6+sfinktah+env-py3-none-any.patch which is an environment variable adjustable version of the same thing.
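For reference, that parameter set written out as a Triton config object (illustrative only; the actual patch hard-codes the values at the sageattention kernel call sites, and whether STAGE/waves_per_eu are passed this way depends on the kernel signature):

```python
import triton

# The RDNA-friendly parameters described above, as a single Triton config.
zluda_sage_config = triton.Config(
    {"BLOCK_M": 32, "BLOCK_N": 16, "STAGE": 1, "waves_per_eu": 3},
    num_warps=2,
    num_ctas=1,
    num_stages=1,
)
```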

@IxMxAMAR

IxMxAMAR commented Jan 4, 2026

BLOCK_M: 32, BLOCK_N: 16, STAGE: 1, waves_per_eu: 3 or 4, num_warps: 2, num_ctas: 1, num_stages: 1

I have tried many, many combinations besides these, and none of them worked; most of the time I got garbled noise instead of an image, and sometimes a black-and-white frame of what was supposed to be the subject. I have an RX 7600.

@rwfsmith

rwfsmith commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

Using an SDXL fine-tune with a DMD2 LoRA and an upscaler KSampler step
Base resolution: 1024x1496
Upscale resolution: 1536x2240
8 steps on both
CFG: 1
lcm/exponential

Compared flash attention against PyTorch cross attention across 10 image generations, listing the averages and the it/s for the base and upscaler samplers at the end.

| Flash | Cross |
| --- | --- |
| 29.11 | 31.74 |
| 28.84 | 32.48 |
| 27.92 | 30.7 |
| 27.86 | 30.58 |
| 28.1 | 30.67 |
| 28.04 | 30.61 |
| 27.98 | 30.7 |
| 28.07 | 30.67 |
| 28.04 | 30.88 |
| 28.11 | 30.76 |
| 28.207 (avg) | 30.979 (avg) |
| 1.83 it/s (base sampler) | 1.67 it/s (base sampler) |
| 1.36 s/it (upscaler) | 1.58 s/it (upscaler) |
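Plain arithmetic on the table above, for anyone skimming (flash ends up roughly 10% faster end to end):

```python
flash = [29.11, 28.84, 27.92, 27.86, 28.10, 28.04, 27.98, 28.07, 28.04, 28.11]
cross = [31.74, 32.48, 30.70, 30.58, 30.67, 30.61, 30.70, 30.67, 30.88, 30.76]

print(sum(flash) / len(flash))   # ~28.207
print(sum(cross) / len(cross))   # ~30.979
print(sum(cross) / sum(flash))   # ~1.098 -> flash ~10% faster overall
```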

Edit: I also just tested with sage attention 1, but the results seem to be the same as cross attention.

@IxMxAMAR

IxMxAMAR commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

How did you do that? I have also compiled Sparge and Sage but haven't tried Flash yet; Flash 2 seems even better, so how can I?
I have an RX 7600, which one do you have?

@0xDELUXA

0xDELUXA commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

I was also able to; based on my results, Flash 2 is slower than AOTriton SDPA Flash on RDNA4.

@rwfsmith

rwfsmith commented Jan 6, 2026

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

how did you do that? I have also compiled Sparge and Sage but haven't tried Flash yet, Flash 2 seems even better so how can I? I have an Rx 7600 which one do you have?

Just followed the AMD steps in the Flash Attention repo README; it worked pretty well. I've been having trouble getting the Sparge/Sage stuff working after building it.

I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results.

I was also able, based on my results Flash 2 is slower than AOTriton SPDA Flash on RDNA4.

I think that's because AOTriton has FP8 kernels for RDNA4? I think I read something about that in another thread. That may have been about why it's faster than SageAttention 1, but that could be a similar cause. I'm mostly interested in using Flash Attention for training, since it's supposed to speed it up quite a bit and help reduce memory usage.

@0xDELUXA

0xDELUXA commented Jan 6, 2026

I think that's because AOTriton has FP8 kernels for RDNA4? I think I read something about that in another thread. That may have been about why it's faster than SageAttention 1, but that could be a similar cause. I'm mostly interested in using Flash Attention for training, since it's supposed to speed it up quite a bit and help reduce memory usage.

I think so too. Curious how Sparge will work on RDNA4 compared to the other attention mechanisms, for example in TurboDiffusion.

@IxMxAMAR

IxMxAMAR commented Jan 6, 2026

just followed the AMD steps in the Flash Attention repo readme file, it worked pretty well. I've been having trouble getting the Sparge/Sage stuff working after building it.

Are you on Linux? Because I've tried building it so many times on Windows, and somehow link.exe just fails to link all the files.

@0xDELUXA

0xDELUXA commented Jan 6, 2026

are you on linux? cuz i have tried building it so many times on windows and somehow the link.exe just fails to link all the files

It works on Windows too. Don't forget to set $env:FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" before you build, otherwise it will try to use the CK backend, which isn't available on Windows.

@IxMxAMAR

IxMxAMAR commented Jan 6, 2026

are you on linux? cuz i have tried building it so many times on windows and somehow the link.exe just fails to link all the files

It works on Windows too. Don't forget to set $env:FLASH_ATTENTION_TRITON_AMD_ENABLE="TRUE" before you build, otherwise it will try to use the CK backend, which isn't available on Windows.

I waited nearly an hour for it to even build, and when I tried to use it I got this (screenshot attached):

Total VRAM 8176 MB, total RAM 16184 MB
pytorch version: 2.11.0a0+rocm7.11.0a20251217
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1102
ROCm version: (7, 2)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7600 : native
Using async weight offloading with 2 streams
Enabled pinned memory 7282.0


To use the `--use-flash-attention` feature, the `flash-attn` package must be installed first.
command:
        C:\ComfyUI\venv\Scripts\python.exe -m pip install flash-attn
Press any key to continue . . .

@rwfsmith

rwfsmith commented Jan 6, 2026

Odd. It only took a few seconds for mine to build, since it shouldn't be building the kernels; it uses the triton-windows kernels instead.

Here are the exact steps I follow:

Open command prompt and clone repo (I prefer command prompt over powershell)

git clone https://github.com/Dao-AILab/flash-attention fa
cd fa

create rocmvariables.bat (not sure which of these are actually needed, but they work for me), save to fa folder

for /f "delims=" %%i in ('rocm-sdk path --root') do set "ROCM_ROOT=%%i"
for /f "delims=" %%i in ('rocm-sdk path --bin') do set "ROCM_BIN=%%i"

:: Set environment variables
set "ROCM_HOME=%ROCM_ROOT%"
set "PATH=%ROCM_ROOT%\lib\llvm\bin;%ROCM_BIN%;%PATH%"

:: Set compiler and build settings
set "CC=clang-cl"
set "CXX=clang-cl"
set "DISTUTILS_USE_SDK=1"

:: Enable experimental features
set "FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE"
set "TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1"

create and activate venv, install nightly packages

python -m venv venv
venv\scripts\activate
pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ torch torchaudio torchvision "rocm[libraries,devel]"
rocm-sdk init
rocmvariables.bat

install triton-windows 3.6.0 package from this page

run build

python setup.py bdist_wheel 

or, if you want to use the current venv for your app

python setup.py install

Should just take a few seconds.
You can install that wheel from dist\ using pip in any other venv. I've read that you have to set "FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE" before you start an app or it won't use the correct code path, so I added it to the comfy.bat file that I run each time I load ComfyUI. I also tried setting FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE, but I haven't seen a performance difference yet.
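A quick smoke test for the installed wheel (a sketch; it assumes the venv above and, as noted, that the flag is set before flash_attn is imported):

```python
import os
# Must be in the environment before flash_attn is imported,
# otherwise the Triton AMD code path is not selected.
os.environ.setdefault("FLASH_ATTENTION_TRITON_AMD_ENABLE", "TRUE")

import torch
from flash_attn import flash_attn_func

# (batch, seqlen, heads, head_dim) fp16 tensors on the HIP device.
q = torch.randn(1, 128, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = flash_attn_func(q, k, v, causal=True)
print(out.shape)  # torch.Size([1, 128, 8, 64])
```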

On a side note, TheRock Windows nightly builds are working again, and my image generation times dropped by around a fifth with these new builds compared to the last one back in mid-December. The initial image generation always seems to take a good 20 seconds longer than the subsequent ones, partially from loading the model, but it also seems like the VAE decode step takes longer the first time after ComfyUI is started.

@IxMxAMAR

IxMxAMAR commented Jan 6, 2026

I followed your steps and it worked; it built in a few seconds.
I am using TheRock's builds of ROCm 7.1.1, torch 2.11, and triton 3.5.1 from post23.

[DONE] Security scan
** ComfyUI startup time: 2026-01-07 01:48:27.015
** Platform: Windows
** Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
** Python executable: C:\ComfyUI\venv\Scripts\python.exe
** ComfyUI Path: C:\ComfyUI
** ComfyUI Base Folder Path: C:\ComfyUI
** User directory: C:\ComfyUI\user
** ComfyUI-Manager config path: C:\ComfyUI\user\__manager\config.ini
** Log path: C:\ComfyUI\user\comfyui.log
[PRE] ComfyUI-Manager

Prestartup times for custom nodes:
   0.0 seconds: C:\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\ComfyUI\custom_nodes\comfyui-easy-use

Checkpoint files will always be loaded safely.
Total VRAM 8176 MB, total RAM 16184 MB
pytorch version: 2.11.0a0+rocm7.11.0a20251217
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1102
ROCm version: (7, 2)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7600 : native
Using async weight offloading with 2 streams
Enabled pinned memory 7282.0
Traceback (most recent call last):
  File "C:\ComfyUI\main.py", line 177, in <module>
    import execution
  File "C:\ComfyUI\execution.py", line 16, in <module>
    from latent_preview import set_preview_method
  File "C:\ComfyUI\latent_preview.py", line 5, in <module>
    from comfy.sd import VAE
  File "C:\ComfyUI\comfy\sd.py", line 13, in <module>
    import comfy.ldm.genmo.vae.model
  File "C:\ComfyUI\comfy\ldm\genmo\vae\model.py", line 13, in <module>
    from comfy.ldm.modules.attention import optimized_attention
  File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 42, in <module>
    from flash_attn import flash_attn_func
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_interface.py", line 13, in <module>
    from .flash_attn_triton_amd import interface_fa as flash_attn_gpu
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\interface_fa.py", line 8, in <module>
    from .fwd_decode import attention_decode_forward_triton_impl
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\fwd_decode.py", line 58, in <module>
    (fwd_auto_tune_configs, fwd_autotune_keys), (reduce_auto_tune_configs, reduce_autotune_keys) = get_autotune_configs()
                                                                                                   ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\fwd_decode.py", line 34, in get_autotune_configs
    raise ValueError("Unknown Device Type")
ValueError: Unknown Device Type
Press any key to continue . . .

This is what I get when I try to run ComfyUI with this:

```
@echo off
setlocal

set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE

call "C:\ComfyUI\venv\Scripts\activate.bat"
python main.py --use-flash-attention --enable-manager --enable-manager-legacy-ui
pause
```

Edit: I got it to work with set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=FALSE

@0xDELUXA

0xDELUXA commented Jan 6, 2026

edit: i got it to work with set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=FALSE

Yeah, autotune only works on CDNA AFAIK.
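Not the actual flash-attn source, just a sketch of the failure mode from the tracebacks above: with autotune enabled, the decode path only knows autotune tables for CDNA parts, so RDNA archs fall through to the "Unknown Device Type" error.

```python
def get_autotune_configs_sketch(arch: str, autotune: bool):
    """Rough model of what the fwd_decode autotune selection appears to do."""
    if not autotune:
        return "small default config set"       # AUTOTUNE=FALSE: works on RDNA
    if arch.startswith("gfx9"):                 # CDNA (MI-series) parts
        return "CDNA autotune config set"
    raise ValueError("Unknown Device Type")     # what gfx1102 (RDNA3) hits

get_autotune_configs_sketch("gfx1102", autotune=False)   # fine
# get_autotune_configs_sketch("gfx1102", autotune=True)  # raises ValueError
```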

@sfinktah

sfinktah commented Jan 7, 2026

AMD's implementation of flash-attn for Triton was conspicuously missing code that would make it work much better (or at all) on RDNA. I had a play adding the required code a while back (based on the older PatientX/Zluda version) but ultimately I was just making up numbers that looked like they might be right (and very probably weren't).

See these patches: https://github.com/sfinktah/amd-torch/blob/main/patches/flash_attn-2.8.1%2Btriton_amd_git2df1727-py3-none-any.patch

You'll note the addition of an is_rdna check and some additions to the function itself -- haven't checked it against the latest flash-attn, but I'm sure @jammm could tweak it real nice, assuming he hasn't already done so.

@rwfsmith

rwfsmith commented Jan 7, 2026

AMD's implementation of flash-attn for Triton was conspicuously missing code that would make it work much better (or at all) on RDNA. I had a play adding the required code a while back (based on the older PatientX/Zluda version) but ultimately I was just making up numbers that looked like they might be right (and very probably weren't).

See these patches: https://github.com/sfinktah/amd-torch/blob/main/patches/flash_attn-2.8.1%2Btriton_amd_git2df1727-py3-none-any.patch

You'll note the addition of an is_rnda and some additions to the function itself -- haven't checked it against the last flash-attn, but I'm sure @jammm could tweak it real nice, assuming he hasn't already done so.

Yeah, I was trying to use it to train a model and getting some odd errors. It seemed fine with diffusion models, I think? Sometimes ComfyUI will silently fall back to other methods when there are failures, though.

C:\projects\training\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\fwd_prefill.py:242:0: error: Failures have been detected while processing an MLIR pass pipeline
C:\projects\training\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\fwd_prefill.py:242:0: note: Pipeline failed while executing [TritonAMDFoldTrueCmpI on 'builtin.module' operation]: reproducer generated at std::errs, please share the reproducer above with Triton project.

@sfinktah

@rwfsmith I confess I've never used flash-attention except as a way to save VRAM (and have WVW use sage-attention anyway). On Nvidia, it takes overnight to compile, and it still has a strange run-time warning. It's nice to know flash-attention is difficult for everyone, and it's not just an AMD thing.

Did those errors appear with my patches, with the ZLUDA/PatientX version, or some other variation?

One thing I learned while optimising various attentions is that some errors are fine as long as it picks a combination of parameters that result in a valid result. Quite possibly some of those errors could be a result of badly chosen parameters and if someone took the time to work out what was valid for what GPU then they might go away, but things tend to sort themselves out in the end. The exception being when something invalid DOESN'T generate an error, and you end up with all-black output. BLOCK_M: 64, BLOCK_N:16 is an example of this for sage attention when running under TheRock (but works fine, and is faster than 32x16 when using ZLUDA).

Also, I like to have these two environment variables set; they help you understand things a little more, and probably make things faster (assuming the cache isn't on by default, which I do not believe it is).

set TRITON_PRINT_AUTOTUNING=1
set TRITON_CACHE_AUTOTUNING=1

In (for example) the patch I linked above, you can see a bunch of numbers that some AI said might be good (don't trust AI for such things, they're horrific). You can also see a comment that I pasted out of my comfy log, showing which one it selected.

# best config selected: BLOCK_M: 32, BLOCK_N: 16, waves_per_eu: 2, PRE_LOAD_V: False, num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None;
def get_rdna_autotune_configs():
    return [
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 64, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 64, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 64, "waves_per_eu": 1, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 32, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 16, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 32, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 32, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 16, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 16, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
         triton.Config( {"BLOCK_M": 16, "BLOCK_N": 16, "waves_per_eu": 1, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
    ], ['IS_CAUSAL', 'dropout_p', 'MAX_SEQLENS_Q', 'MAX_SEQLENS_K', 'ACTUAL_BLOCK_DMODEL', 'VARLEN', 'HQ', 'HK']

@IxMxAMAR

IxMxAMAR commented Jan 10, 2026

@rwfsmith I confess I've never used flash-attention except as a way to save VRAM (and have WVW use sage-attention anyway). On Nvidia, it takes overnight to compile, and it still has a strange run-time warning. It's nice to know flash-attention is difficult for everyone, and it's not just an AMD thing.

Did those errors appear with my patches, with the ZLUDA/PatientX version, or some other variation?

One thing I learned while optimising various attentions is that some errors are fine as long as it picks a combination of parameters that result in a valid result. Quite possibly some of those errors could be a result of badly chosen parameters and if someone took the time to work out what was valid for what GPU then they might go away, but things tend to sort themselves out in the end. The exception being when something invalid DOESN'T generate an error, and you end up with all-black output. BLOCK_M: 64, BLOCK_N:16 is an example of this for sage attention when running under TheRock (but works fine, and is faster than 32x16 when using ZLUDA).

Also, I like to have these two environment variables set, it help you understand things a little more, and probably goes faster (assuming CACHE isn't default on, which I do not believe it is).

set TRITON_PRINT_AUTOTUNING=1
set TRITON_CACHE_AUTOTUNING=1

In (for example) the patch I linked above, you can see a bunch of numbers that some AI said might be good (don't trust AI for such things, they're horrific). You can also see a comment that I pasted out of my comy log, showing which one it selected.

# best config selected: BLOCK_M: 32, BLOCK_N: 16, waves_per_eu: 2, PRE_LOAD_V: False, num_warps: 2, num_ctas: 1, num_stages: 1, maxnreg: None;
def get_rdna_autotune_configs():
    return [
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 64, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 64, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 64, "waves_per_eu": 1, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 32, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 64, "BLOCK_N": 16, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 32, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 32, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 16, "waves_per_eu": 4, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
        triton.Config( {"BLOCK_M": 32, "BLOCK_N": 16, "waves_per_eu": 2, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
         triton.Config( {"BLOCK_M": 16, "BLOCK_N": 16, "waves_per_eu": 1, "PRE_LOAD_V": False}, num_stages=1, num_warps=4, ),
    ], ['IS_CAUSAL', 'dropout_p', 'MAX_SEQLENS_Q', 'MAX_SEQLENS_K', 'ACTUAL_BLOCK_DMODEL', 'VARLEN', 'HQ', 'HK']

How did you even get it working? I have an RX 7600, and with these params:
set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=FALSE
I get

Flash Attention failed, using default SDPA: schema_.has_value() INTERNAL ASSERT FAILED at "B:\\src\\torch\\aten\\src\\ATen/core/dispatch/OperatorEntry.h":84, please report a bug to PyTorch. Tried to access the schema for  which doesn't have a schema registered yet
Flash Attention failed, using default SDPA: schema_.has_value() INTERNAL ASSERT FAILED at "B:\\src\\torch\\aten\\src\\ATen/core/dispatch/OperatorEntry.h":84, please report a bug to PyTorch. Tried to access the schema for  which doesn't have a schema registered yet
Flash Attention failed, using default SDPA: schema_.has_value() INTERNAL ASSERT FAILED at "B:\\src\\torch\\aten\\src\\ATen/core/dispatch/OperatorEntry.h":84, please report a bug to PyTorch. Tried to access the schema for  which doesn't have a schema registered yet 

a bunch of these

and with
set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE
i get

[START] Security scan
[DONE] Security scan
** ComfyUI startup time: 2026-01-10 21:08:45.666
** Platform: Windows
** Python version: 3.12.10 (tags/v3.12.10:0cc8128, Apr  8 2025, 12:21:36) [MSC v.1943 64 bit (AMD64)]
** Python executable: C:\ComfyUI\venv\Scripts\python.exe
** ComfyUI Path: C:\ComfyUI
** ComfyUI Base Folder Path: C:\ComfyUI
** User directory: C:\ComfyUI\user
** ComfyUI-Manager config path: C:\ComfyUI\user\__manager\config.ini
** Log path: C:\ComfyUI\user\comfyui.log
[PRE] ComfyUI-Manager

Prestartup times for custom nodes:
   0.0 seconds: C:\ComfyUI\custom_nodes\rgthree-comfy
   0.0 seconds: C:\ComfyUI\custom_nodes\comfyui-easy-use

Checkpoint files will always be loaded safely.
Total VRAM 8176 MB, total RAM 16184 MB
pytorch version: 2.11.0a0+rocm7.11.0a20251217
Set: torch.backends.cudnn.enabled = False for better AMD performance.
AMD arch: gfx1102
ROCm version: (7, 2)
Set vram state to: NORMAL_VRAM
Device: cuda:0 AMD Radeon RX 7600 : native
Using async weight offloading with 2 streams
Enabled pinned memory 7282.0
Traceback (most recent call last):
  File "C:\ComfyUI\main.py", line 177, in <module>
    import execution
  File "C:\ComfyUI\execution.py", line 16, in <module>
    from latent_preview import set_preview_method
  File "C:\ComfyUI\latent_preview.py", line 5, in <module>
    from comfy.sd import VAE
  File "C:\ComfyUI\comfy\sd.py", line 13, in <module>
    import comfy.ldm.genmo.vae.model
  File "C:\ComfyUI\comfy\ldm\genmo\vae\model.py", line 13, in <module>
    from comfy.ldm.modules.attention import optimized_attention
  File "C:\ComfyUI\comfy\ldm\modules\attention.py", line 42, in <module>
    from flash_attn import flash_attn_func
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\__init__.py", line 3, in <module>
    from flash_attn.flash_attn_interface import (
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_interface.py", line 13, in <module>
    from .flash_attn_triton_amd import interface_fa as flash_attn_gpu
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\interface_fa.py", line 8, in <module>
    from .fwd_decode import attention_decode_forward_triton_impl
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\fwd_decode.py", line 73, in <module>
    (fwd_auto_tune_configs, fwd_autotune_keys), (reduce_auto_tune_configs, reduce_autotune_keys) = get_autotune_configs()
                                                                                                   ^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\fwd_decode.py", line 49, in get_autotune_configs
    raise ValueError("Unknown Device Type")
ValueError: Unknown Device Type
Press any key to continue . . .

So I give up even considering running it. PS: this is the whole of my Launch.bat:

@echo off
setlocal

set SAGE_ATTENTION_TRITON_AMD_ENABLE=TRUE
set SAGE_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE
set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE
set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=FALSE
set TRITON_PRINT_AUTOTUNING=1
set TRITON_CACHE_AUTOTUNING=1
set MIOPEN_FIND_MODE=2
set MIOPEN_LOG_LEVEL=3
set TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL=1
set PYTORCH_ROCM_ARCH=gfx1102
set HSA_OVERRIDE_GFX_VERSION=11.0.2
set HIP_VISIBLE_DEVICES=0

call "C:\ComfyUI\venv\Scripts\activate.bat"
python main.py --use-sage-attention --enable-manager --enable-manager-legacy-ui
pause

@0xDELUXA

0xDELUXA commented Jan 10, 2026

how did you even get it working, I have an Rx7600 and with these params- set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=FALSE I get

Did you build Flash2 as in ROCm/TheRock#1278 (comment)?

and with set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE i get

set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE on RDNA will not work at all. It's CDNA only.

Also, we don't have things like:

set SAGE_ATTENTION_TRITON_AMD_ENABLE=TRUE
set SAGE_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE

@IxMxAMAR

how did you even get it working, I have an Rx7600 and with these params- set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=FALSE I get

Did you build Flash2 as: ROCm/TheRock#1278 (comment)?

and with set FLASH_ATTENTION_TRITON_AMD_ENABLE=TRUE set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE i get

set FLASH_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE on RDNA will not work at all. It's CDNA only.

Also. we don't have things like:

set SAGE_ATTENTION_TRITON_AMD_ENABLE=TRUE
set SAGE_ATTENTION_TRITON_AMD_AUTOTUNE=TRUE

Can you suggest a stable ROCm and PyTorch version? I reinstalled from the latest TheRock builds and now keep getting this:

got prompt
model weight dtype torch.bfloat16, manual cast: None
model_type FLOW
unet unexpected: ['norm_final.weight']
Using split attention in VAE
Using split attention in VAE
VAE load device: cuda:0, offload device: cpu, dtype: torch.bfloat16
CLIP/text encoder model load device: cuda:0, offload device: cpu, current: cpu, dtype: torch.float16
Requested to load ZImageTEModel_
loaded partially; 6612.93 MB usable, 6512.25 MB loaded, 1160.00 MB offloaded, 100.00 MB buffer reserved, lowvram patches: 0
0 models unloaded.
Unloaded partially: 100.37 MB freed, 6411.88 MB remains loaded, 100.00 MB buffer reserved, lowvram patches: 0
Requested to load Lumina2
FETCH ComfyRegistry Data [DONE]
[ComfyUI-Manager] default cache updated: https://api.comfy.org/nodes
FETCH DATA from: C:\ComfyUI\user\__manager\cache\1514988643_custom-node-list.json [DONE]
[ComfyUI-Manager] All startup tasks have been completed.
loaded partially; 5992.02 MB usable, 5767.02 MB loaded, 5972.52 MB offloaded, 225.00 MB buffer reserved, lowvram patches: 0
  0%|                                                                                           | 0/10 [00:00<?, ?it/s]
rocblaslt error: Cannot read "C:\\ComfyUI\\venv\\Lib\\site-packages\\_rocm_sdk_libraries_gfx110X_all\\bin\\hipblaslt\\library\\TensileLibrary_lazy_gfx1102.dat": No error

rocblaslt error: Could not load "C:\\ComfyUI\\venv\\Lib\\site-packages\\_rocm_sdk_libraries_gfx110X_all\\bin\\hipblaslt\\library\\TensileLibrary_lazy_gfx1102.dat"
hipModuleLoad failed: C:\ComfyUI\venv\Lib\site-packages\_rocm_sdk_libraries_gfx110X_all\bin\hipblaslt\library/Kernels.so-000-gfx1102.hsaco
 error: file not found
hipModuleLoad failed: C:\ComfyUI\venv\Lib\site-packages\_rocm_sdk_libraries_gfx110X_all\bin\hipblaslt\library/Kernels.so-000-gfx1102-xnack-.hsaco
 error: file not found
hipModuleLoad failed: C:\ComfyUI\venv\Lib\site-packages\_rocm_sdk_libraries_gfx110X_all\bin\hipblaslt\library/Kernels.so-000-gfx1102-xnack+.hsaco
 error: file not found
  0%|                                                                                           | 0/10 [00:00<?, ?it/s]
!!! Exception during processing !!! CUDA error: HIPBLAS_STATUS_INVALID_VALUE when calling `hipblasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`
Traceback (most recent call last):
  File "C:\ComfyUI\execution.py", line 518, in execute
    output_data, output_ui, has_subgraph, has_pending_tasks = await get_output_data(prompt_id, unique_id, obj, input_data_all, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                                                              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\execution.py", line 329, in get_output_data
    return_values = await _async_map_node_over_list(prompt_id, unique_id, obj, input_data_all, obj.FUNCTION, allow_interrupt=True, execution_block_cb=execution_block_cb, pre_execute_cb=pre_execute_cb, v3_data=v3_data)
                    ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\execution.py", line 303, in _async_map_node_over_list
    await process_inputs(input_dict, i)
  File "C:\ComfyUI\execution.py", line 291, in process_inputs
    result = f(**inputs)
             ^^^^^^^^^^^
  File "C:\ComfyUI\nodes.py", line 1538, in sample
    return common_ksampler(model, seed, steps, cfg, sampler_name, scheduler, positive, negative, latent_image, denoise=denoise)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\nodes.py", line 1505, in common_ksampler
    samples = comfy.sample.sample(model, noise, steps, cfg, sampler_name, scheduler, positive, negative, latent_image,
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\sample.py", line 60, in sample
    samples = sampler.sample(noise, positive, negative, cfg=cfg, latent_image=latent_image, start_step=start_step, last_step=last_step, force_full_denoise=force_full_denoise, denoise_mask=noise_mask, sigmas=sigmas, callback=callback, disable_pbar=disable_pbar, seed=seed)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1178, in sample
    return sample(self.model, noise, positive, negative, cfg, self.device, sampler, sigmas, self.model_options, latent_image=latent_image, denoise_mask=denoise_mask, callback=callback, disable_pbar=disable_pbar, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1068, in sample
    return cfg_guider.sample(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 1050, in sample
    output = executor.execute(noise, latent_image, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 994, in outer_sample
    output = self.inner_sample(noise, latent_image, device, sampler, sigmas, denoise_mask, callback, disable_pbar, seed, latent_shapes=latent_shapes)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 980, in inner_sample
    samples = executor.execute(self, sigmas, extra_args, callback, noise, latent_image, denoise_mask, disable_pbar)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 752, in sample
    samples = self.sampler_function(model_k, noise, sigmas, extra_args=extra_args, callback=k_callback, disable=disable_pbar, **self.extra_options)
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\utils\_contextlib.py", line 120, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\k_diffusion\sampling.py", line 1522, in sample_er_sde
    denoised = model(x, sigmas[i] * s_in, **extra_args)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 401, in __call__
    out = self.inner_model(x, sigma, model_options=model_options, seed=seed)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 953, in __call__
    return self.outer_predict_noise(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 960, in outer_predict_noise
    ).execute(x, timestep, model_options, seed)
      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 963, in predict_noise
    return sampling_function(self.inner_model, x, timestep, self.conds.get("negative", None), self.conds.get("positive", None), self.cfg, model_options=model_options, seed=seed)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 381, in sampling_function
    out = calc_cond_batch(model, conds, x, timestep, model_options)
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 206, in calc_cond_batch
    return _calc_cond_batch_outer(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 214, in _calc_cond_batch_outer
    return executor.execute(model, conds, x_in, timestep, model_options)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\samplers.py", line 326, in _calc_cond_batch
    output = model.apply_model(input_x, timestep_, **c).chunk(batch_chunks)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\model_base.py", line 162, in apply_model
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\model_base.py", line 204, in _apply_model
    model_output = self.diffusion_model(xc, t, context=context, control=control, transformer_options=transformer_options, **extra_conds)
                   ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\lumina\model.py", line 600, in forward
    return comfy.patcher_extension.WrapperExecutor.new_class_executor(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\patcher_extension.py", line 112, in execute
    return self.original(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\lumina\model.py", line 619, in _forward
    t = self.t_embedder(t * self.time_scale, dtype=x.dtype)  # (N, D)
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ldm\modules\diffusionmodules\mmdit.py", line 227, in forward
    t_emb = self.mlp(t_freq)
            ^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\container.py", line 250, in forward
    input = module(input)
            ^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1775, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\module.py", line 1786, in _call_impl
    return forward_call(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\comfy\ops.py", line 166, in forward
    return super().forward(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\ComfyUI\venv\Lib\site-packages\torch\nn\modules\linear.py", line 134, in forward
    return F.linear(input, self.weight, self.bias)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: CUDA error: HIPBLAS_STATUS_INVALID_VALUE when calling `hipblasLtMatmulAlgoGetHeuristic( ltHandle, computeDesc.descriptor(), Adesc.descriptor(), Bdesc.descriptor(), Cdesc.descriptor(), Cdesc.descriptor(), preference.descriptor(), 1, &heuristicResult, &returnedResult)`

Prompt executed in 99.26 seconds

@0xDELUXA

0xDELUXA commented Jan 10, 2026

Can you suggest a stable RoCM and Pytorch version, I reinstalled from the latest TheRock builds and now keep getting this

As I said, I'm gfx1200-only, can't try other archs. That looks like an arch specific issue.
The latest torch can be installed with:
pip install --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ --pre torch torchaudio torchvision rocm[devel]
Then rocm-sdk init for triton-windows to work.
Activating a Visual Studio environment is also advised:
cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } }
Then set the ROCm paths using rocm-sdk:

$ROCM_ROOT = (rocm-sdk path --root).Trim()
$ROCM_BIN = (rocm-sdk path --bin).Trim()
$env:ROCM_HOME = $ROCM_ROOT
$env:PATH = "$ROCM_ROOT\lib\llvm\bin;$ROCM_BIN;$env:PATH"

Set compiler and build settings

$env:CC = "clang-cl"
$env:CXX = "clang-cl"
$env:DISTUTILS_USE_SDK = "1"

Enable experimental features

$env:FLASH_ATTENTION_TRITON_AMD_ENABLE = "TRUE"
$env:TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL = "1"

This is based on: https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md
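After that setup, a quick way to confirm the stack is wired up (a sketch; assumes torch and triton-windows are installed in the active venv and rocm-sdk init has been run):

```python
import torch
import triton

print(torch.__version__, torch.version.hip)  # should show a ROCm/HIP build of torch
print(torch.cuda.is_available())             # True once the HIP runtime DLLs are found
print(torch.cuda.get_device_name(0))         # e.g. "AMD Radeon RX 7600"
print(triton.__version__)                    # the triton-windows version
```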

@IxMxAMAR

IxMxAMAR commented Jan 11, 2026

Can you suggest a stable RoCM and Pytorch version, I reinstalled from the latest TheRock builds and now keep getting this

As I said, I'm gfx1200-only, can't try other archs. That looks like an arch specific issue. The latest torch can be installed with: pip install --index-url https://rocm.nightlies.amd.com/v2/gfx110X-all/ --pre torch torchaudio torchvision rocm[devel] Then rocm-sdk init for triton-windows to work. Activating a Visual Studio environment is also advised: cmd /c '"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\Build\vcvars64.bat" >nul 2>&1 && set' | ForEach-Object { if ($_ -match '^([^=]+)=(.*)$') { [System.Environment]::SetEnvironmentVariable($matches[1], $matches[2], 'Process') } } Then: Set ROCm paths using rocm-sdk

$ROCM_ROOT = (rocm-sdk path --root).Trim()
$ROCM_BIN = (rocm-sdk path --bin).Trim()
$env:ROCM_HOME = $ROCM_ROOT
$env:PATH = "$ROCM_ROOT\lib\llvm\bin;$ROCM_BIN;$env:PATH"

Set compiler and build settings

$env:CC = "clang-cl"
$env:CXX = "clang-cl"
$env:DISTUTILS_USE_SDK = "1"

Enable experimental features

$env:FLASH_ATTENTION_TRITON_AMD_ENABLE = "TRUE"
$env:TORCH_ROCM_AOTRITON_ENABLE_EXPERIMENTAL = "1"

This is based on: https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md

Went through a lot and finally fixed everything. Now using --use-flash-attention doesn't give any errors, but the speeds aren't any better than SageAttn 1.0.6 even though FlashAttn is 2.8.3. And one more thing: these flags:

set TRITON_PRINT_AUTOTUNING=1
set TRITON_CACHE_AUTOTUNING=1

should make Triton print the configs it selects, which it does with --use-sage-attention but not with --use-flash-attention, so I doubt whether it's fully working :( Any way to verify?

For comparison I get:
Standard SDPA: 7.3-7.6s/it
FlashAttn: 6.9-7s/it
SageAttn: 6.1-6.5s/it

Currently for 1088x1920 image from Z Image Turbo

Also, the triton-windows 3.6.0.post23 from that repo is giving that bug with the directory path length, I think, which got fixed in the official triton-windows 3.5.1.post24.

@woct0rdho
Owner

woct0rdho commented Jan 12, 2026

I think we need to put a disclaimer somewhere: SageAttention is faster than FlashAttention (and other well-implemented fp16/bf16 attentions) only if the GPU has higher int8 performance than fp16/bf16.

GPUs with RDNA3, including RX 7000 series and Strix Halo, do not.

For GPUs with RDNA4, including RX 9000 series, there should be a way to do it.
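A small sketch that encodes this rule of thumb using the gcnArchName reported by ROCm builds of PyTorch (the gfx prefix mapping here is an assumption based on this thread):

```python
import torch

arch = torch.cuda.get_device_properties(0).gcnArchName  # e.g. "gfx1102"

if arch.startswith("gfx11"):
    # RDNA3 class (RX 7000 series, Strix Halo): int8 is not faster than fp16/bf16,
    # so SageAttention is unlikely to beat a well-implemented fp16 attention.
    print(arch, "-> prefer fp16/bf16 attention (FlashAttention, SDPA)")
elif arch.startswith("gfx12"):
    # RDNA4 class (RX 9000 series): an int8 throughput advantage exists.
    print(arch, "-> int8 (Sage-style) attention should be able to pay off")
else:
    print(arch, "-> see the discussion above")
```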

@0xDELUXA

A few people are asking me so I think we need to put a disclaimer somewhere: SageAttention is faster than FlashAttention (and other well-implemented fp16/bf16 attentions) only if the GPU has higher int8 performance than fp16/bf16. GPUs with RDNA3, such as RX7600 and Strix Halo, do not.

That’s why, for me on RDNA4, Sage is the better option. Also, correct me if I’m wrong, but the Sage we can use now with triton-windows is Sage v1, and Jam’s SpargeAttention PR theoretically works more like Sage v2 (it isn't ready for RDNA4 yet).

@rwfsmith

I think we need to put a disclaimer somewhere: SageAttention is faster than FlashAttention (and other well-implemented fp16/bf16 attentions) only if the GPU has higher int8 performance than fp16/bf16.

GPUs with RDNA3, including RX 7000 series and Strix Halo, do not.

For GPUs with RDNA4, including RX 9000 series, there should be a way to do it.

I read this somewhere and I've been telling people the same thing while hoping it was correct, thank you for confirming it for me :).
