Add Windows/clang-cl support for AMD HIP backend #179
woct0rdho merged 1 commit into woct0rdho:release/3.5.x-windows from
Conversation
- Use LoadLibrary/GetProcAddress on Windows instead of dlopen/dlsym
- Use rocm_sdk.find_libraries() to locate amdhip64
- Add platform-specific macros for dynamic library loading
- Escape Windows paths for C string embedding
- Treat clang-cl as MSVC-compatible compiler in build.py
- Fix NamedTemporaryFile handling on Windows in compiler.py
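As a rough illustration of the loading strategy described above (a hypothetical Python sketch, not the PR's actual C driver code; the exact shape of rocm_sdk.find_libraries() is assumed):

```python
# Illustrative sketch only: locate and load the HIP runtime in a
# platform-specific way, mirroring the LoadLibrary-vs-dlopen split above.
import ctypes
import sys

def load_amdhip64():
    if sys.platform == "win32":
        # ctypes.WinDLL wraps LoadLibrary; the rocm_sdk helper from TheRock
        # wheels is assumed here to return a list of paths to amdhip64.dll.
        import rocm_sdk
        path = rocm_sdk.find_libraries("amdhip64")[0]  # assumed return shape
        return ctypes.WinDLL(str(path))
    # ctypes.CDLL wraps dlopen on Linux.
    return ctypes.CDLL("libamdhip64.so")

if __name__ == "__main__":
    hip = load_amdhip64()
    # hipGetDeviceCount is a real HIP API; resolving it through the handle
    # mirrors GetProcAddress/dlsym.
    count = ctypes.c_int(0)
    hip.hipGetDeviceCount(ctypes.byref(count))
    print("HIP devices:", count.value)
```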
|
Looks good to me! I haven't followed the modern AMD toolchain for a while, but if this is enough to make it work, then it will not add much maintenance cost. Maybe you can also tell people at https://github.com/patientx/ComfyUI-Zluda and https://github.com/lshqqytiger/triton about this. |
|
I've cherry-picked this onto the upcoming |
|
Also, can you test the Triton 3.5 wheel at https://github.com/Comfy-Org/wheels/actions/runs/20599014618 , in the way that users would install it? If it works, I'll publish it to PyPI. |
Just tried generated_video.mp4 |
|
The Python test examples from Triton finally work with my gfx1100. But when I try torch.compile via Inductor in ComfyUI, it fails:
But this seems to be a bug in LLVM.
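A minimal torch.compile smoke test (not the ComfyUI workload above) that exercises the Inductor/Triton path on the GPU might look like this:

```python
# Small sanity check for the Inductor/Triton path; ROCm builds of PyTorch
# expose the GPU through the "cuda" device string.
import torch

def f(x, y):
    return torch.nn.functional.silu(x) * y + x.sum(dim=-1, keepdim=True)

compiled = torch.compile(f)  # Inductor is the default backend

x = torch.randn(64, 256, device="cuda")
y = torch.randn(64, 256, device="cuda")
out = compiled(x, y)
torch.testing.assert_close(out, f(x, y), rtol=1e-3, atol=1e-3)
print("torch.compile OK")
```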
How did you achieve this? I have an RX 7600; can I do this too? Can you share your ComfyUI-run.bat, or the args and env variables you are using? And is it official ComfyUI or ZLUDA? |
First of all, thanks for working on this, much appreciated. Is installing it like this OK, since it seems like you added it: "pip install triton-windows", which installs triton-windows 3.5.1.post23? I then installed sage-attention with "pip install sageattention==1.0.6" and flash-attention as well, but in the end both gave errors. I set up the parameters like this in the starter batch for ComfyUI:
set CC=clang-cl
set CXX=clang-cl
set DISTUTILS_USE_SDK=1
for /f "delims=" %%i in ('python -c "import rocm; print(rocm.__path__[0])"') do set ROCM_HOME=%%i
Since rocm is installed as a package, that last one should work, right?
|
Haven't tried building sage attention, but you could follow the environment variable setup here: https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md#initialize-rocm-sdk
|
I’m curious about Sage too... |
It turns out I only needed to run "rocm-sdk init" after activating the venv. It works if this is done on the command line only, and with batch files too. torch.compile now works but generates black output; SageAttention, as deluxa says, also doesn't work. Edit: sage-attention works with the patches on my comfyui-zluda fork.
I also compiled SpargeAttn, but how exactly am I supposed to use it with ComfyUI? Any idea? |
I don't know what changes your fork has, but why don't you make a PR here so everyone can use SageAttention? I'll try Sage myself soon too. What about the performance vs SDPA flash? |
On my RX 6800, SDPA is the slowest, slower than quad-cross which I was using by default. The SageAttention "patches" were made by someone on the sdnext Discord, and I was applying them on every install with the ZLUDA setup; interestingly, I didn't need them with lee's torch builds from May 2024 (they were using HIP 6.5, though). Now it seems the same patches also work here. Just replace these three files; actually, here are curl commands to apply them directly when the venv is activated and you're inside the ComfyUI directory.
|
Int8? I think these are for RDNA2 or 3; I don't think RDNA4 needs them. Will try soon, though. Edit: Yes, it does. I'm getting a lot of errors coming from |
|
Yes, that works. It did do a nasty crash the first time, which I am saving here not as a complaint, but as a reference for a possible side project: "Write a program to force an AMD driver crash in order to free up all that VRAM that dwm never gives back." FYI, the ZLUDA sageattn is basically just a patch to change the parameters to
Otherwise it uses too much "shared memory" and produces black screens. See also https://raw.githubusercontent.com/sfinktah/amd-torch/refs/heads/main/patches/sageattention-1.0.6+sfinktah+env-py3-none-any.patch which is an environment-variable-adjustable version of the same thing. |
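For readers who don't open the patch, the "environment-variable-adjustable" part amounts to something like the sketch below; SAGE_BLOCK_M / SAGE_BLOCK_N are illustrative names and the defaults are placeholders, not necessarily what the patch uses:

```python
# Sketch of env-var-overridable tile sizes in the spirit of the linked patch.
import os

def _int_env(name: str, default: int) -> int:
    value = os.environ.get(name)
    return int(value) if value else default

# Smaller tiles use less LDS ("shared memory"), which is what avoids the
# black-output failures described above on some RDNA parts.
BLOCK_M = _int_env("SAGE_BLOCK_M", 32)
BLOCK_N = _int_env("SAGE_BLOCK_N", 16)
```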
I have tried many, many combinations besides these, and none of them worked. Most of the time I got garbled noise instead of an image, and sometimes I got a black-and-white frame of what was supposed to be the subject. I have an RX 7600. |
|
I was able to use this to get Flash Attention 2 running in ComfyUI on Windows. I ran some SDXL performance tests with pretty decent results, using an SDXL fine-tune with a DMD2 LoRA and an upscaler KSampler step. I compared Flash Attention against PyTorch cross attention across 10 image generations, listing the average it/s for both the base and upscaler samplers at the end.
Edit: I also just tested with SageAttention 1, but the results seem to be the same as cross attention. |
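For anyone who wants a backend-to-backend comparison outside ComfyUI, a standalone micro-benchmark along these lines works (assumes a flash-attn build with its Triton/AMD backend; shapes are arbitrary, not the SDXL workload above):

```python
# Rough timing sketch comparing PyTorch SDPA with flash_attn_func.
import torch
import torch.nn.functional as F
from flash_attn import flash_attn_func

B, H, S, D = 2, 16, 4096, 64
q = torch.randn(B, H, S, D, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

def bench(fn, iters=20):
    for _ in range(3):  # warm-up
        fn()
    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    for _ in range(iters):
        fn()
    end.record()
    torch.cuda.synchronize()
    return start.elapsed_time(end) / iters

sdpa_ms = bench(lambda: F.scaled_dot_product_attention(q, k, v))
# flash_attn_func expects (batch, seqlen, heads, headdim) layout.
q2, k2, v2 = (t.transpose(1, 2).contiguous() for t in (q, k, v))
fa_ms = bench(lambda: flash_attn_func(q2, k2, v2))
print(f"SDPA: {sdpa_ms:.2f} ms/iter, flash-attn: {fa_ms:.2f} ms/iter")
```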
How did you do that? I have also compiled Sparge and Sage but haven't tried Flash yet. Flash 2 seems even better, so how can I? |
I was also able to; based on my results, Flash 2 is slower than AOTriton SDPA Flash on RDNA4. |
I just followed the AMD steps in the Flash Attention repo README file; it worked pretty well. I've been having trouble getting the Sparge/Sage stuff working after building it, though.
I think that's because AOTriton has FP8 kernels for RDNA4? I think I read something about that in another thread. That may have been about why it's faster than SageAttention 1, but that could be a similar cause. I'm mostly interested in using Flash Attention for training, since it's supposed to speed it up quite a bit and help reduce memory usage. |
I think so too. Curious how Sparge will work on RDNA4 compared to the other attention mechanisms, for example in TurboDiffusion. |
Are you on Linux? Because I have tried building it so many times on Windows, and somehow link.exe just fails to link all the files. |
It works on Windows too. Don't forget to set |
|
Odd. It only took a few seconds for mine to build, since it shouldn't be building the kernels, instead using the triton-windows kernel. Here are the exact steps I follow (a sketch of the environment setup is shown after this list):
1. Open a command prompt and clone the repo (I prefer command prompt over PowerShell).
2. Create rocmvariables.bat (not sure which of these are actually needed, but they work for me) and save it to the fa folder.
3. Create and activate a venv, and install the nightly packages.
4. Install the triton-windows 3.6.0 package from this page.
5. Run the build.
It should just take a few seconds. On a side note, TheRock Windows nightly builds are working again, and my image generation times dropped by around 1/5 with these new ones compared to the last time it built back in mid-December. The initial image generation always seems to take a good 20 seconds longer than the subsequent ones, partly from loading the model, but it also seems like the VAE decode steps take longer the first time after ComfyUI is started. |
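As a rough illustration of the environment step, here is the kind of thing a rocmvariables.bat typically sets, expressed as a Python wrapper around the build; the variable names come from elsewhere in this thread and from flash-attn's AMD instructions, and not all of them may be needed on every setup:

```python
# Hypothetical build wrapper; run from inside the cloned flash-attention folder.
import os
import subprocess

os.environ["CC"] = "clang-cl"
os.environ["CXX"] = "clang-cl"
os.environ["DISTUTILS_USE_SDK"] = "1"
# ROCM_HOME from the rocm-sdk CLI that ships with TheRock wheels.
os.environ["ROCM_HOME"] = subprocess.check_output(
    ["rocm-sdk", "path", "--root"], text=True
).strip()
# Believed to select flash-attn's Triton backend on AMD; check the flash-attn
# AMD instructions for the current variable name.
os.environ["FLASH_ATTENTION_TRITON_AMD_ENABLE"] = "TRUE"

subprocess.check_call(["pip", "install", "--no-build-isolation", "."])
```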
|
I followed your steps and it worked, built in a few seconds |
Yeah, autotune only works on CDNA AFAIK. |
|
AMD's implementation of flash-attn for Triton was conspicuously missing code that would make it work much better (or at all) on RDNA. I had a play adding the required code a while back (based on the older PatientX/Zluda version) but ultimately I was just making up numbers that looked like they might be right (and very probably weren't). See these patches: https://github.com/sfinktah/amd-torch/blob/main/patches/flash_attn-2.8.1%2Btriton_amd_git2df1727-py3-none-any.patch You'll note the addition of an |
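For context, the missing code is essentially a set of Triton autotune configurations with RDNA-sized tiles; the sketch below shows the general shape of such a change with placeholder numbers, not the values from the linked patch:

```python
# Illustrative only: how per-arch autotune configs are typically declared for
# a Triton attention kernel; the block sizes here are placeholders.
import triton
import triton.language as tl

rdna_configs = [
    triton.Config({"BLOCK_M": 32, "BLOCK_N": 16}, num_warps=2, num_stages=1),
    triton.Config({"BLOCK_M": 64, "BLOCK_N": 16}, num_warps=4, num_stages=1),
]

@triton.autotune(configs=rdna_configs, key=["SEQLEN_Q", "SEQLEN_K"])
@triton.jit
def attn_fwd(Q, K, V, Out, SEQLEN_Q, SEQLEN_K,
             BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    # Kernel body omitted; the autotuner picks BLOCK_M/BLOCK_N from the
    # configs above, keyed on (SEQLEN_Q, SEQLEN_K).
    pass
```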
Yeah, I was trying to use it to train a model and getting some odd errors. It seemed fine with diffusion models, I think? Sometimes ComfyUI will silently fall back to other methods when there are failures, though.
C:\projects\training\venv\Lib\site-packages\flash_attn\flash_attn_triton_amd\fwd_prefill.py:242:0: error: Failures have been detected while processing an MLIR pass pipeline |
|
@rwfsmith I confess I've never used flash-attention except as a way to save VRAM (and have WVW use sage-attention anyway). On Nvidia it takes overnight to compile, and it still has a strange run-time warning. It's nice to know flash-attention is difficult for everyone, and it's not just an AMD thing. Did those errors appear with my patches, with the ZLUDA/PatientX version, or some other variation?

One thing I learned while optimising various attentions is that some errors are fine as long as it picks a combination of parameters that results in a valid result. Quite possibly some of those errors are a result of badly chosen parameters, and if someone took the time to work out what was valid for which GPU they might go away, but things tend to sort themselves out in the end. The exception is when something invalid DOESN'T generate an error, and you end up with all-black output. BLOCK_M: 64, BLOCK_N: 16 is an example of this for sage attention when running under TheRock (but it works fine, and is faster than 32x16, when using ZLUDA).

Also, I like to have these two environment variables set; they help you understand things a little more, and probably go faster (assuming CACHE isn't on by default, which I do not believe it is). In (for example) the patch I linked above, you can see a bunch of numbers that some AI said might be good (don't trust AI for such things, they're horrific). You can also see a comment that I pasted out of my comfy log, showing which one it selected. |
How did you even get it working? I have an RX 7600, and with these params: a bunch of these and with So I give up even considering running it. PS: this is the whole of my Launch.bat |
Did you build Flash 2 as in ROCm/TheRock#1278 (comment)?
Also, we don't have things like:
|
Can you suggest a stable ROCm and PyTorch version? I reinstalled from the latest TheRock builds and now keep getting this: |
As I said, I'm gfx1200-only, can't try other archs. That looks like an arch-specific issue.
Set compiler and build settings
Enable experimental features
This is based on: https://github.com/jammm/SpargeAttn/blob/jam/amd_windows/README_AMD_WINDOWS.md |
Went through a lot and finally fixed everything. Now, using --use-flash-attention doesn't give any errors, but the speeds aren't any better than SageAttn 1.0.6, while FlashAttn is 2.8.3. And one more thing: using these flags should show Triton selecting configs and such, which it does with --use-sage-attention, so I doubt whether it's fully working or not :( So, any way to verify? For comparison I get: Currently for a 1088x1920 image from Z Image Turbo. Also, the triton-windows 3.6.0.post23 from that repo is giving that bug with the directory path length, I think, which got fixed in the official triton-windows 3.5.1.post24. |
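One way to check that SageAttention is actually running and producing valid output, independent of ComfyUI's flags, is to call it directly and compare against SDPA; this assumes the sageattn(q, k, v) calling convention and (batch, heads, seq, dim) layout of SageAttention 1.x:

```python
# Quick correctness check: if sageattn returns values close to SDPA (and not
# zeros/NaNs), the kernel is at least running and numerically sane.
import torch
import torch.nn.functional as F
from sageattention import sageattn

q = torch.randn(1, 8, 2048, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

ref = F.scaled_dot_product_attention(q, k, v)
out = sageattn(q, k, v, is_causal=False)

print("max abs diff vs SDPA:", (out - ref).abs().max().item())
print("all finite:", torch.isfinite(out).all().item())
```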
|
I think we need to put a disclaimer somewhere: SageAttention is faster than FlashAttention (and other well-implemented fp16/bf16 attentions) only if the GPU has higher int8 performance than fp16/bf16. GPUs with RDNA3, including RX 7000 series and Strix Halo, do not. For GPUs with RDNA4, including RX 9000 series, there should be a way to do it. |
That's why, for me on RDNA4, Sage is the better option. Also, correct me if I'm wrong, but the Sage we can use now with triton-windows is Sage v1. And Jam's SpargeAttention PR theoretically works more like Sage v2 (it isn't ready for RDNA4 yet). |
I read this somewhere and I've been telling people the same thing while hoping it was correct, thank you for confirming it for me :). |

This allows Triton to run on AMD GPUs on Windows via TheRock wheels - https://github.com/ROCm/TheRock/blob/main/RELEASES.md
It should build as-is with the same build process as @woct0rdho's, as it only modifies .py files and a .c file that's compiled at runtime.
Whenever you run a program that requires triton, make sure to set the following environment variables:
- CC and CXX to clang-cl
- ROCM_HOME to the output of rocm-sdk path --root, and add it to $PATH
- DISTUTILS_USE_SDK=1

Summary of changes: