AMD Port of SpargeAttn - Working on windows for gfx1151 #108
jammm wants to merge 1 commit into thu-ml:main
15a0f4c to 8e7f363
Great to see this! |
A small nitpick, but for me, |
Yeah. IIRC, it's slower than aotriton SDPA at larger topk values, but at around 0.25 sparsity or below this SLA one does better. Note that SDPA FA is always dense attention, as it doesn't support SLA. |
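To make the topk discussion concrete, here is a minimal sketch of what a fractional topk could mean for block-sparse attention: keep only the highest-scoring fraction of key blocks and skip the rest. The function name `keep_topk_blocks` and the per-block scores are hypothetical and only illustrate the idea; this is not the selection logic used by the kernels in this PR.

```python
import math


def keep_topk_blocks(block_scores, topk):
    """Return the sorted indices of the highest-scoring blocks.

    block_scores: one (hypothetical) relevance score per key block.
    topk: fraction of blocks to keep, in the range (0, 1].
    """
    if not 0.0 < topk <= 1.0:
        raise ValueError("topk must be in (0, 1]")
    # Always keep at least one block so the attention row is never empty.
    n_keep = max(1, math.ceil(topk * len(block_scores)))
    # Rank block indices from highest to lowest score.
    ranked = sorted(range(len(block_scores)),
                    key=lambda i: block_scores[i], reverse=True)
    return sorted(ranked[:n_keep])


# Illustrative scores for 8 key blocks (made-up numbers).
scores = [0.1, 0.9, 0.3, 0.7, 0.2, 0.8, 0.4, 0.6]
kept = keep_topk_blocks(scores, 0.25)  # keeps 2 of the 8 blocks
```

With a topk of 1.0 every block is kept, which corresponds to the dense case that SDPA always computes.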
Hmm that's strange. It should work. Can you create an issue in https://github.com/ROCm/TheRock with repro steps? |
Ready: ROCm/TheRock#2726 |
For me, the test script from
If I change the dtype from Is this behavior expected on the gfx1200, or is there a problem with my build? |
Thanks for the catch! I've fixed the dtype issue. However, this code isn't implemented for rdna4 yet. It's just rdna3 for now. Some modifications need to be made specifically for rdna4 to work with rocWMMA as the matrix fragment layouts for the individual elements are different. But I don't have access to an RDNA4 GPU at the moment. |
8e7f363 to 1523334
I see. No worries, I can wait until you have an RDNA4 GPU available. |
Correct me if I'm wrong, but theoretically RDNA4 can use fp8, so we would need Based on: in |
Yup. Not sure about perf vs. SDPA as aotriton should have fp8 kernels too, I think? |
Just checked. My bad, I forgot it was deprecated. I'll modify the READMEs to only specify the torch installation, as that should automatically download the corresponding rocm wheels as a dependency. |
1523334 to 7a2278c
Don't really understand these things, so I think you're right. All I know is that over at Nvidia, Sparge is better than SDPA Flash in these types of workloads. |
rocm-sdk-devel didn't include the tar file for me, but this worked: pip install --index-url https://rocm.nightlies.amd.com/v2/gfx1151/ "rocm[libraries,devel]" |
Ran this to build on Linux: it builds successfully, but my cosine similarity is low. Cosine similarity: 0.998755 |
You mean on Linux? |
this is fine I believe. Give it a try on some image/video model to confirm. |
What I wanted to say: 0.998... should be fine, it’s very close to 1. Unlike the value I got: 0.467271 (because of RDNA4). I hope this PR will be optimized for the RX 9000 series in the near future. |
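For readers unfamiliar with the metric quoted above, here is a minimal sketch of cosine similarity between two flattened outputs, presumably the kernel output and a reference attention output. The function below uses plain Python lists and is an illustration only; it is not the test script from this PR.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# Made-up example values: a reference output and a slightly perturbed copy.
reference = [0.5, -1.0, 2.0, 0.25]
approximate = [0.52, -0.97, 1.98, 0.26]
similarity = cosine_similarity(reference, approximate)
```

A value of 1.0 means the two outputs point in exactly the same direction, so a result such as 0.998 indicates a close match, while a value near 0.47 suggests the output is largely wrong.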
Hello, I have an RX 7600 and built this with a similar cosine similarity of around 0.998. Using it in ComfyUI currently works like a charm. Well, topk values below 0.2 mess things up, while a topk of 0.25 gives results similar to sageattn 1.0.6 in quality and speed, so I guess it is a great alternative to sageattn on AMD. |
That was initially on Linux, but I ran it again when I rebooted into Windows, and using "rocm-sdk-devel" works there, but rocm[devel] worked on both. Funnily enough, though, I was able to build this pretty easily on Linux but am having trouble on Windows :P. |
how would I enable it for use in ComfyUI? |
Hello. I have successfully compiled the SpargeAttn AMD port you provided on Linux. My graphics card is a 9070 XT. After selecting SpargeAttn for the workflow in ComfyUI and selecting sparse_stage as the parameter, the K-sampling error is as follows. The same workflow with the same configuration can also run on Windows. I don't know if this problem can be solved. Thanks, @jammm.
ComfyUI Error Report (Error Details, Stack Trace, System Information, Devices, Logs, Attached Workflow) |
We can't yet use Sparge on RDNA4, I mean we can, but it will be bad (based on my cosine result). Need to wait for Jam to optimize this PR for RDNA4. |
Indeed. However, ComfyUI can run it on Windows, and it is faster than SageAttention, although the image quality has decreased. |
My RX 7900XT is also displaying the same error in Windows. |
Indeed, the RX 7900 XT doesn't support FP8, but @ouco1986's RX 9070 XT does. The error specifically says:
Also,
So theoretically gfx1100 (RX 7900 XT) should work - in TurboDiffusion, where it was mainly tested. I think that in the future there might be a conditional in the code, so if the GPU isn’t RDNA4 (or CDNA), it should use the FP16 path; otherwise, FP8. |
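The conditional suggested above could look roughly like the sketch below: pick FP8 for architectures with FP8 support and fall back to FP16 otherwise. The function name `select_attention_path`, the returned labels, and the prefix list are assumptions made for illustration; the PR itself currently only implements the RDNA3/3.5 path.

```python
# Architecture prefixes assumed to be FP8-capable (an assumption, not taken
# from this PR): gfx12xx is RDNA4, gfx942 and gfx950 are CDNA parts.
FP8_ARCH_PREFIXES = ("gfx12", "gfx942", "gfx950")


def select_attention_path(arch_name):
    """Return "fp8" or "fp16" for a gfx architecture string.

    arch_name may carry feature suffixes such as "gfx942:sramecc+:xnack-",
    so only the part before the first colon is inspected.
    """
    base = arch_name.split(":")[0].strip().lower()
    if not base.startswith("gfx"):
        raise ValueError(f"unrecognized architecture name: {arch_name!r}")
    if base.startswith(FP8_ARCH_PREFIXES):
        return "fp8"
    # Everything else, including RDNA3/3.5 (gfx11xx), uses the FP16 path.
    return "fp16"


path_7900xt = select_attention_path("gfx1100")  # RX 7900 XT
path_9070xt = select_attention_path("gfx1201")  # RX 9070 XT
```

On ROCm builds of PyTorch, the architecture string can typically be read from `torch.cuda.get_device_properties(0).gcnArchName`; a wrapper would pass that value in rather than hard-coding it.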
Unfortunately I have not worked on getting this running on ComfyUI yet, but it seems like the wrapper needs to be refactored a tiny bit to use the fp16-specific code path instead of the fp8 one. And yes, only RDNA3/3.5 is supported, as @0xDELUXA rightly pointed out. |
@githust66 if you have sageattention installed (which I think isn't needed either way), just install KJnodes and replace your model_optimization_nodes.py in C:\ComfyUI\custom_nodes\comfyui-kjnodes\nodes with this. You will then have an option to use this in ComfyUI with the Patch sage attention node: just select spas_sage2_attn from the list and put in the topk value you want. Edit: I removed the code for the other sage attention patches, but that won't matter since they don't work on AMD either way, I guess. |
Theoretically, ROCm/rocm-libraries#3579 also affects SpargeAttn with ROCm 7, which could mean it can be even faster than it is now. |
Where can I find this comfyui wrapper for SpargeAttn? Is it on GH? |
Would be good to know, so you could work on getting this running with Comfy, now that AMD support in ComfyUI is official. |
https://github.com/kijai/ComfyUI-WanVideoWrapper/tree/main/wanvideo/radial_attention |
@jammm Hello, I found that the KJ workflow in ComfyUI runs without errors after deleting the two kernel files of SageAttention. I don't know if this information will be helpful for future work. |
If you simply wanna play with topk values and use it on all the models, the modified python file is still better; the radial attention node doesn't work with some models, like lumina2. |
Tried to make use of https://github.com/eliotwang/sgattn_rocwmma2.0 to make SpargeAttention compatible with RDNA4 / FP8, but couldn’t get it to work... |
This is the one I used, since it works with the native nodes. https://github.com/woct0rdho/ComfyUI-RadialAttn |
Tried it in ComfyUI on Linux, it works! |
Using the latest |
Yea, only Sage Attention 1.0.6 is supported. |
Use rocWMMA instead of CUTLASS.
See README_AMD_WINDOWS.md for setup steps.
Heavily inspired by thu-ml/SageAttention#332
Used Claude Opus 4.5 to assist.
Tested with TurboDiffusion
Currently only supports RDNA3/3.5