hip: add RDNA4 support for mmf and mma #16835
base: master
Conversation
Yes, that's correct.
Thank you for this PR. Right now I'm implementing Volta support in parallel. A lot of the issues w.r.t. tile shapes and compilation failures that I've encountered seem to be the same. I ended up using 32x8 + 8x8 -> 32x8 tiles (note that for A and B the lengths are in units of physical 4 byte registers so it's 32x16 + 8x16 -> 32x8 in terms of logical values). I think those tile sizes are also available for AMD WMMA and using them would be simpler. So I would suggest we either merge my PR first so that there is a more similar implementation on master that you can adapt, or we try to adapt this PR as-is (I would be fine with either).
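(As a quick sanity check on the tile arithmetic above: each half2 packs two 16-bit values into one 4-byte register, so the 32x8 and 8x8 A/B tiles of half2 registers hold 32x16 and 8x16 logical half values, i.e. the K dimension doubles, while the 32x8 accumulator tile is already stated in logical values.)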
(Long-term we should have support for both 32x8 and 16x16 tiles, since 32x8 is better for <= 8 tokens but 16x16 needs 20% less shared memory I/O.)
Yep, tile<32,8,half2> is available for RDNA4 and RDNA3, as all AMD WMMA instructions use a 16x16x16 layout. Honestly, RDNA doesn't have TF32 support, so I added a dummy function in tile<16,8,float> to make the compiler happy. CDNA3 should have a 16x16x16 fp16 MFMA, but I'm not sure about the layout of its TF32 instruction; let me try to borrow an MI card once this PR is finished. Which PR gets submitted first doesn't matter. It looks like my PR doesn't pass the CI on Linux; if I'm able to fix the issue this week and the performance looks good, I will suggest using my PR first, otherwise I will suggest using your PR first.
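For illustration, a minimal sketch of the "dummy function" idea described here (this is not the PR's actual code; the `tile` template, the guard macro, and the `mma` overload shape are simplified assumptions in the spirit of ggml-cuda's mma helpers):

```cpp
// Minimal sketch of a "dummy" TF32 specialization, not the PR's actual code.
// tile<I, J, T> and the mma() overload are simplified, assumed names.
template <int I, int J, typename T>
struct tile {
    static constexpr int ne = I * J / 32; // values held per lane in a 32-wide wave
    T x[ne] = {};
};

#if defined(SKETCH_RDNA) // hypothetical guard, stands in for the real arch macros
// RDNA has no TF32 matrix instruction, so this float overload is a stub whose
// only purpose is to let shared templated callers (e.g. mmf/mmq) compile.
static __device__ __forceinline__ void mma(
        tile<16, 8, float> & D, const tile<16, 8, float> & A, const tile<8, 8, float> & B) {
    // Intentionally empty: must never be reached at runtime on RDNA.
    (void) D; (void) A; (void) B;
}
#endif
```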
My PR should be almost ready, we can decide then.
Got it, you go first; I still need to get a Linux env with ROCm 6.1.2 to pass the CI.
PR for Volta support: #16843. Instead of template specializations that only exist to make the compiler happy, I've made it so that the device code never tries to use them in the first place (to avoid accidental misuse).
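For contrast with the dummy-function approach, a rough sketch of what "never use them in the first place" can look like (illustrative only, with assumed names; not the actual #16843 code): the path is selected at compile time, so the unsupported tile type is never even named on hardware that lacks it.

```cpp
// Illustrative sketch: select the mma path at compile time so that an
// unsupported (e.g. TF32) tile specialization is never instantiated.
template <bool has_tf32_mma>
static __device__ void mul_mat_tile_sketch(const float * A, const float * B, float * D) {
    (void) A; (void) B; (void) D; // sketch only, operands unused here
    if constexpr (has_tf32_mma) {
        // TF32 tile path: only instantiated on architectures that support it.
        // mma(tile_D, tile_A, tile_B);
    } else {
        // Fallback: e.g. convert to half and use the fp16 mma, or plain FMA.
        // No TF32 tile type is referenced here, so no dummy stub is needed.
    }
}
```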
I got it; it looks similar to my PR. You can submit first, as I still need some time to get an MI308 and benchmark the model. Honestly, based on my experience, most of the compiler errors are triggered because CUDA and ROCm have different MMA layouts, so common code like mmf, mmq, and load_generic will always run into compiler trouble. Things also get more complicated when trying to implement flash attention for AMD: the C matrix layout is column-major, so it needs D = B * A + C instead of D = A * B + C to handle GEMM fusion. I'm not sure if there is a better way to handle the mma tile for different hardware.
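One way to see why swapping the operands handles the column-major accumulator (this is just the standard transpose identity, not specific to this PR): a column-major D has the same memory layout as a row-major $D^\top$, and

$$D^\top = (A B + C)^\top = B^\top A^\top + C^\top,$$

so computing the product with the roles of A and B swapped (operands read in transposed form) produces the result directly in the layout the hardware writes.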
Hello @JohannesGaessler, may I have the download links for the models you are evaluating in #16843? Thank you. I just use https://huggingface.co/TheBloke/Llama-2-7B-GGUF/resolve/main/llama-2-7b.Q4_0.gguf, but it looks like the performance doesn't change much; I'm not sure why, maybe fp16 mmvf handles too many ops. Best Regards
I used https://huggingface.co/meta-llama/Meta-Llama-3-8B and https://huggingface.co/deepseek-ai/DeepSeek-V2-Lite, which I downloaded in HuggingFace format and then converted myself.
In any case, you should only be seeing a performance difference for an FP16 model; for a q4_0 model the MMF kernel should not be used.
I determined performance using this command: `./build/bin/llama-bench --model models/opt/${model_name}-${quantization}.gguf -r 1 -fa 1 -n 0 -ub 1-16 -o sql | sqlite3 llama-bench.sqlite` for the two commits, followed by
Got it, thank you for the support. I can see the performance change on deepseek-r1-0528-qwen3-8b.f16.gguf, although the result isn't good: mmf is slower than hipBLAS. I shall spend some time investigating the reason.
Add RDNA4 support for mmf; the log of test-backend-ops perf is attached.
There are some performance improvements, for example:
before:
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1): 994 runs - 1376.16 us/run - 100.66 MFLOP/run - 73.15 GFLOPS
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1): 250 runs - 4024.50 us/run - 805.31 MFLOP/run - 200.10 GFLOPS
after:
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=4,k=2048,o=1): 7952 runs - 132.58 us/run - 100.66 MFLOP/run - 759.26 GFLOPS
MUL_MAT_ID(type_a=f16,type_b=f32,n_mats=128,n_used=8,b=0,m=768,n=32,k=2048,o=1): 1750 runs - 581.82 us/run - 805.31 MFLOP/run - 1.38 TFLOPS
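For reference, the GFLOPS column is just FLOP per run divided by time per run (e.g. 100.66 MFLOP / 132.58 us ≈ 759 GFLOPS), so the n=4 case is roughly a 10x speedup (1376.16 us -> 132.58 us) and the n=32 case roughly 6.9x (4024.50 us -> 581.82 us).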
Risks:
Could you give me the steps to measure the performance change on a real model? Thank you.
after.txt
before.txt