Conversation

@JohannesGaessler (Collaborator) commented Oct 9, 2025

Changes:

  • Generalized the tile CUDA FlashAttention kernel to support essentially arbitrary head sizes (in particular 40 for Stable Diffusion and 576/512 for DeepSeek) as well as arbitrary context sizes (for optimal performance the context should still be padded to a multiple of 256; long term this can be lowered to 128). The tile kernel is now used as a fallback whenever the other kernels cannot be used. I intend to also add support for non-padded ne11 to the mma kernel.
  • Added the same GQA optimizations from the mma kernel to the tile kernel, which reduces I/O for the mask and increases arithmetic intensity for small batch sizes. To keep the number of kernel specializations low I'm using the same strategy of putting support for optional features (ALiBi, no mask, non-padded KV cache) into the version without GQA. The GQA optimizations require additional integer divisions, which are as of yet still done without fastdiv in the FA kernels; because of this there are some combinations of GPUs, models, and batch sizes with a 1-2% performance regression. I intend to add fastdiv once I have removed the WMMA kernel and expect the regression to be fixed then (a sketch of the fastdiv idea follows this list). Also note that the granularity in terms of tokens is now reduced by a factor equal to the GQA ratio, so even in those cases there is slightly less wasted compute.
  • Added support for multiple parallel warps per Q column to improve performance for small batch sizes. With this additional optimization the tile kernel now seems to be a better choice for batch size 1 than the vector kernel, particularly on AMD hardware (a sketch of how the per-warp partial results are combined also follows this list).
  • Fixed a bug in common.cuh where, if one were to compile code only for CC 6.1 and then run it on a device with CC >= 7.0, the compile-time macro FAST_FP16_AVAILABLE and the runtime check fast_fp16_available could be inconsistent (illustrated in the last sketch below).
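
For reference, the fastdiv technique mentioned above replaces a hardware integer division by a divisor that is fixed at kernel launch time with a multiply-high, an add, and a shift, using a magic multiplier precomputed on the host. The following is a minimal sketch of that idea, assuming the dividends stay below 2^31 (which holds for the tensor indices involved here); it is not the exact helper that will be added to the FA kernels.

```cuda
#include <cstdint>

// Host side: precompute the magic multiplier and shift for a fixed divisor d > 0.
struct fastdiv_vals {
    uint32_t mult;  // magic multiplier
    uint32_t shift; // ceil(log2(d))
};

static fastdiv_vals fastdiv_init(const uint32_t d) {
    uint32_t L = 0;
    while (L < 32 && (uint32_t(1) << L) < d) {
        ++L;
    }
    const uint32_t mult = uint32_t(((uint64_t(1) << 32) * ((uint64_t(1) << L) - d)) / d + 1);
    return {mult, L};
}

// Device side: quotient and remainder without a hardware integer division.
__device__ __forceinline__ uint32_t fastdiv(const uint32_t n, const fastdiv_vals fd) {
    const uint32_t hi = __umulhi(n, fd.mult); // high 32 bits of n * mult
    return (hi + n) >> fd.shift;              // == n / d for n < 2^31
}

__device__ __forceinline__ uint32_t fastmod(const uint32_t n, const fastdiv_vals fd, const uint32_t d) {
    return n - fastdiv(n, fd) * d;
}
```

With this, each division by the GQA ratio inside the kernel costs one __umulhi, one add, and one shift, with the magic pair computed once per launch and passed as a kernel argument.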
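Regarding the parallel warps per Q column: when several warps process disjoint slices of the KV data for the same Q column, each warp ends up with a partial running maximum, a partial sum of exponentials, and a partial unnormalized output, and these have to be combined at the end. Below is a generic sketch of the standard online-softmax combination for one accumulator element; it is an illustration of the technique, not the actual code in fattn-tile.cu.

```cuda
// Partial FlashAttention state held by one warp for one Q column:
//   m = running maximum of the attention logits seen so far,
//   s = sum of exp(logit - m) over the warp's KV slice,
//   o = one element of the unnormalized output accumulator (relative to m).
struct fa_partial {
    float m;
    float s;
    float o;
};

// Merge two partial states that cover disjoint KV ranges into one.
__device__ __forceinline__ fa_partial fa_merge(const fa_partial a, const fa_partial b) {
    fa_partial c;
    c.m = fmaxf(a.m, b.m);
    const float scale_a = expf(a.m - c.m); // rescale both sides to the common maximum
    const float scale_b = expf(b.m - c.m);
    c.s = scale_a*a.s + scale_b*b.s;
    c.o = scale_a*a.o + scale_b*b.o;
    return c;
}
```

The final attention output for the column is o / s of the fully merged state, so the extra warps only add a small reduction over their partial states (e.g. via shared memory) at the end of the kernel.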
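As for the common.cuh fix: the device-side macro is decided at compile time from __CUDA_ARCH__, while the host-side helper is decided at run time from the compute capability of the physical device. If only CC 6.1 code is compiled but the binary runs on a CC >= 7.0 device, the two can disagree unless the host-side check is clamped to the highest architecture that was actually compiled. The sketch below only illustrates this idea; highest_compiled_arch() and the exact conditions are placeholders, not the real code in common.cuh.

```cuda
// Device side: fixed at compile time for each compiled architecture.
#if defined(__CUDA_ARCH__) && __CUDA_ARCH__ >= 530 && __CUDA_ARCH__ != 610
#define FAST_FP16_AVAILABLE_SKETCH
#endif

// Hypothetical stand-in: the highest architecture this binary was compiled for that
// the device can run (in the real code this comes from the build configuration).
static int highest_compiled_arch(const int /*device_cc*/) {
    return 610; // e.g. only CC 6.1 was built
}

// Buggy idea: checks the physical device, so it returns true on a CC 7.0 device even
// though the code that actually runs there was compiled with __CUDA_ARCH__ == 610.
static bool fast_fp16_available_buggy(const int device_cc) {
    return device_cc >= 530 && device_cc != 610;
}

// Fixed idea: check the architecture the running code was actually compiled for,
// so that the host-side answer matches the device-side macro.
static bool fast_fp16_available_fixed(const int device_cc) {
    const int cc = highest_compiled_arch(device_cc);
    return cc >= 530 && cc != 610;
}
```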
Performance
| GPU | Model | n_gqa | Microbatch size | Test | t/s master | t/s a2672e3 | Speedup |
|---|---|---|---|---|---|---|---|
| MI50 | gemma 2B Q4_0 | 8 | 1 | pp16384 | 167.96 | 198.25 | 1.18 |
| MI50 | gemma 2B Q4_0 | 8 | 2 | pp16384 | 176.09 | 352.99 | 2.00 |
| MI50 | gemma 2B Q4_0 | 8 | 4 | pp16384 | 192.18 | 369.27 | 1.92 |
| MI50 | gemma 2B Q4_0 | 8 | 8 | pp16384 | 202.19 | 497.16 | 2.46 |
| MI50 | gemma 2B Q4_0 | 8 | 16 | pp16384 | 723.23 | 736.30 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 32 | pp16384 | 919.54 | 924.45 | 1.01 |
| MI50 | gemma 2B Q4_0 | 8 | 64 | pp16384 | 1059.55 | 1077.96 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 128 | pp16384 | 1686.29 | 1687.00 | 1.00 |
| MI50 | gemma 2B Q4_0 | 8 | 256 | pp16384 | 2129.54 | 2161.97 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 512 | pp16384 | 2301.56 | 2358.12 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 1024 | pp16384 | 2495.75 | 2555.49 | 1.02 |
| MI50 | gemma 2B Q4_0 | 8 | 2048 | pp16384 | 2553.87 | 2623.85 | 1.03 |
| MI50 | gemma3 1B Q4_0 | 4 | 1 | pp16384 | 164.69 | 202.31 | 1.23 |
| MI50 | gemma3 1B Q4_0 | 4 | 2 | pp16384 | 210.73 | 397.85 | 1.89 |
| MI50 | gemma3 1B Q4_0 | 4 | 4 | pp16384 | 321.87 | 528.05 | 1.64 |
| MI50 | gemma3 1B Q4_0 | 4 | 8 | pp16384 | 488.20 | 749.39 | 1.53 |
| MI50 | gemma3 1B Q4_0 | 4 | 16 | pp16384 | 1067.05 | 1060.52 | 0.99 |
| MI50 | gemma3 1B Q4_0 | 4 | 32 | pp16384 | 1466.17 | 1470.02 | 1.00 |
| MI50 | gemma3 1B Q4_0 | 4 | 64 | pp16384 | 1786.39 | 1787.31 | 1.00 |
| MI50 | gemma3 1B Q4_0 | 4 | 128 | pp16384 | 2902.76 | 2920.32 | 1.01 |
| MI50 | gemma3 1B Q4_0 | 4 | 256 | pp16384 | 4668.56 | 4721.55 | 1.01 |
| MI50 | gemma3 1B Q4_0 | 4 | 512 | pp16384 | 5672.12 | 5747.29 | 1.01 |
| MI50 | gemma3 1B Q4_0 | 4 | 1024 | pp16384 | 6558.17 | 6697.35 | 1.02 |
| MI50 | gemma3 1B Q4_0 | 4 | 2048 | pp16384 | 6883.96 | 7051.50 | 1.02 |
| MI50 | llama 1B Q4_0 | 4 | 1 | pp16384 | 242.57 | 278.66 | 1.15 |
| MI50 | llama 1B Q4_0 | 4 | 2 | pp16384 | 353.90 | 543.58 | 1.54 |
| MI50 | llama 1B Q4_0 | 4 | 4 | pp16384 | 371.55 | 606.71 | 1.63 |
| MI50 | llama 1B Q4_0 | 4 | 8 | pp16384 | 432.43 | 911.49 | 2.11 |
| MI50 | llama 1B Q4_0 | 4 | 16 | pp16384 | 1059.64 | 1088.76 | 1.03 |
| MI50 | llama 1B Q4_0 | 4 | 32 | pp16384 | 1331.31 | 1424.42 | 1.07 |
| MI50 | llama 1B Q4_0 | 4 | 64 | pp16384 | 1583.11 | 1632.63 | 1.03 |
| MI50 | llama 1B Q4_0 | 4 | 128 | pp16384 | 2286.05 | 2402.55 | 1.05 |
| MI50 | llama 1B Q4_0 | 4 | 256 | pp16384 | 2826.65 | 3022.98 | 1.07 |
| MI50 | llama 1B Q4_0 | 4 | 512 | pp16384 | 3185.15 | 3460.37 | 1.09 |
| MI50 | llama 1B Q4_0 | 4 | 1024 | pp16384 | 3357.02 | 3739.09 | 1.11 |
| MI50 | llama 1B Q4_0 | 4 | 2048 | pp16384 | 3421.31 | 3845.73 | 1.12 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 1 | pp16384 | 162.20 | 184.57 | 1.14 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 2 | pp16384 | 253.97 | 368.12 | 1.45 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 4 | pp16384 | 269.24 | 539.88 | 2.01 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 8 | pp16384 | 276.55 | 828.88 | 3.00 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 16 | pp16384 | 1021.37 | 1081.49 | 1.06 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 32 | pp16384 | 1292.01 | 1299.16 | 1.01 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 64 | pp16384 | 1539.11 | 1587.87 | 1.03 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 128 | pp16384 | 2178.27 | 2304.22 | 1.06 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 256 | pp16384 | 2817.67 | 3097.36 | 1.10 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 512 | pp16384 | 2975.17 | 3316.98 | 1.11 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 1024 | pp16384 | 3140.11 | 3577.30 | 1.14 |
| MI50 | qwen3 0.6B Q4_0 | 2 | 2048 | pp16384 | 3197.84 | 3675.02 | 1.15 |
| RX 6800 | gemma 2B Q4_0 | 8 | 1 | pp16384 | 130.83 | 153.47 | 1.17 |
| RX 6800 | gemma 2B Q4_0 | 8 | 2 | pp16384 | 144.47 | 282.76 | 1.96 |
| RX 6800 | gemma 2B Q4_0 | 8 | 4 | pp16384 | 184.49 | 444.55 | 2.41 |
| RX 6800 | gemma 2B Q4_0 | 8 | 8 | pp16384 | 204.98 | 569.48 | 2.78 |
| RX 6800 | gemma 2B Q4_0 | 8 | 16 | pp16384 | 627.12 | 704.11 | 1.12 |
| RX 6800 | gemma 2B Q4_0 | 8 | 32 | pp16384 | 982.44 | 1030.58 | 1.05 |
| RX 6800 | gemma 2B Q4_0 | 8 | 64 | pp16384 | 1277.53 | 1322.01 | 1.03 |
| RX 6800 | gemma 2B Q4_0 | 8 | 128 | pp16384 | 1572.44 | 1617.47 | 1.03 |
| RX 6800 | gemma 2B Q4_0 | 8 | 256 | pp16384 | 1798.28 | 1861.54 | 1.04 |
| RX 6800 | gemma 2B Q4_0 | 8 | 512 | pp16384 | 1965.40 | 2042.51 | 1.04 |
| RX 6800 | gemma 2B Q4_0 | 8 | 1024 | pp16384 | 2044.63 | 2132.01 | 1.04 |
| RX 6800 | gemma 2B Q4_0 | 8 | 2048 | pp16384 | 2057.27 | 2142.36 | 1.04 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 1 | pp16384 | 145.73 | 167.24 | 1.15 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 2 | pp16384 | 228.12 | 325.23 | 1.43 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 4 | pp16384 | 389.37 | 573.70 | 1.47 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 8 | pp16384 | 572.42 | 852.88 | 1.49 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 16 | pp16384 | 1026.18 | 1044.69 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 32 | pp16384 | 1676.56 | 1679.82 | 1.00 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 64 | pp16384 | 2335.22 | 2421.80 | 1.04 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 128 | pp16384 | 3635.14 | 3724.20 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 256 | pp16384 | 4423.00 | 4520.34 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 512 | pp16384 | 5197.50 | 5324.55 | 1.02 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 1024 | pp16384 | 5534.11 | 5707.96 | 1.03 |
| RX 6800 | gemma3 1B Q4_0 | 4 | 2048 | pp16384 | 5416.10 | 5608.01 | 1.04 |
| RX 6800 | llama 1B Q4_0 | 4 | 1 | pp16384 | 161.23 | 203.56 | 1.26 |
| RX 6800 | llama 1B Q4_0 | 4 | 2 | pp16384 | 274.30 | 392.64 | 1.43 |
| RX 6800 | llama 1B Q4_0 | 4 | 4 | pp16384 | 388.18 | 661.61 | 1.70 |
| RX 6800 | llama 1B Q4_0 | 4 | 8 | pp16384 | 474.39 | 879.80 | 1.85 |
| RX 6800 | llama 1B Q4_0 | 4 | 16 | pp16384 | 964.20 | 993.17 | 1.03 |
| RX 6800 | llama 1B Q4_0 | 4 | 32 | pp16384 | 1366.04 | 1293.80 | 0.95 |
| RX 6800 | llama 1B Q4_0 | 4 | 64 | pp16384 | 1758.54 | 1783.32 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 4 | 128 | pp16384 | 2111.00 | 2169.63 | 1.03 |
| RX 6800 | llama 1B Q4_0 | 4 | 256 | pp16384 | 2438.34 | 2474.46 | 1.01 |
| RX 6800 | llama 1B Q4_0 | 4 | 512 | pp16384 | 2552.97 | 2635.47 | 1.03 |
| RX 6800 | llama 1B Q4_0 | 4 | 1024 | pp16384 | 2633.76 | 2806.73 | 1.07 |
| RX 6800 | llama 1B Q4_0 | 4 | 2048 | pp16384 | 2644.12 | 2831.54 | 1.07 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 1 | pp16384 | 121.95 | 133.86 | 1.10 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 2 | pp16384 | 209.27 | 261.31 | 1.25 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 4 | pp16384 | 293.08 | 497.67 | 1.70 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 8 | pp16384 | 380.72 | 770.64 | 2.02 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 16 | pp16384 | 835.10 | 935.36 | 1.12 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 32 | pp16384 | 1285.99 | 1345.56 | 1.05 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 64 | pp16384 | 1711.28 | 1662.70 | 0.97 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 128 | pp16384 | 2160.39 | 2110.09 | 0.98 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 256 | pp16384 | 2395.22 | 2350.87 | 0.98 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 512 | pp16384 | 2588.48 | 2531.94 | 0.98 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 1024 | pp16384 | 2710.27 | 2680.50 | 0.99 |
| RX 6800 | qwen3 0.6B Q4_0 | 2 | 2048 | pp16384 | 2746.50 | 2704.52 | 0.98 |
| P40 | gemma 2B Q4_0 | 8 | 1 | pp16384 | 136.46 | 145.68 | 1.07 |
| P40 | gemma 2B Q4_0 | 8 | 2 | pp16384 | 257.59 | 283.79 | 1.10 |
| P40 | gemma 2B Q4_0 | 8 | 4 | pp16384 | 321.20 | 356.73 | 1.11 |
| P40 | gemma 2B Q4_0 | 8 | 8 | pp16384 | 453.33 | 521.46 | 1.15 |
| P40 | gemma 2B Q4_0 | 8 | 16 | pp16384 | 827.93 | 923.70 | 1.12 |
| P40 | gemma 2B Q4_0 | 8 | 32 | pp16384 | 1182.04 | 1187.99 | 1.01 |
| P40 | gemma 2B Q4_0 | 8 | 64 | pp16384 | 1435.17 | 1434.64 | 1.00 |
| P40 | gemma 2B Q4_0 | 8 | 128 | pp16384 | 1577.10 | 1567.24 | 0.99 |
| P40 | gemma 2B Q4_0 | 8 | 256 | pp16384 | 1690.60 | 1669.24 | 0.99 |
| P40 | gemma 2B Q4_0 | 8 | 512 | pp16384 | 1746.00 | 1749.15 | 1.00 |
| P40 | gemma 2B Q4_0 | 8 | 1024 | pp16384 | 1816.87 | 1835.29 | 1.01 |
| P40 | gemma 2B Q4_0 | 8 | 2048 | pp16384 | 1832.82 | 1845.11 | 1.01 |
| P40 | gemma3 1B Q4_0 | 4 | 1 | pp16384 | 176.32 | 178.80 | 1.01 |
| P40 | gemma3 1B Q4_0 | 4 | 2 | pp16384 | 349.08 | 391.32 | 1.12 |
| P40 | gemma3 1B Q4_0 | 4 | 4 | pp16384 | 507.38 | 528.53 | 1.04 |
| P40 | gemma3 1B Q4_0 | 4 | 8 | pp16384 | 789.66 | 811.80 | 1.03 |
| P40 | gemma3 1B Q4_0 | 4 | 16 | pp16384 | 1579.54 | 1574.54 | 1.00 |
| P40 | gemma3 1B Q4_0 | 4 | 32 | pp16384 | 2415.65 | 2271.73 | 0.94 |
| P40 | gemma3 1B Q4_0 | 4 | 64 | pp16384 | 3250.38 | 3161.71 | 0.97 |
| P40 | gemma3 1B Q4_0 | 4 | 128 | pp16384 | 4229.45 | 4103.76 | 0.97 |
| P40 | gemma3 1B Q4_0 | 4 | 256 | pp16384 | 4751.62 | 4694.10 | 0.99 |
| P40 | gemma3 1B Q4_0 | 4 | 512 | pp16384 | 5094.48 | 5008.90 | 0.98 |
| P40 | gemma3 1B Q4_0 | 4 | 1024 | pp16384 | 5299.97 | 5233.38 | 0.99 |
| P40 | gemma3 1B Q4_0 | 4 | 2048 | pp16384 | 5076.41 | 4991.79 | 0.98 |
| P40 | llama 1B Q4_0 | 4 | 1 | pp16384 | 213.29 | 227.39 | 1.07 |
| P40 | llama 1B Q4_0 | 4 | 2 | pp16384 | 383.43 | 445.89 | 1.16 |
| P40 | llama 1B Q4_0 | 4 | 4 | pp16384 | 471.83 | 581.99 | 1.23 |
| P40 | llama 1B Q4_0 | 4 | 8 | pp16384 | 636.59 | 850.52 | 1.34 |
| P40 | llama 1B Q4_0 | 4 | 16 | pp16384 | 1218.00 | 1325.84 | 1.09 |
| P40 | llama 1B Q4_0 | 4 | 32 | pp16384 | 1758.38 | 1734.19 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 64 | pp16384 | 2092.75 | 2068.77 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 128 | pp16384 | 2336.87 | 2301.87 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 256 | pp16384 | 2533.50 | 2488.52 | 0.98 |
| P40 | llama 1B Q4_0 | 4 | 512 | pp16384 | 2584.86 | 2541.65 | 0.98 |
| P40 | llama 1B Q4_0 | 4 | 1024 | pp16384 | 2656.00 | 2621.65 | 0.99 |
| P40 | llama 1B Q4_0 | 4 | 2048 | pp16384 | 2677.45 | 2647.84 | 0.99 |
| P40 | qwen3 0.6B Q4_0 | 4 | 1 | pp16384 | 135.82 | 138.74 | 1.02 |
| P40 | qwen3 0.6B Q4_0 | 4 | 2 | pp16384 | 246.87 | 287.76 | 1.17 |
| P40 | qwen3 0.6B Q4_0 | 4 | 4 | pp16384 | 384.14 | 426.13 | 1.11 |
| P40 | qwen3 0.6B Q4_0 | 4 | 8 | pp16384 | 518.79 | 686.87 | 1.32 |
| P40 | qwen3 0.6B Q4_0 | 4 | 16 | pp16384 | 991.77 | 1137.43 | 1.15 |
| P40 | qwen3 0.6B Q4_0 | 4 | 32 | pp16384 | 1355.88 | 1385.99 | 1.02 |
| P40 | qwen3 0.6B Q4_0 | 4 | 64 | pp16384 | 1534.17 | 1654.65 | 1.08 |
| P40 | qwen3 0.6B Q4_0 | 4 | 128 | pp16384 | 1662.60 | 1824.45 | 1.10 |
| P40 | qwen3 0.6B Q4_0 | 4 | 256 | pp16384 | 1758.51 | 1967.01 | 1.12 |
| P40 | qwen3 0.6B Q4_0 | 4 | 512 | pp16384 | 1811.06 | 2056.33 | 1.14 |
| P40 | qwen3 0.6B Q4_0 | 4 | 1024 | pp16384 | 1853.22 | 2089.37 | 1.13 |
| P40 | qwen3 0.6B Q4_0 | 4 | 2048 | pp16384 | 1867.47 | 2096.54 | 1.12 |

@github-actions bot added the Nvidia GPU, python, and ggml labels on Oct 9, 2025
@ggerganov (Member)

Been doing some tests with this branch and haven't noticed any problems so far.

@IMbackK (Collaborator) left a comment

I can confirm the performance changes on gfx1030 and found no issues in brief testing.
From static analysis it looks correct, but it's a bit difficult to follow what the changes to the code in fattn-tile.cu are, since this PR includes organizational and functional code changes in one commit, which I would prefer be avoided.

@JohannesGaessler (Collaborator, Author)

> this PR includes organizational and functional code changes in one commit, which I would prefer be avoided.

I agree, but in this case the changes to the kernel itself were relatively large anyway, so I think it will need to be read in full either way. Generally speaking, would you prefer that I link the relevant WIP branches in cases like this?

@JohannesGaessler merged commit 11f0af5 into ggml-org:master on Oct 11, 2025, with 71 checks passed
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 12, 2025
@IMbackK (Collaborator) commented Oct 13, 2025

> this PR includes organizational and functional code changes in one commit, which I would prefer be avoided.

> I agree, but in this case the changes to the kernel itself were relatively large anyway, so I think it will need to be read in full either way. Generally speaking, would you prefer that I link the relevant WIP branches in cases like this?

Ideally a PR like this should simply have two commits: one with the organizational changes and one with the functional changes. If that is impractical due to how the changes came about, then yes, a note on where intermediate states can be looked at would help.

Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 13, 2025
gabe-l-hart added a commit to gabe-l-hart/llama.cpp that referenced this pull request Oct 13, 2025
* origin/master: (32 commits)
metal : FA support F32 K and V and head size = 32 (ggml-org#16531)
graph : support cacheless embeddings with FA and iSWA (ggml-org#16528)
opencl: fix build targeting CL 2 (ggml-org#16554)
CUDA: fix numerical issues in tile FA kernel (ggml-org#16540)
ggml : fix build broken with -march=armv9-a on MacOS (ggml-org#16520)
CANN: fix CPU memory leak in CANN backend (ggml-org#16549)
fix: add remark plugin to render raw HTML as literal text (ggml-org#16505)
metal: add support for opt_step_sgd (ggml-org#16539)
ggml : fix scalar path for computing norm (ggml-org#16558)
CANN: Update several operators to support FP16 data format (ggml-org#16251)
metal : add opt_step_adamw and op_sum (ggml-org#16529)
webui: remove client-side context pre-check and rely on backend for limits (ggml-org#16506)
[SYCL] fix UT fault cases: count-equal, argsort, pad OPs (ggml-org#16521)
ci : add Vulkan on Ubuntu with default packages build (ggml-org#16532)
common : handle unicode during partial json parsing (ggml-org#16526)
common : update presets (ggml-org#16504)
ggml : Fix FP16 ELU positive branch (ggml-org#16519)
hparams : add check for layer index in is_recurrent (ggml-org#16511)
ggml: Correct SVE implementation in ggml_vec_dot_f16_unroll (ggml-org#16518)
CUDA: faster tile FA, add oob checks, more HSs (ggml-org#16492)
...
Nexesenex added a commit to Nexesenex/croco.cpp that referenced this pull request Oct 13, 2025
yael-works pushed a commit to yael-works/llama.cpp that referenced this pull request Oct 15, 2025