Commit 36ca523
[CK_TILE] Update gfx11 FMHA forward kernel configs (#5088)
## Motivation
Tune gfx11 FMHA codegen to recover performance, mainly for PSSK (padded
seqlen_q/k) cases.
This tuning is based on a heuristic search and improves performance for
most tested shapes.
Performance should be evaluated on top of
[`#5018`](#5018)
(required baseline).
## Technical Details
- Updated gfx11 codegen heuristic choices for tile size and occupancy.
- Updated gfx11 pipeline selection:
  - Disabled the `npad` (`f,f,f,f`) qr entry because it was consistently
    slower than the `pssk` (`t,t,f,f`) path, and kept `pssk` enabled so npad
    cases are dispatched to the faster kernel path.
- Kept gfx12 unchanged: with PSSK support from
[`#4957`](#4957),
existing gfx12 config is already sufficient.
- Tuning rationale:
  - In some cases, higher `kBlockPerCu` lowers register pressure.
  - On RDNA, this generally aligns with better performance when
    `waves_per_eu >= 6`.
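
The effect of disabling the `npad` entry can be illustrated with a minimal dispatch sketch. Everything here is hypothetical: the `Pipeline` class, `select_pipeline`, the first-match ordering, and the tile sizes are illustrative assumptions, not the actual codegen; only the pad-flag patterns (`f,f,f,f` and `t,t,f,f`) come from the description above.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Pipeline:
    """Hypothetical stand-in for one codegen pipeline entry."""
    name: str
    pad_s: bool    # handles seqlen_q that is not a tile multiple
    pad_sk: bool   # handles seqlen_k that is not a tile multiple
    pad_d: bool    # handles hdim_q that is not a tile multiple
    pad_dv: bool   # handles hdim_v that is not a tile multiple


NPAD = Pipeline("npad", False, False, False, False)  # f,f,f,f
PSSK = Pipeline("pssk", True, True, False, False)    # t,t,f,f

# Illustrative tile sizes only (assumption, not the tuned gfx11 values).
TILE_M, TILE_N, TILE_D = 128, 64, 128


def can_handle(p: Pipeline, seqlen_q: int, seqlen_k: int,
               hdim: int, hdim_v: int) -> bool:
    """A dimension is fine if the kernel pads it or it divides the tile."""
    return ((p.pad_s or seqlen_q % TILE_M == 0)
            and (p.pad_sk or seqlen_k % TILE_N == 0)
            and (p.pad_d or hdim % TILE_D == 0)
            and (p.pad_dv or hdim_v % TILE_D == 0))


def select_pipeline(candidates: Sequence[Pipeline], seqlen_q: int,
                    seqlen_k: int, hdim: int, hdim_v: int) -> Optional[Pipeline]:
    """Return the first candidate that can handle the problem shape."""
    for p in candidates:
        if can_handle(p, seqlen_q, seqlen_k, hdim, hdim_v):
            return p
    return None


before = [NPAD, PSSK]  # npad entry still enabled
after = [PSSK]         # npad entry disabled, pssk kept
```

With this model, an aligned shape such as `seqlen=4096` goes to `npad` under `before` but to `pssk` under `after`, while an unaligned shape such as `seqlen=27280` goes to `pssk` in both cases.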
## Test Plan
- `test_ck_tile_fmha`
- `tile_example_fmha_fwd`, tested on gfx1100 and gfx1151:
  `./build/bin/tile_example_fmha_fwd -prec=bf16 -mode={0/1} -b=1 -h=24 -d=128 -s={seqlen} -s_k={seqlen} -lse=0 -iperm={0/1} -operm={0/1}`
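
The placeholders in the command above can be expanded with a small script. This is a sketch under stated assumptions: the sequence lengths are taken from the result tables, `-iperm` and `-operm` are assumed to be varied together, and `build_commands` is a hypothetical helper rather than part of the repository.

```python
from itertools import product

BINARY = "./build/bin/tile_example_fmha_fwd"

# Sequence lengths as they appear in the result tables.
SEQLENS = [1024, 4096, 8192, 12288, 16384, 20480, 24576, 27280]


def build_commands(seqlens=SEQLENS):
    """Expand the {0/1} and {seqlen} placeholders into concrete commands."""
    commands = []
    # Assumption: iperm and operm are swept together as one layout switch.
    for mode, perm, seqlen in product((0, 1), (0, 1), seqlens):
        commands.append(
            f"{BINARY} -prec=bf16 -mode={mode} -b=1 -h=24 -d=128 "
            f"-s={seqlen} -s_k={seqlen} -lse=0 -iperm={perm} -operm={perm}"
        )
    return commands


if __name__ == "__main__":
    for command in build_commands():
        print(command)
```

Printing the list and piping it to a shell runs the whole sweep: 2 modes x 2 layouts x 8 sequence lengths, 32 invocations in total.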
## Test Result
- TFLOPs by sequence length; target: `gfx1100`, layout: `bhsd`
- mode: batch / VGPR usage (baseline vs. tuned): 225 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 74.10 | 71.97 | 0.97x
4096 | 66.26 | 77.79 | 1.17x
8192 | 68.18 | 75.88 | 1.11x
12288 | 68.47 | 80.44 | 1.17x
16384 | 59.54 | 79.66 | 1.34x
20480 | 55.78 | 77.91 | 1.40x
24576 | 55.08 | 77.47 | 1.41x
27280 | 47.45 | 77.16 | 1.63x

- mode: group / VGPR usage (baseline vs. tuned): 256 vs 214

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 71.47 | 70.6 | 0.99x
4096 | 64.74 | 77.06 | 1.19x
8192 | 64.68 | 75.47 | 1.17x
12288 | 66.43 | 79.95 | 1.20x
16384 | 56.02 | 79.73 | 1.42x
20480 | 50.21 | 78.15 | 1.56x
24576 | 47.29 | 77.53 | 1.64x
27280 | 46.13 | 77.04 | 1.67x

- TFLOPs by sequence length; target: `gfx1151`, layout: `bshd`
- mode: batch / VGPR usage (baseline vs. tuned): 225 vs 223

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 26.85 | 29.17 | 1.09x
4096 | 24.75 | 26.01 | 1.05x
8192 | 25.24 | 25.50 | 1.01x
12288 | 25.18 | 25.00 | 0.99x
16384 | 24.79 | 25.91 | 1.05x
20480 | 25.56 | 25.24 | 0.99x
24576 | 25.13 | 26.20 | 1.04x
27280 | 10.78 | 26.35 | 2.44x

- mode: group / VGPR usage (baseline vs. tuned): 256 vs 229

SeqLen | Baseline | Tuned | Gain
-- | -- | -- | --
1024 | 27.44 | 26.71 | 0.97x
4096 | 21.89 | 23.09 | 1.05x
8192 | 22.85 | 24.49 | 1.07x
12288 | 24.33 | 24.42 | 1.00x
16384 | 20.05 | 24.98 | 1.24x
20480 | 14.70 | 25.15 | 1.71x
24576 | 11.30 | 26.31 | 2.33x
27280 | 10.10 | 26.32 | 2.61x
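
The Gain column is the ratio of tuned to baseline TFLOPs. As a minimal sketch, the snippet below recomputes it for the `gfx1100` batch-mode table; the `rows` list and the `gain` helper are illustrative names, and the numbers are copied from the table above.

```python
# (seqlen, baseline TFLOPs, tuned TFLOPs) from the gfx1100 batch-mode table.
rows = [
    (1024, 74.10, 71.97),
    (4096, 66.26, 77.79),
    (8192, 68.18, 75.88),
    (12288, 68.47, 80.44),
    (16384, 59.54, 79.66),
    (20480, 55.78, 77.91),
    (24576, 55.08, 77.47),
    (27280, 47.45, 77.16),
]


def gain(baseline: float, tuned: float) -> float:
    """Speedup of the tuned config over the baseline (higher is better)."""
    return tuned / baseline


gains = [f"{gain(baseline, tuned):.2f}x" for _, baseline, tuned in rows]

if __name__ == "__main__":
    for (seqlen, _, _), g in zip(rows, gains):
        print(f"{seqlen:>6} | {g}")
```

Values below `1.00x`, such as the 1024 row, mark shapes where the tuned config is slightly slower than the baseline.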
## Submission Checklist
- [ ] Look over the contributing guidelines at
https://github.com/ROCm/ROCm/blob/develop/CONTRIBUTING.md#pull-requests.

1 parent feda326, commit 36ca523

4 files changed (+34, -5) under `projects/composablekernel`, in `example/ck_tile/01_fmha/codegen/ops` and `include/ck_tile` (`core`, `arch`).