Commit d7a9234
perf: prefetch page indices for mla kernel (flashinfer-ai#991)
Follow-up to flashinfer-ai#952
cc @abcdabcd987
## Before this PR
```
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1509.87 GB/s
FLOPs: 163.25 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1766.19 GB/s
FLOPs: 345.46 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64
Memory bandwidth: 2307.97 GB/s
FLOPs: 249.55 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128
Memory bandwidth: 1975.24 GB/s
FLOPs: 386.35 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64
Memory bandwidth: 2871.63 GB/s
FLOPs: 310.49 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128
Memory bandwidth: 2225.07 GB/s
FLOPs: 435.21 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=64
Memory bandwidth: 1948.15 GB/s
FLOPs: 222.38 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=128
Memory bandwidth: 1973.36 GB/s
FLOPs: 426.74 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64
Memory bandwidth: 2625.63 GB/s
FLOPs: 299.72 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128
Memory bandwidth: 2121.92 GB/s
FLOPs: 458.86 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64
Memory bandwidth: 2996.11 GB/s
FLOPs: 342.01 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128
Memory bandwidth: 2146.40 GB/s
FLOPs: 464.16 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=64
Memory bandwidth: 2717.28 GB/s
FLOPs: 323.71 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=128
Memory bandwidth: 2129.24 GB/s
FLOPs: 500.04 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64
Memory bandwidth: 3002.75 GB/s
FLOPs: 357.72 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128
Memory bandwidth: 2101.93 GB/s
FLOPs: 493.63 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64
Memory bandwidth: 3083.42 GB/s
FLOPs: 367.33 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 2064.96 GB/s
FLOPs: 484.95 TFLOPs
```
## After this PR
```
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1596.98 GB/s
FLOPs: 172.67 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1685.22 GB/s
FLOPs: 329.62 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=64
Memory bandwidth: 2280.49 GB/s
FLOPs: 246.58 TFLOPs
Config: batch_size=128, seq_len=1024, num_heads=128
Memory bandwidth: 1917.53 GB/s
FLOPs: 375.06 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=64
Memory bandwidth: 2869.03 GB/s
FLOPs: 310.21 TFLOPs
Config: batch_size=768, seq_len=1024, num_heads=128
Memory bandwidth: 2208.35 GB/s
FLOPs: 431.94 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=64
Memory bandwidth: 2047.44 GB/s
FLOPs: 233.72 TFLOPs
Config: batch_size=64, seq_len=2048, num_heads=128
Memory bandwidth: 1936.08 GB/s
FLOPs: 418.67 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=64
Memory bandwidth: 2617.48 GB/s
FLOPs: 298.79 TFLOPs
Config: batch_size=128, seq_len=2048, num_heads=128
Memory bandwidth: 2105.97 GB/s
FLOPs: 455.41 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=64
Memory bandwidth: 2999.55 GB/s
FLOPs: 342.40 TFLOPs
Config: batch_size=768, seq_len=2048, num_heads=128
Memory bandwidth: 2181.54 GB/s
FLOPs: 471.75 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=64
Memory bandwidth: 2780.86 GB/s
FLOPs: 331.29 TFLOPs
Config: batch_size=64, seq_len=8192, num_heads=128
Memory bandwidth: 2176.12 GB/s
FLOPs: 511.05 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=64
Memory bandwidth: 3031.58 GB/s
FLOPs: 361.15 TFLOPs
Config: batch_size=128, seq_len=8192, num_heads=128
Memory bandwidth: 2165.73 GB/s
FLOPs: 508.61 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=64
Memory bandwidth: 3126.37 GB/s
FLOPs: 372.45 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 2142.42 GB/s
FLOPs: 503.14 TFLOPs
```
1 parent 17ff5a7 commit d7a9234
2 files changed: +57 -24 lines changed
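The two benchmark listings are easier to compare as per-configuration relative changes. The following is a minimal sketch and not part of the commit: the helper names `parse_bench` and `relative_change` are hypothetical, and only three of the eighteen configurations are pasted in for brevity.

```python
# Sketch: parse the "Config / Memory bandwidth / FLOPs" listings and
# report the percentage change per configuration.
# parse_bench and relative_change are illustrative names, not flashinfer APIs.

BEFORE = """\
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1509.87 GB/s
FLOPs: 163.25 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1766.19 GB/s
FLOPs: 345.46 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 2064.96 GB/s
FLOPs: 484.95 TFLOPs
"""

AFTER = """\
Config: batch_size=64, seq_len=1024, num_heads=64
Memory bandwidth: 1596.98 GB/s
FLOPs: 172.67 TFLOPs
Config: batch_size=64, seq_len=1024, num_heads=128
Memory bandwidth: 1685.22 GB/s
FLOPs: 329.62 TFLOPs
Config: batch_size=768, seq_len=8192, num_heads=128
Memory bandwidth: 2142.42 GB/s
FLOPs: 503.14 TFLOPs
"""


def parse_bench(text):
    """Map (batch_size, seq_len, num_heads) -> (bandwidth GB/s, TFLOPs)."""
    results = {}
    key = None
    bandwidth = None
    for line in text.splitlines():
        line = line.strip()
        if line.startswith("Config:"):
            # "batch_size=64, seq_len=1024, num_heads=64" -> dict of strings
            pairs = line[len("Config:"):].split(",")
            fields = dict(pair.strip().split("=") for pair in pairs)
            key = (
                int(fields["batch_size"]),
                int(fields["seq_len"]),
                int(fields["num_heads"]),
            )
        elif line.startswith("Memory bandwidth:"):
            bandwidth = float(line.split()[2])
        elif line.startswith("FLOPs:"):
            results[key] = (bandwidth, float(line.split()[1]))
    return results


def relative_change(before, after):
    """Percentage change (bandwidth, FLOPs) for configs present in both runs."""
    changes = {}
    for key, (bw_old, fl_old) in before.items():
        if key in after:
            bw_new, fl_new = after[key]
            changes[key] = (
                100.0 * (bw_new - bw_old) / bw_old,
                100.0 * (fl_new - fl_old) / fl_old,
            )
    return changes


changes = relative_change(parse_bench(BEFORE), parse_bench(AFTER))
for (batch, seq, heads), (bw_pct, fl_pct) in sorted(changes.items()):
    print(f"bs={batch} seq={seq} heads={heads}: "
          f"bandwidth {bw_pct:+.2f}%, FLOPs {fl_pct:+.2f}%")
```

Applied to the full listings, the FLOPs figures for all six `seq_len=8192` configurations are higher after the change, while five of the six `seq_len=1024` configurations are flat or slightly lower.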