Commit 7bd6299

updated exps
1 parent 96f02e8 commit 7bd6299

49 files changed (+72, -6 lines changed)
_pages/dffa.md

Lines changed: 12 additions & 6 deletions
@@ -257,10 +257,9 @@ To demonstrate FFA kernels' state-of-the-art performance and flexibility in hand
 | number of heads (nh) | nhq:nhk:nhv = 64:8:8 (GQA) |
 | head dimension (hd) | 128 |
 | dtype | torch.bfloat16 |
-| dropout probability | 0.0 |
 | window size | 1024 (for sliding window masks only) |

-Benchmark settings: for each mask pattern, we vary the sequence length $seqlen$ from $4k,8k,16k,...,$ up to $128k$ ($seqlen_q = seqlen_k = seqlen$) while measuring computation power (in $\texttt{TFLOPs/s}$) for forward and backward passes of different attention kernels. Other configurations are fixed using common training settings (see the table above) to focus on the impact of sequence length and mask pattern. For the varlen packed data, we simply follow the variable sequence length distribution in the open-sourced dataset<d-cite key="xu2024chatqa"></d-cite> illustrated in the following figure, from which we sample to pack and pad to the required $seqlen$.
+Benchmark settings: for each mask pattern, we vary the sequence length $seqlen$ from $4k,8k,16k,...,$ up to $128k$ ($seqlen_q = seqlen_k = seqlen$) while measuring the throughput (in $\texttt{TFLOPs/s}$) for forward and backward passes of different attention kernels. Other configurations are fixed using common training settings (see the table above) to focus on the impact of sequence length and mask pattern. For the varlen packed data, we simply follow the variable sequence length distribution in the open-sourced dataset<d-cite key="xu2024chatqa"></d-cite> illustrated in the following figure, from which we sample to pack and pad to the required $seqlen$.

 <div class="l-middle" align="center">
 <img src="assets/img/magiattn/varlen_seqlen_distribution.png" width="80%">
@@ -273,33 +272,40 @@ Benchmark settings: for each mask pattern, we vary the sequence length $seqlen$
 Results are reported in the following figures.

 <div class="l-middle">
-<img src="assets/img/magiattn/ffa_exp/ffa_perf_report_full_all_family.png" width="100%">
+<img src="assets/img/magiattn/ffa_exp/attn with fulll mask/perf_report_all.png" width="100%">
 <div class="caption">
 Benchmarking FFA's performance and flexibility against other leading attention kernels for full mask scenarios.
 </div>
 </div>

 <div class="l-middle">
-<img src="assets/img/magiattn/ffa_exp/ffa_perf_report_causal_all_family.png" width="100%">
+<img src="assets/img/magiattn/ffa_exp/attn with causal mask/perf_report_all.png" width="100%">
 <div class="caption">
 Benchmarking FFA's performance and flexibility against other leading attention kernels for causal mask scenarios.
 </div>
 </div>

 <div class="l-middle">
-<img src="assets/img/magiattn/ffa_exp/ffa_perf_report_varlen_full_all_family.png" width="100%">
+<img src="assets/img/magiattn/ffa_exp/attn with varlen full mask/perf_report_all.png" width="100%">
 <div class="caption left">
 Benchmarking FFA's performance and flexibility against other leading attention kernels for varlen full mask scenarios. (Note that: the $\mathbf{E}$ symbol indicates the corresponding distributed attention implementation raises <em>Cuda Out of Memory</em> error in that specific configuration.)
 </div>
 </div>

 <div class="l-middle">
-<img src="assets/img/magiattn/ffa_exp/ffa_perf_report_varlen_causal_all_family.png" width="100%">
+<img src="assets/img/magiattn/ffa_exp/attn with varlen causal mask/perf_report_all.png" width="100%">
 <div class="caption left">
 Benchmarking FFA's performance and flexibility against other leading attention kernels for varlen causal mask scenarios. (Note that: the $\mathbf{E}$ symbol indicates the corresponding distributed attention implementation raises <em>Cuda Out of Memory</em> error in that specific configuration.)
 </div>
 </div>

+<div class="l-middle">
+<img src="zeus/assets/img/magiattn/ffa_exp/attn with sw causal mask/perf_report_all.png" width="100%">
+<div class="caption left">
+Benchmarking FFA's performance and flexibility against other leading attention kernels for sliding-window causal mask scenarios. (Note that: the $\mathbf{E}$ symbol indicates the corresponding distributed attention implementation raises <em>Cuda Out of Memory</em> error in that specific configuration.)
+</div>
+</div>
+


 ### Module-Level
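
As a side note on the benchmark settings edited in the first hunk above, here is a minimal, hypothetical sketch (not the authors' actual harness) of how such a TFLOPs/s sweep can be measured: time forward plus backward of an attention callable for each sequence length and divide the standard attention FLOP count by the elapsed time. The tensor layout, the `sdpa_ref` baseline, and the helper names below are illustrative assumptions; only the GQA 64:8:8 / hd=128 / bf16 configuration comes from the settings table.

```python
import time

import torch
import torch.nn.functional as F


def attn_flops(seqlen_q, seqlen_k, nhq=64, hd=128, causal=False):
    # Two matmuls (Q @ K^T and P @ V), 2 FLOPs per multiply-add;
    # a causal mask roughly halves the work.
    flops = 4 * seqlen_q * seqlen_k * nhq * hd
    return flops // 2 if causal else flops


def sdpa_ref(q, k, v, causal=False):
    # Reference baseline via PyTorch SDPA; it expects [batch, heads, seqlen, hd],
    # so transpose from the assumed [batch, seqlen, heads, hd] layout and
    # replicate the KV heads to match the query heads for GQA.
    q, k, v = (t.transpose(1, 2) for t in (q, k, v))
    rep = q.shape[1] // k.shape[1]
    k = k.repeat_interleave(rep, dim=1)
    v = v.repeat_interleave(rep, dim=1)
    out = F.scaled_dot_product_attention(q, k, v, is_causal=causal)
    return out.transpose(1, 2)


def bench_tflops(attn_fn, seqlen, nhq=64, nhk=8, hd=128, causal=False,
                 iters=10, device="cuda", dtype=torch.bfloat16):
    # Random inputs in the settings-table config: GQA 64:8:8 heads, hd=128, bf16.
    q = torch.randn(1, seqlen, nhq, hd, device=device, dtype=dtype, requires_grad=True)
    k = torch.randn(1, seqlen, nhk, hd, device=device, dtype=dtype, requires_grad=True)
    v = torch.randn(1, seqlen, nhk, hd, device=device, dtype=dtype, requires_grad=True)

    attn_fn(q, k, v).sum().backward()  # warmup
    torch.cuda.synchronize()
    t0 = time.perf_counter()
    for _ in range(iters):
        attn_fn(q, k, v).sum().backward()
    torch.cuda.synchronize()
    elapsed = (time.perf_counter() - t0) / iters

    # Forward costs roughly 1x the attention FLOPs, backward roughly 2x.
    total_flops = 3 * attn_flops(seqlen, seqlen, nhq=nhq, hd=hd, causal=causal)
    return total_flops / elapsed / 1e12


if __name__ == "__main__":
    # Sweep seqlen = 4k, 8k, ..., 128k as in the benchmark settings above.
    for seqlen in (4096 * 2**i for i in range(6)):
        fn = lambda q, k, v: sdpa_ref(q, k, v, causal=True)
        print(f"seqlen={seqlen // 1024}k: {bench_tflops(fn, seqlen, causal=True):.1f} TFLOPs/s")
```

For non-causal masks the halving factor is dropped; for sliding-window or varlen masks the FLOP count would instead be scaled by the mask's actual density, which this sketch does not attempt to model.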
The remaining changed files are binary image assets (not shown).