Commit ac77045

Support pack_gqa for ffa fwd (#185)
* add packgqa template
* smem 3d copy done
* add 2d copy
* load with qhead_per_khead=4 correct
* support packgqa/nopackgqa done
* add qhead_per_khead as template arg
* update bench for packgqa
* support lse writeback for no q overlap
* support packgqa o write back without q overlap
* support fwd_epilogue with q overlap
* support pack_gqa with full attention
* update bench and test for uniform block sparse with packgqa
* add packgqa bench
* update profile_ffa
* fix profile_ffa for packgqa
* support pack_gqa with variable block sparse
* enhance test_block_sparse_attn without lse
* fix lse in test_block_sparse_attn
* support all mask type for packgqa
* support deterministic for packgqa fwd
* fix packgqa bench
* simple change tile_scheduler
* change fwd tile_scheduler
* separate fwd and bwd tilescheduler
* add bwd tilescheduler
* support deterministic for fwd tile_scheduler
* support deterministic for new tile_scheduler and packgqa
* fix lint
* format for python code
* combine packgqa with swapab
* format
* refactor fwd_tile_scheduler
* format
* fix copyright
* add more comments
* fix bench
* fix test_flex_flash_attn
* fix packgqa default value
* fix Jit param of packgqa
* format
* format
1 parent dec7246 commit ac77045
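
The commit message mentions adding `qhead_per_khead` as a template argument, and the README change notes that varying `nhk` selects different packgqa settings. As a rough illustration of that relationship, the pure-Python sketch below computes the group size from `nhq` and `nhk` and maps a row of a packed Q tile back to a token index and a head index within the KV group. The row layout (head-in-group index varying fastest) is one common PackGQA convention and is an assumption here: the commit text does not state the layout ffa uses, and all helper names are hypothetical.

```python
from typing import Tuple


def qhead_per_khead(nhq: int, nhk: int) -> int:
    """Number of query heads that share one KV head (1 means plain MHA)."""
    if nhk <= 0 or nhq <= 0 or nhq % nhk != 0:
        raise ValueError("nhq must be a positive multiple of nhk")
    return nhq // nhk


def packed_rows(seqlen_q: int, group: int) -> int:
    """Rows of the packed Q view per KV head: every token appears once per head in the group."""
    return seqlen_q * group


def unpack_row(packed_row: int, group: int) -> Tuple[int, int]:
    """Map a packed row to (token index, head index within the KV group).

    ASSUMPTION: the head-in-group index varies fastest, so consecutive packed
    rows belong to the same token and to different query heads.
    """
    return packed_row // group, packed_row % group


# README settings before and after this commit: nhq=64 with nhk=8 or nhk=64.
print(qhead_per_khead(64, 8))   # 8 query heads per KV head
print(qhead_per_khead(64, 64))  # 1, i.e. nothing to pack
print(unpack_row(9, 4))         # (2, 1): token 2, head 1 of the group
```

With `nhk: [64]` the group size is 1, so packing has no effect; lowering `nhk` (for example back to 8) is what exercises the packed path in this sketch.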

23 files changed: +2107 -388 lines changed

.github/workflows/build_test.yaml

Lines changed: 1 addition & 1 deletion
@@ -109,7 +109,7 @@ jobs:
 )
 runs-on: [self-hosted]
 container:
-image: registry.cn-sh-01.sensecore.cn/sandai-ccr/magi-base:25.10.3
+image: registry.cn-sh-01.sensecore.cn/sandai-ccr/magi-base:25.10.4
 options: --gpus all --ipc host
 credentials:
 username: ${{ secrets.DOCKER_USER_NAME }}

exps/attn/profile_ffa/README.md

Lines changed: 19 additions & 7 deletions
@@ -7,7 +7,7 @@ For now, we test only for dense and block sparse scenarios.
 #### Model Config
 
 - nhq: [64]
-- nhk: [8]
+- nhk: [64] # change nhk to test different packgqa settings; for ffa backward with block sparse, gqa performance is currently poor.
 - headdim: [128]
 - dtype: [torch.bfloat16]
 
@@ -23,8 +23,11 @@ You can change the dense-related settings in `run_dense_tests` within `ffa_bench
 #### Block sparse Config
 
 - seqlens_to_test = [49152]
-- sparsity_ratios_to_test = [0.1, 0.2, 0.5]
-- block_sizes_to_test = [64, 128]
+- sparsity_ratios_to_test = [0.05, 0.1, 0.2, 0.5, 1.0]
+- q_block_sizes = [64, 128]
+- k_block_sizes = [64, 128]
+- pack_gqa_options = [False]
+- swap_ab_options = [False]
 
 You can change the block_sparse-related settings in `run_block_sparse_tests` within `ffa_benchmark.py`.
 
@@ -45,7 +48,13 @@ You can change the block_sparse-related settings in `run_block_sparse_tests` wit
 TEST_TYPE="dense" # choose from {"dense", "block_sparse"}
 OUTPUT_NAME="output"
 
+# you can add --fwd or --bwd to run fwd or bwd only.
+# by default we run both fwd and bwd.
 PYTHONPATH=../../../ python ffa_benchmark.py --test_type ${TEST_TYPE} --o ${OUT_DIR}/${OUTPUT_NAME}.csv
+
+# you can enable ncu profile for fwd/bwd pass.
+# PYTHONPATH=../../../ ncu -f --set full --nvtx --nvtx-include backward_pass -o ncu_output_name \
+# python ffa_benchmark.py --test_type ${TEST_TYPE} --bwd --o ${OUT_DIR}/${OUTPUT_NAME}.csv
 ```
 
 - `compare_ffa_results.py`: compare two output csv with same mask type.
@@ -126,8 +135,8 @@ In dir `optimize_ffa/benchmark_results_time`
 | Operation | Time (ms) | Description |
 |-------------|-----------|-------------------------|
 | range_merge | -1.0000 | RangeMerge |
-| Prepare | 0.0153 | prepare_mha_forward |
-| Run | 3.4125 | run_mha_forward |
+| Prepare | 0.0153 | prepare_ffa_forward |
+| Run | 3.4125 | run_ffa_forward |
 | Postprocess | 0.0119 | fwd_postprocess |
 | to | 0.0032 | cast output to qdtype |
 
@@ -142,7 +151,10 @@ In dir `optimize_ffa/benchmark_results_time`
 | Operation | Time (ms) | Description |
 |------------|-----------|-------------------------|
 | range_merge| -1.0000 | RangeMerge |
-| Prepare | 0.1265 | prepare_mha_backward |
+| Prepare | 0.1265 | prepare_ffa_backward |
 | Preprocess | 0.1050 | bwd_preprocess |
-| Run | 9.3409 | run_mha_backward |
+| Run | 9.3409 | run_ffa_backward |
 | to | 0.1765 | cast dq, dk, dv |
+
+
+NOTE: For more detailed and accurate performance info, please use ncu.