Commit 85a9a8d

benchmark: trtllm-gen mha with sink, add benchmark args (#1415)

## 📌 Description

[b200] Benchmark results

All runs below use num_qo_heads=64, num_kv_heads=8, head_dim=64, page_size=16.

`python3 benchmarks/bench_trtllm_fmha.py`

| batch_size | seq_len | execution time (ms) | memory bandwidth (GB/s) |
| --- | --- | --- | --- |
| 4 | 1024 | 0.008246400207281113 | 973.91 |
| 4 | 4096 | 0.011433599889278412 | 2801.50 |
| 4 | 8192 | 0.015436799824237823 | 4147.96 |
| 4 | 16384 | 0.03059200048446655 | 4185.12 |
| 128 | 1024 | 0.05228480100631714 | 4915.39 |
| 128 | 4096 | 0.17578879594802857 | 5830.86 |
| 128 | 8192 | 1.6801599502563476 | 1219.53 |
| 128 | 16384 | 5.439588928222657 | 753.18 |
| 256 | 1024 | 0.09621120095252991 | 5342.41 |
| 256 | 4096 | 1.7040287971496584 | 1203.03 |
| 256 | 8192 | 4.829526329040528 | 848.53 |
| 256 | 16384 | 12.651522827148437 | 647.67 |

`python3 benchmarks/bench_attention_sink_triton_sgl_decode.py`

| batch_size | seq_len | execution time (ms) | memory bandwidth (GB/s) |
| --- | --- | --- | --- |
| 4 | 1024 | 0.027424000203609467 | 292.85 |
| 4 | 4096 | 0.06857600063085556 | 467.09 |
| 4 | 8192 | 0.12588800489902496 | 508.64 |
| 4 | 16384 | 0.2343679964542389 | 546.28 |
| 128 | 1024 | 0.09935999661684036 | 2586.55 |
| 128 | 4096 | 0.2919679880142212 | 3510.66 |
| 128 | 8192 | 0.5479679703712463 | 3739.27 |
| 128 | 16384 | 1.060703992843628 | 3862.53 |
| 256 | 1024 | 0.17817600071430206 | 2884.79 |
| 256 | 4096 | 0.566208004951477 | 3620.58 |
| 256 | 8192 | 1.0823359489440918 | 3786.26 |
| 256 | 16384 | 2.114527940750122 | 3875.10 |

`python3 benchmarks/bench_trtllm_fmha.py --sink`

| batch_size | seq_len | execution time (ms) | memory bandwidth (GB/s) |
| --- | --- | --- | --- |
| 4 | 1024 | 0.00806720033288002 | 995.54 |
| 4 | 4096 | 0.011334399878978729 | 2826.02 |
| 4 | 8192 | 0.01525759994983673 | 4196.68 |
| 4 | 16384 | 0.03030720055103302 | 4224.45 |
| 128 | 1024 | 0.05234559774398804 | 4909.68 |
| 128 | 4096 | 0.1760032057762146 | 5823.76 |
| 128 | 8192 | 1.7758047103881835 | 1153.84 |
| 128 | 16384 | 5.47153606414795 | 748.78 |
| 256 | 1024 | 0.09606080055236817 | 5350.78 |
| 256 | 4096 | 1.732806396484375 | 1183.05 |
| 256 | 8192 | 4.8847806930542 | 838.93 |
| 256 | 16384 | 13.09429473876953 | 625.77 |

`python3 benchmarks/bench_attention_sink_triton_sgl_decode.py --sink`

| batch_size | seq_len | execution time (ms) | memory bandwidth (GB/s) |
| --- | --- | --- | --- |
| 4 | 1024 | 0.02755199931561947 | 291.49 |
| 4 | 4096 | 0.06857600063085556 | 467.09 |
| 4 | 8192 | 0.12588800489902496 | 508.64 |
| 4 | 16384 | 0.2343679964542389 | 546.28 |
| 128 | 1024 | 0.09935999661684036 | 2586.55 |
| 128 | 4096 | 0.2919679880142212 | 3510.66 |
| 128 | 8192 | 0.5479999780654907 | 3739.05 |
| 128 | 16384 | 1.060703992843628 | 3862.53 |
| 256 | 1024 | 0.17817600071430206 | 2884.79 |
| 256 | 4096 | 0.566208004951477 | 3620.58 |
| 256 | 8192 | 1.0823359489440918 | 3786.26 |
| 256 | 16384 | 2.114527940750122 | 3875.10 |

## 🔍 Related Issues

## 🚀 Pull Request Checklist

Thank you for contributing to FlashInfer! Before we review your pull request, please make sure the following items are complete.

### ✅ Pre-commit Checks

- [x] I have installed `pre-commit` by running `pip install pre-commit` (or used your preferred method).
- [x] I have installed the hooks with `pre-commit install`.
- [x] I have run the hooks manually with `pre-commit run --all-files` and fixed any reported issues.

> If you are unsure about how to set up `pre-commit`, see [the pre-commit documentation](https://pre-commit.com/).

## 🧪 Tests

- [x] Tests have been added or updated as needed.
- [x] All tests are passing (`unittest`, etc.).

## Reviewer Notes

---------

Co-authored-by: gemini-code-assist[bot] <176961590+gemini-code-assist[bot]@users.noreply.github.com>
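
The reported execution times can be compared across the two benchmark scripts to see where each backend is ahead. The sketch below is not part of the commit: it copies a handful of the timings from the runs without `--sink` and computes the ratio of the Triton time to the trtllm-gen time for each shape. The names `TRTLLM_MS`, `TRITON_MS`, `speedup`, and `compare` are hypothetical helpers introduced only for illustration.

```python
# Execution times in ms, copied from the benchmark output in this description
# (runs without --sink). Keys are (batch_size, seq_len).
# These dictionaries and helpers are illustrative, not part of the repository.
TRTLLM_MS = {
    (4, 1024): 0.008246400207281113,
    (4, 16384): 0.03059200048446655,
    (128, 4096): 0.17578879594802857,
    (128, 8192): 1.6801599502563476,
    (256, 16384): 12.651522827148437,
}
TRITON_MS = {
    (4, 1024): 0.027424000203609467,
    (4, 16384): 0.2343679964542389,
    (128, 4096): 0.2919679880142212,
    (128, 8192): 0.5479679703712463,
    (256, 16384): 2.114527940750122,
}


def speedup(baseline_ms: float, candidate_ms: float) -> float:
    """Return how many times faster the candidate is than the baseline."""
    return baseline_ms / candidate_ms


def compare(trtllm: dict, triton: dict) -> dict:
    """Ratio of Triton time to trtllm-gen time for every shape present in both."""
    return {
        shape: speedup(triton[shape], trtllm[shape])
        for shape in trtllm
        if shape in triton
    }


if __name__ == "__main__":
    for (batch_size, seq_len), ratio in sorted(compare(TRTLLM_MS, TRITON_MS).items()):
        faster = "trtllm-gen" if ratio > 1 else "triton"
        print(f"batch_size={batch_size:>3} seq_len={seq_len:>5}: "
              f"{ratio:5.2f}x ({faster} faster)")
```

A ratio above 1 means the trtllm-gen run was faster for that shape, and a ratio below 1 means the Triton run was faster. On these numbers the trtllm-gen timings lead at small batch sizes and at batch_size=128 with seq_len=4096, while the Triton timings lead at the largest batch and sequence length combinations.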

1 parent 5451029, commit 85a9a8d

3 files changed, +1565 -9 lines changed

benchmarks
