Add one autotuning config to the flex attention benchmark #5303
Conversation
@chengjunlu @whitneywhtsang could you take a look?
@admitric I tried measuring performance with the additional config on PVC and BMG, but noticed no performance difference on the default runners. I suspect the improvement can only be observed with an updated driver; I will kick off a run to confirm once runners with updated drivers are available. Where do you expect the performance improvement? BMG?
PVC on shape [1, 128, 128, 1024, 1024, 192, 128]
@admitric No performance improvement is observed with agama 1188; do we also need vectorization enabled?
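For reference, a minimal repro sketch for this kind of measurement outside the benchmark harness. It assumes the shape tuple means (Z, H_q, H_kv, N_CTX_q, N_CTX_kv, D_HEAD_qk, D_HEAD_v) and that torch.compile'd flex_attention runs on the XPU backend; the device, dtype, and tensor layout are assumptions, not the benchmark's exact setup.

```python
# Hypothetical repro sketch; shape-tuple interpretation, device, and dtype
# are assumptions and may differ from the actual benchmark harness.
import torch
from torch.nn.attention.flex_attention import flex_attention
from triton.testing import do_bench

# Assumed meaning of [1, 128, 128, 1024, 1024, 192, 128]:
Z, H_q, H_kv, N_CTX_q, N_CTX_kv, D_qk, D_v = 1, 128, 128, 1024, 1024, 192, 128

device, dtype = "xpu", torch.float16
q = torch.randn(Z, H_q, N_CTX_q, D_qk, device=device, dtype=dtype)
k = torch.randn(Z, H_kv, N_CTX_kv, D_qk, device=device, dtype=dtype)
v = torch.randn(Z, H_kv, N_CTX_kv, D_v, device=device, dtype=dtype)

compiled = torch.compile(flex_attention)
ms = do_bench(lambda: compiled(q, k, v))  # median time in milliseconds
print(f"flex_attention forward: {ms:.3f} ms")
```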
Changes look good to me.
I have different results on PVC; they reproduce stably. I am using the Triton main branch (at 2908846), with [FlexConfig(64, 32, 2, 4)] hardcoded, agama-1188, and spill size 0.
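Hardcoding the config list is one way to pin the candidate; another lighter-weight way to compare it against the default is to pass a kernel_options dict to flex_attention, which can override the forward kernel's tiling. The option keys below are assumptions based on the inductor flex-attention template and may differ across PyTorch versions.

```python
# Hypothetical alternative to editing the autotune list: pin the forward
# kernel to the candidate config via kernel_options. Key names are assumed
# and may vary between PyTorch versions.
import torch
from torch.nn.attention.flex_attention import flex_attention

q = torch.randn(1, 128, 1024, 192, device="xpu", dtype=torch.float16)
k = torch.randn(1, 128, 1024, 192, device="xpu", dtype=torch.float16)
v = torch.randn(1, 128, 1024, 128, device="xpu", dtype=torch.float16)

pinned = torch.compile(flex_attention)
out = pinned(
    q, k, v,
    kernel_options={
        "BLOCK_M": 64,      # FlexConfig block_m
        "BLOCK_N": 32,      # FlexConfig block_n
        "num_stages": 2,
        "num_warps": 4,
    },
)
```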
OK, I think we need to wait for the new driver to be available in CI so that we can confirm the new config works as expected. Note that the BMG performance impact is the one we should prioritize.
This PR adds a FlexAttention autotuner config that can show better performance on one of the shapes.
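A minimal sketch of the kind of change described here, assuming the benchmark derives its autotune space from a list of FlexConfig(block_m, block_n, num_stages, num_warps) entries; the module layout and the pre-existing candidate values are illustrative, not the repository's actual file contents.

```python
# Illustrative only: the existing entries are assumptions; the concrete change
# in this PR is adding the FlexConfig(64, 32, 2, 4) candidate to the autotune space.
from dataclasses import dataclass

@dataclass(frozen=True)
class FlexConfig:
    block_m: int      # BLOCK_M tile size
    block_n: int      # BLOCK_N tile size
    num_stages: int
    num_warps: int

flex_attn_fwd_autotune_configs = [
    FlexConfig(128, 64, 3, 4),   # existing candidates (illustrative values)
    FlexConfig(128, 32, 3, 4),
    FlexConfig(64, 32, 2, 4),    # new candidate added by this PR
]
```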