SparseAttention ONNX Contrib Op Implementation #4275
base: develop
Conversation
This build is not recommended to merge 🔴

❌ bert-mrpc-tf: ERROR - check error output

2025-09-03 10:20:56.197188: I tensorflow/core/platform/cpu_feature_guard.cc:210] This TensorFlow binary is optimized to use available CPU instructions in performance-critical operations. To enable the following instructions: SSE3 SSE4.1 SSE4.2 AVX AVX2 FMA, in other operations, rebuild TensorFlow with the appropriate compiler flags.
Traceback (most recent call last):
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 359, in <module>
    main()
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 306, in main
    graph = load_tf_graph(model_name)
  File "/src/AMDMIGraphX/tools/accuracy/accuracy_checker.py", line 300, in load_tf_graph
    graph_def.ParseFromString(f.read())
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/lib/io/file_io.py", line 116, in read
    self._preread_check()
  File "/usr/local/lib/python3.10/dist-packages/tensorflow/python/lib/io/file_io.py", line 77, in _preread_check
    self._read_buf = _pywrap_file_io.BufferedInputStream(
tensorflow.python.framework.errors_impl.UnimplementedError: File system scheme '[local]' not implemented (file: '/new-saved-models/tf-misc/bert_mrpc1.pb')

🔴 bert_large_uncased_fp16: FAILED: MIGraphX is not within tolerance - check verbose output
🔴 mask-rcnn: FAILED: MIGraphX is not within tolerance - check verbose output
updates);
}

instruction_ref make_block_masks(module& mod,
An issue arises when applying the block mask.
The output of the first GEMM has shape BNSM, where:
B = batch size
N = num. heads
S = sequence length
M = max cache sequence length
The block mask, once unpacked and expanded, will have shape BNXX, where:
X = max_blocks * sparse_block_size
In cases where X != S and/or X != M, the block mask needs to be trimmed down to BNSM dims so that it can be applied to the GEMM output with a where.
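For concreteness, a minimal numpy sketch of the shape bookkeeping described above (all sizes are made up for illustration; the correct start offset on axis 2 is case-dependent, here it is simply 0):

```python
import numpy as np

# Made-up sizes for illustration.
B, N = 2, 4                          # batch size, num. heads
S, M = 3, 8                          # sequence length, max cache sequence length
max_blocks, sparse_block_size = 2, 8
X = max_blocks * sparse_block_size   # 16; dim of the unpacked/expanded mask

gemm_out = np.ones((B, N, S, M))                 # first GEMM output, BNSM
block_mask = np.zeros((B, N, X, X), dtype=bool)  # expanded block mask, BNXX
block_mask[..., ::2] = True                      # arbitrary mask pattern

# X != S and X != M, so trim the mask to BNSM before applying it with a
# where (masked-out positions are filled with -inf, as before a softmax).
trimmed = block_mask[:, :, :S, :M]
masked = np.where(trimmed, gemm_out, -np.inf)
assert trimmed.shape == (B, N, S, M)
assert masked.shape == gemm_out.shape
```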
The particular case that causes the issue is S = 1.
When the sequence length is one, the block mask needs to be sliced down to size 1 on axis 2, i.e. sliced from some start index k to k + 1. But what should k be?
This detail is not documented, but the implementation tells us it should be past_sequence_length.
How is past_sequence_length obtained?
The operator has an input called key_total_sequence_lengths, which is described as:
"1D tensor with shape (batch_size) where each value is total sequence length of key excluding paddings."
past_sequence_length is obtained by subtracting the sequence length from key_total_sequence_lengths.
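A small numpy sketch of that arithmetic, with made-up per-batch values (key_total_sequence_lengths stands in for the runtime operator input described above):

```python
import numpy as np

S = 1                                          # decode step: sequence length 1
key_total_sequence_lengths = np.array([5, 9])  # per-batch, runtime values

# past_sequence_length = total key length minus the current sequence length.
past_sequence_length = key_total_sequence_lengths - S

# For S == 1 the BNXX mask must be sliced to a single row on axis 2,
# from past_sequence_length to past_sequence_length + 1, per batch.
# The start index is a runtime value, which is what makes the slice dynamic.
B, N, X = 2, 4, 16
block_mask = np.arange(B * N * X * X).reshape(B, N, X, X)
rows = np.stack([block_mask[b, :, p:p + 1, :]
                 for b, p in enumerate(past_sequence_length)])
assert rows.shape == (B, N, 1, X)
```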
As a consequence, the slice start and end depend on a runtime value, making the slice dynamic.
Not sure how to circumvent this.
@TedThemistokleous