Skeleton for Attention(23) on CUDA #25684
base: main
Conversation
You can commit the suggested changes from lintrunner.
    T* output_qk  // Q*K output
) {
template class NaiveAttention<T>;                                                                        \
template void ComputeAttentionProbs<T>(cudaStream_t stream, T* attention_probs, const T* Q, const T* K, \
                                       const Tensor* mask_index, const AttentionParameters& parameters, \
                                       const T* past_key, T* present_key, T* output_qk);                 \
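The trailing backslashes indicate that these lines form the body of an explicit-instantiation macro. A hypothetical sketch of how such a macro might be expanded once per supported element type; the macro name INSTANTIATE_ATTENTION_KERNELS is assumed for illustration and is not taken from the PR:

// Hypothetical sketch: Tensor and AttentionParameters come from the
// onnxruntime headers; the macro name is an assumption.
#define INSTANTIATE_ATTENTION_KERNELS(T)                                 \
  template class NaiveAttention<T>;                                      \
  template void ComputeAttentionProbs<T>(                                \
      cudaStream_t stream, T* attention_probs, const T* Q, const T* K,   \
      const Tensor* mask_index, const AttentionParameters& parameters,   \
      const T* past_key, T* present_key, T* output_qk);

INSTANTIATE_ATTENTION_KERNELS(float)
INSTANTIATE_ATTENTION_KERNELS(half)  // half is CUDA's fp16 type from <cuda_fp16.h>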
Code scanning / lintrunner: CLANGFORMAT/format warning. Run lintrunner -a to apply this patch.
The MultiHeadAttention CUDA implementation is close to the ONNX Attention definition.
Description
Draft for Attention(23) on CUDA.
There are two possible directions for the implementation.
contrib ops
As it stands, this path does not seem to support the softcap option or all of the modes for the qk output, and bfloat16 support is unclear.
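For reference, Attention(23) defines softcap as rescaling the scaled Q*K^T scores to softcap * tanh(scores / softcap) before the softmax. A minimal CUDA sketch of that transform over a flattened float score buffer; the kernel name and launch shape are illustrative assumptions, not code from this PR:

// Illustrative sketch: applies scores[i] = softcap * tanh(scores[i] / softcap)
// elementwise, matching the Attention(23) softcap definition.
__global__ void ApplySoftcapKernel(float* scores, int n, float softcap) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) {
    scores[i] = softcap * tanhf(scores[i] / softcap);
  }
}

// Launched over the whole score buffer on the op's stream, e.g.:
//   ApplySoftcapKernel<<<(n + 255) / 256, 256, 0, stream>>>(scores, n, softcap);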
cublasLtMatmul
This path follows the same implementation as the one made for CPU and relies on cublasLtMatmul, so it should handle all types (bfloat16, float8). It is still missing the CUDA code for the cache copy and for softcap.
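A minimal sketch of what the per-head Q*K^T GEMM could look like through cublasLtMatmul, with bfloat16 inputs and float32 accumulation. The function name, the single-head shapes, and the omitted status checks are simplifications for illustration, not this PR's code:

#include <cublasLt.h>
#include <cuda_bf16.h>

// cuBLAS is column-major, so the row-major product scores = Q * K^T is
// expressed as the column-major product scores^T = op_T(K^T) * Q^T.
void QKGemm(cublasLtHandle_t handle, cudaStream_t stream,
            const __nv_bfloat16* Q,  // [seq_len_q, head_size], row-major
            const __nv_bfloat16* K,  // [seq_len_k, head_size], row-major
            float* scores,           // [seq_len_q, seq_len_k], row-major
            int seq_len_q, int seq_len_k, int head_size, float scale) {
  cublasLtMatmulDesc_t op_desc;
  cublasLtMatmulDescCreate(&op_desc, CUBLAS_COMPUTE_32F, CUDA_R_32F);
  cublasOperation_t trans_a = CUBLAS_OP_T, trans_b = CUBLAS_OP_N;
  cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_TRANSA, &trans_a, sizeof(trans_a));
  cublasLtMatmulDescSetAttribute(op_desc, CUBLASLT_MATMUL_DESC_TRANSB, &trans_b, sizeof(trans_b));

  // Layouts describe the buffers as stored (the column-major view of the
  // row-major arrays above).
  cublasLtMatrixLayout_t a_desc, b_desc, d_desc;
  cublasLtMatrixLayoutCreate(&a_desc, CUDA_R_16BF, head_size, seq_len_k, head_size);  // K
  cublasLtMatrixLayoutCreate(&b_desc, CUDA_R_16BF, head_size, seq_len_q, head_size);  // Q
  cublasLtMatrixLayoutCreate(&d_desc, CUDA_R_32F, seq_len_k, seq_len_q, seq_len_k);   // scores^T

  float beta = 0.0f;  // scale is typically 1/sqrt(head_size)
  cublasLtMatmul(handle, op_desc, &scale, K, a_desc, Q, b_desc, &beta,
                 scores, d_desc, scores, d_desc,
                 /*algo=*/nullptr, /*workspace=*/nullptr, 0, stream);

  cublasLtMatrixLayoutDestroy(d_desc);
  cublasLtMatrixLayoutDestroy(b_desc);
  cublasLtMatrixLayoutDestroy(a_desc);
  cublasLtMatmulDescDestroy(op_desc);
}

Batching over all heads at once, rather than looping as above, would use the strided-batch layout attributes (CUBLASLT_MATRIX_LAYOUT_BATCH_COUNT and CUBLASLT_MATRIX_LAYOUT_STRIDED_BATCH_OFFSET).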