Conversation

xadupre (Member) commented Aug 7, 2025

Description

Draft for Attention(23) on CUDA.

There are two possible directions for the implementation.

contribops

As it stands, it does not seem to support the softcap option or all the modes for the qk output. What about bfloat16?

cublasLtMatMul

Follows the same implementation as the one made on CPU. It relies on cublasLtMatMul and should handle all types (bfloat16, float8). It still misses the CUDA code for the cache copy and softcap; a sketch of the softcap part is given below.
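
As a hedged sketch only (not code from this PR): the Attention-23 softcap rescales every raw Q*K^T score as `softcap * tanh(score / softcap)` before the softmax, so an elementwise CUDA kernel along these lines would cover the missing piece. The kernel and launcher names are hypothetical, and float stands in for the templated type used in the PR:

```cuda
#include <cuda_runtime.h>
#include <cstdint>
#include <math.h>

// Elementwise soft-capping of the raw Q*K^T scores, applied in place
// before the softmax: score <- softcap * tanh(score / softcap).
__global__ void ApplySoftcapKernel(float* scores, int64_t n, float softcap) {
  int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i < n) {
    scores[i] = softcap * tanhf(scores[i] / softcap);
  }
}

// Hypothetical launcher; softcap == 0 means the option is disabled.
void LaunchApplySoftcap(cudaStream_t stream, float* scores, int64_t n, float softcap) {
  if (softcap <= 0.0f || n == 0) return;
  constexpr int kThreads = 256;
  const int blocks = static_cast<int>((n + kThreads - 1) / kThreads);
  ApplySoftcapKernel<<<blocks, kThreads, 0, stream>>>(scores, n, softcap);
}
```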

github-actions bot (Contributor) left a comment


You can commit the suggested changes from lintrunner.

Comment on lines +30 to +31
T* output_qk // Q*K output
) {

Suggested change
-T* output_qk // Q*K output
-) {
+T* output_qk // Q*K output
+) {

template class NaiveAttention<T>; \
template void ComputeAttentionProbs<T>(cudaStream_t stream, T * attention_probs, const T* Q, const T* K, const Tensor* mask_index, \
const AttentionParameters& parameters, const T* past_key, T* present_key, \
T* output_qk); \

Suggested change
-T* output_qk); \
+T* output_qk); \

@@ -0,0 +1,420 @@
/*

Check warning (Code scanning / lintrunner): CLANGFORMAT/format Warning
See https://clang.llvm.org/docs/ClangFormat.html. Run lintrunner -a to apply this patch.
tianleiwu (Contributor) commented Aug 8, 2025

The MultiHeadAttention CUDA implementation is close to the ONNX Attention definition. Softcap is easy to add, and the qk output only needs the buffers to be passed through. You can follow the GroupQueryAttention CUDA kernel to add these.
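
For the other missing piece, the cache copy, here is a minimal sketch of what concatenating past_key with the new K into present_key could look like, assuming a BNSH layout (batch, num_heads, sequence, head_size) for all three buffers. The kernel and launcher names are hypothetical and are not taken from GroupQueryAttention; float again stands in for the templated type:

```cuda
#include <cuda_runtime.h>
#include <cstdint>

// Concatenate past_key (B, N, past_len, H) and new_key (B, N, new_len, H)
// into present_key (B, N, past_len + new_len, H). One thread per element
// of present_key; batch and head dimensions are folded into batch_x_heads.
__global__ void ConcatKeyCacheKernel(const float* past_key, const float* new_key,
                                     float* present_key, int64_t batch_x_heads,
                                     int64_t past_len, int64_t new_len,
                                     int64_t head_size) {
  const int64_t total_len = past_len + new_len;
  const int64_t n = batch_x_heads * total_len * head_size;
  int64_t i = blockIdx.x * static_cast<int64_t>(blockDim.x) + threadIdx.x;
  if (i >= n) return;
  // Decompose the flat present_key index into (bn, s, h).
  const int64_t h = i % head_size;
  const int64_t s = (i / head_size) % total_len;
  const int64_t bn = i / (head_size * total_len);
  if (s < past_len) {
    present_key[i] = past_key[(bn * past_len + s) * head_size + h];
  } else {
    present_key[i] = new_key[(bn * new_len + (s - past_len)) * head_size + h];
  }
}

// Hypothetical launcher mirroring the past_key/present_key arguments of
// ComputeAttentionProbs quoted above.
void LaunchConcatKeyCache(cudaStream_t stream, const float* past_key,
                          const float* new_key, float* present_key,
                          int64_t batch_x_heads, int64_t past_len,
                          int64_t new_len, int64_t head_size) {
  const int64_t n = batch_x_heads * (past_len + new_len) * head_size;
  constexpr int kThreads = 256;
  const int blocks = static_cast<int>((n + kThreads - 1) / kThreads);
  ConcatKeyCacheKernel<<<blocks, kThreads, 0, stream>>>(
      past_key, new_key, present_key, batch_x_heads, past_len, new_len, head_size);
}
```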
