
Conversation


@samanklesaria commented Nov 24, 2025

Many operations, such as `dot` and `conv`, take an `out_sharding` argument to resolve the output sharding when the arguments' shardings are incompatible under explicit sharding. Currently, `dot_product_attention` lacks such an argument.
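
For context, a minimal sketch of how `out_sharding` disambiguates a contraction under explicit sharding; the mesh axis name, device count, and shapes are illustrative assumptions, and the snippet needs at least two devices to run as written.

```python
import jax
import jax.numpy as jnp
from jax.sharding import AxisType, PartitionSpec as P

# Two-device mesh whose single axis uses explicit (in-type) sharding.
mesh = jax.make_mesh((2,), ("data",), axis_types=(AxisType.Explicit,))

with jax.sharding.use_mesh(mesh):
    x = jnp.ones((8, 16))   # replicated activations
    w = jnp.ones((16, 32))  # replicated weights
    # out_sharding pins the otherwise ambiguous result sharding of the
    # contraction: here the batch dimension ends up sharded over "data".
    y = jnp.einsum("bf,fd->bd", x, w, out_sharding=P("data", None))
```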

Ambiguity about the output sharding arises only at binary operations, and dot-product attention contains three of them: the $QK^T$ contraction, the addition of the bias term, and the multiplication by $V$. However, the output shardings of these intermediates overlap substantially, so one additional `s_sharding` argument, giving the sharding of the intermediate attention scores, makes everything identifiable (see the sketch below).
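
A minimal sketch of the proposed call, assuming the keyword names `out_sharding` and `s_sharding` from this PR land as described; the mesh, shapes, and partition specs are illustrative assumptions, not the merged API.

```python
import jax
import jax.numpy as jnp
from jax.sharding import AxisType, PartitionSpec as P

mesh = jax.make_mesh((2,), ("data",), axis_types=(AxisType.Explicit,))

with jax.sharding.use_mesh(mesh):
    # (batch, seq_len, num_heads, head_dim); replicated inputs for simplicity.
    q = jnp.zeros((8, 128, 4, 64))
    k = jnp.zeros((8, 128, 4, 64))
    v = jnp.zeros((8, 128, 4, 64))

    out = jax.nn.dot_product_attention(
        q, k, v,
        # Sharding of the final output (the multiplication by V).
        out_sharding=P("data", None, None, None),
        # Sharding of the QK^T / bias / softmax intermediate
        # (hypothetical keyword proposed in this PR).
        s_sharding=P("data", None, None, None),
    )
```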

@samanklesaria force-pushed the shard_attention branch 7 times, most recently from 9d541d6 to 133b1ca on November 24, 2025 at 19:32
@samanklesaria force-pushed the shard_attention branch 6 times, most recently from 2d27e5f to df9e8d8 on November 24, 2025 at 22:32
@samanklesaria marked this pull request as ready for review on November 24, 2025 at 23:03