
Commit 9cf0f04

Updates attention mask and bias documentation for MQA/GQA
Clarifies that the attention mask and bias parameters accept multiple tensor shapes to accommodate Multi-Query Attention (MQA) and Grouped Query Attention (GQA) patterns, in addition to the standard multi-head attention format. Adds explicit documentation for the supported shapes, including the broadcast-compatible head dimensions (nheads, nheads_k, or 1) used by these attention variants.
1 parent 1050261 commit 9cf0f04


1 file changed: +6 −2 lines


flash_dmattn/flash_dmattn_interface.py

Lines changed: 6 additions & 2 deletions
@@ -361,10 +361,14 @@ def flash_dmattn_func(
     key: torch.Tensor. The key tensor of shape (batch_size, seqlen, nheads_k, headdim)
     value: torch.Tensor. The value tensor of shape (batch_size, seqlen, nheads_k, headdim)
     attn_mask: torch.Tensor, optional. The attention mask boolean tensor of
-        shape (batch_size, nheads_k, seqlen_q, seqlen_k) to apply to the attention scores.
+        shape (batch_size, nheads, seqlen_q, seqlen_k) to apply to the attention scores.
+        Also supports shape (batch_size, nheads_k, seqlen_q, seqlen_k) or
+        (batch_size, 1, seqlen_q, seqlen_k) for MQA/GQA.
         If None, no mask is applied.
     attn_bias: torch.Tensor, optional. The attention bias float tensor of
-        shape (batch_size, nheads_k, seqlen_q, seqlen_k) to add to the attention scores.
+        shape (batch_size, nheads, seqlen_q, seqlen_k) to add to the attention scores.
+        Also supports shape (batch_size, nheads_k, seqlen_q, seqlen_k) or
+        (batch_size, 1, seqlen_q, seqlen_k) for MQA/GQA.
         If None, no bias is applied.
     is_causal: bool. Whether to apply causal attention mask (e.g., for auto-regressive modeling).
     scale: float. The scaling of QK^T before applying softmax.
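A minimal sketch of the broadcastable shapes the updated docstring describes: the head dimension of attn_mask and attn_bias may be nheads, nheads_k, or 1, and is broadcast against the query heads inside the kernel. The import path and the commented-out call are assumptions based on this file's path and the documented signature, not verified against the package.

import torch

# Shapes taken from the docstring above: query uses nheads heads, key/value
# use nheads_k heads (nheads_k < nheads for GQA, nheads_k == 1 for MQA).
batch_size, seqlen_q, seqlen_k = 2, 128, 128
nheads, nheads_k, headdim = 8, 2, 64

# Flash-style kernels typically require CUDA and fp16/bf16; fall back to CPU
# tensors here only to illustrate the shapes.
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.bfloat16

query = torch.randn(batch_size, seqlen_q, nheads, headdim, device=device, dtype=dtype)
key = torch.randn(batch_size, seqlen_k, nheads_k, headdim, device=device, dtype=dtype)
value = torch.randn(batch_size, seqlen_k, nheads_k, headdim, device=device, dtype=dtype)

# Any of the three documented head dimensions is accepted:
#   (batch_size, nheads,   seqlen_q, seqlen_k)  - one mask/bias per query head
#   (batch_size, nheads_k, seqlen_q, seqlen_k)  - one per KV head (GQA)
#   (batch_size, 1,        seqlen_q, seqlen_k)  - shared across all heads (MQA)
attn_mask = torch.ones(batch_size, nheads_k, seqlen_q, seqlen_k, device=device, dtype=torch.bool)
attn_bias = torch.zeros(batch_size, 1, seqlen_q, seqlen_k, device=device, dtype=dtype)

# Hypothetical call, assuming flash_dmattn_func is importable from the file
# changed in this commit; keyword names follow the docstring.
# from flash_dmattn.flash_dmattn_interface import flash_dmattn_func
# out = flash_dmattn_func(query, key, value, attn_mask=attn_mask,
#                         attn_bias=attn_bias, is_causal=True)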
