
Commit 5978f1d

[JAX] Default to fused attention in JAX DPA (NVIDIA#2363)
* Default to fused attention in JAX DPA
* Consolidate documentation for DPA in JAX
* Correctly update the documentation for defaults in JAX DPA

Signed-off-by: Kshitij Lakhani <[email protected]>
Co-authored-by: greptile-apps[bot] <165735046+greptile-apps[bot]@users.noreply.github.com>
1 parent 26aad6b commit 5978f1d
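For users upgrading across this commit: fused attention is now on by default, so the previous behavior must be requested explicitly. A minimal sketch (not part of this commit) of opting back out; the variable is read inside DotProductAttention.__call__, so it must be set before the forward pass runs:

    import os

    # Restore the pre-commit default (unfused attention) by setting the
    # environment variable before any DotProductAttention call.
    os.environ["NVTE_FUSED_ATTN"] = "0"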

File tree

1 file changed: +6 -5 lines

transformer_engine/jax/flax/transformer.py

Lines changed: 6 additions & 5 deletions
@@ -407,10 +407,10 @@ class DotProductAttention(nn.Module): # pylint: disable=too-few-public-methods
     Users can select between these two backends via the :attr:`NVTE_FUSED_ATTN` environment
     variable:
 
-    * Set :attr:`NVTE_FUSED_ATTN=0` for unfused attention (default).
-    * Set :attr:`NVTE_FUSED_ATTN=1` for fused attention. If the required cuDNN fused attention
-      kernel is not available on the system, a warning will be issued, and the module will
-      automatically fall back to the unfused backend.
+    * Set :attr:`NVTE_FUSED_ATTN=0` for unfused attention.
+    * Set :attr:`NVTE_FUSED_ATTN=1` for fused attention (default). If the required cuDNN fused
+      attention kernel is not available on the system, a warning will be issued, and the module
+      will automatically fall back to the unfused backend.
 
     .. note::
         The DotProductAttention default setting enables non-deterministic kernels for reduced
@@ -602,7 +602,8 @@ def __call__(
         else:
            assert bias is not None
 
-        enable_fused_attn = int(os.getenv("NVTE_FUSED_ATTN", "0"))
+        # Use fused attn (if kernel check below passes) by default
+        enable_fused_attn = int(os.getenv("NVTE_FUSED_ATTN", "1"))
 
         sequence_dim = 0 if self.transpose_batch_sequence else 1
         seqlen_q = query.shape[sequence_dim]
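To make the new default concrete, below is an illustrative, self-contained sketch of the selection logic this commit changes. The function name resolve_attention_backend and the fused_kernel_available parameter are hypothetical names for this example only, not helpers from transformer.py; only the NVTE_FUSED_ATTN handling mirrors the committed code.

    import os
    import warnings

    def resolve_attention_backend(fused_kernel_available: bool) -> str:
        # Fused attention is now the default: a missing env var reads as "1".
        enable_fused_attn = int(os.getenv("NVTE_FUSED_ATTN", "1"))
        if enable_fused_attn and not fused_kernel_available:
            # Documented fallback path: warn, then use the unfused backend.
            warnings.warn(
                "cuDNN fused attention kernel unavailable; "
                "falling back to unfused attention."
            )
            return "unfused"
        return "fused" if enable_fused_attn else "unfused"

    # Example: with NVTE_FUSED_ATTN unset, a system lacking the cuDNN kernel
    # warns and returns "unfused"; a system that has it returns "fused".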
