[ROCm] Skip 6 Pallas FusedAttentionTest variants exceeding gfx942 LDS limit by srinivamd · Pull Request #797 · ROCm/jax

srinivamd · 2026-06-09T10:49:56Z

Summary

Deselect 6 FusedAttentionTest variants from ROCm CI that fail on all gfx942 GPUs (MI300X/MI308X) with RESOURCE_EXHAUSTED: Shared memory size limit exceeded.

Root Cause

Pallas fused-attention tile configs are designed for NVIDIA's 128KB+ shared memory per SM. AMD gfx942 has 64KB LDS per CU. The XLA autotuner does not pre-filter configs exceeding the target GPU's LDS limit.

| Test | LDS Requested (bytes) | LDS Available (bytes) |
| --- | --- | --- |
| `test_fused_attention_fwd0` | 98,304 | 65,536 |
| `test_fused_attention_fwd1` | 98,304 | 65,536 |
| `test_fused_attention_fwd4` | 81,920 | 65,536 |
| `test_fused_attention_fwd7` | 81,920 | 65,536 |
| `test_fused_attention_bwd7` | 98,304 | 65,536 |
| `test_fused_attention_bwd8` | 81,920 | 65,536 |
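
The figures in the table can be checked with a few lines of arithmetic. The sketch below is illustrative only: the dictionary restates the table, and `LDS_LIMIT_BYTES` and `overage_kib` are names chosen here for the example, not identifiers from JAX, Pallas, or XLA.

```python
# Restates the table above; every name here is illustrative, not a JAX/XLA API.
LDS_LIMIT_BYTES = 64 * 1024  # 65,536 bytes of LDS per CU on gfx942

requested_bytes = {
    "test_fused_attention_fwd0": 98_304,
    "test_fused_attention_fwd1": 98_304,
    "test_fused_attention_fwd4": 81_920,
    "test_fused_attention_fwd7": 81_920,
    "test_fused_attention_bwd7": 98_304,
    "test_fused_attention_bwd8": 81_920,
}


def overage_kib(requested: int, limit: int = LDS_LIMIT_BYTES) -> float:
    """Return how far a request exceeds the limit, in KiB (<= 0 means it fits)."""
    return (requested - limit) / 1024


# Keep only the variants whose request is larger than the limit.
over_limit = {
    name: overage_kib(req)
    for name, req in requested_bytes.items()
    if req > LDS_LIMIT_BYTES
}

for name, kib in over_limit.items():
    print(f"{name}: over by {kib:.0f} KiB")
```

Each variant requests either 96 KiB or 80 KiB, so all six exceed the 64 KiB limit, by 32 KiB or 16 KiB respectively.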

Approach

Uses --deselect in ci/run_pytest_rocm.sh (single-accelerator block only) rather than in-test skipTest guards. This is preferred for v0.9.1 because:

- Stable test IDs: v0.9.1 is a release branch, so the parameterization won't change.
- Provably correct: test IDs were confirmed across two independent CI runs (jax-ml/jax#1391 and jax-ml/jax#1451).
- Matches the existing pattern: 3 --deselect entries are already present in the file.
- No variant numbering risk: upstream jax-ml/jax#34722 uses skipTest with parameter tuples, but sample_product variant numbering differs between main and v0.9.1.
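
As a rough illustration of what deselecting by test ID looks like, the sketch below builds a pytest command line with one `--deselect` option per failing variant. It is not the content of ci/run_pytest_rocm.sh: the directory in `TEST_FILE` is a hypothetical placeholder (only the file name gpu_ops_test.py appears in this PR), and `deselect_args` is a helper invented for the example.

```python
import shlex

# Hypothetical path: this PR names gpu_ops_test.py but not its directory.
TEST_FILE = "tests/pallas/gpu_ops_test.py"

# The six variants this PR deselects.
DESELECTED = [
    "test_fused_attention_fwd0",
    "test_fused_attention_fwd1",
    "test_fused_attention_fwd4",
    "test_fused_attention_fwd7",
    "test_fused_attention_bwd7",
    "test_fused_attention_bwd8",
]


def deselect_args(test_file: str, class_name: str, test_names: list[str]) -> list[str]:
    """Build one `--deselect <node id>` pair per test name."""
    args: list[str] = []
    for name in test_names:
        args += ["--deselect", f"{test_file}::{class_name}::{name}"]
    return args


cmd = [
    "python3", "-m", "pytest", TEST_FILE,
    *deselect_args(TEST_FILE, "FusedAttentionTest", DESELECTED),
]
print(shlex.join(cmd))
```

Because each node ID names the `FusedAttentionTest` class explicitly, tests with the same method names in other classes are not matched by these entries.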

FusedAttentionInterpretTest variants are not deselected — they use a reference Python implementation (no GPU kernel) and continue to validate correctness.

Context

- v0.9.2 already has in-test skipTest guards (ROCm/jax@8790d5b) but with different variant-to-parameter mappings
- Upstream: jax-ml/jax#34722 (main), openxla/xla#39050 (split_k fix, not in v0.9.1 XLA pin)
- Tracked in: ROCM-24925, ROCM-25777

Test plan

- Verify nightly CI on gfx942 reports 0 failures (currently 6)
- Verify FusedAttentionInterpretTest variants still run and pass
- Verify other FusedAttentionTest variants (fwd2, fwd3, fwd5, fwd6, fwd8, fwd9, bwd0-6, bwd9) still run and pass
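
The list of variants expected to keep running follows from set subtraction. This is a minimal sketch that assumes each direction has ten variants indexed 0-9, which the test plan implies but does not state outright.

```python
# Assumption: ten variants per direction, indexed 0-9, as the test plan implies.
ALL_INDICES = set(range(10))

# Indices deselected by this PR.
deselected = {"fwd": {0, 1, 4, 7}, "bwd": {7, 8}}

# Whatever is not deselected should still run.
remaining = {
    direction: sorted(ALL_INDICES - skipped)
    for direction, skipped in deselected.items()
}
print(remaining)
```

Under that assumption the result is fwd 2, 3, 5, 6, 8, 9 and bwd 0-6, 9, which matches the variants named in the last test-plan item.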

… limit

These 6 tests fail on all AMD gfx942 GPUs (MI300X/MI308X) with:

RESOURCE_EXHAUSTED: Shared memory size limit exceeded
requested 81920-98304, available 65536

Root cause: Pallas fused-attention tile configs are designed for NVIDIA's 128KB+ shared memory; gfx942 has 64KB LDS per CU. The XLA autotuner does not pre-filter configs exceeding LDS.

FusedAttentionInterpretTest variants are unaffected (reference Python implementation, no GPU kernel) and continue to run.

Upstream: jax-ml#34722, openxla/xla#39050
Tracked in: ROCM-24925, ROCM-25777

srinivamd · 2026-06-09T10:54:05Z

v0.9.2 vs v0.9.1 skip comparison

v0.9.2 already has in-test skipTest guards for FusedAttentionTest (gpu_ops_test.py lines 103-115, 208-228), cherry-picked from upstream jax-ml/jax#34722. No FusedAttention CI failures are reported on v0.9.2.

However, the variant numbering differs between branches — the same LDS-exceeding parameter combinations land on different test indices:

v0.9.2 in-test skips (by parameter tuple)

| Variant | Parameters (batch, seq, heads, dim, blocks) |
| --- | --- |
| fwd0 | `(1, 384, 2, 72, block_q=128/k=128, causal=False, fwd=True, seg=True)` |
| fwd5 | `(1, 384, 1, 72, block_q=64/k=64, causal=False, fwd=True, seg=True)` |
| fwd7 | `(1, 384, 1, 72, block_q=64/k=128, causal=False, fwd=False, seg=True)` |
| fwd8 | `(2, 384, 1, 64, block_q=64/k=64, causal=True, fwd=False, seg=True)` |
| bwd1 | `(1, 384, 1, 128, block_q=64/k=64/..., causal=True, seg=False)` |
| bwd2 | `(2, 384, 1, 32, block_q=64/k=128/..., causal=False, seg=False)` |
| bwd7 | `(1, 384, 1, 72, block_q=128/k=128/..., causal=False, seg=True)` |
| bwd9 | `(1, 384, 2, 64, block_q=64/k=64/..., causal=True, seg=False)` |
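
For readers unfamiliar with the mechanism, here is a minimal sketch of a skip keyed on parameter tuples rather than on test index. It is not the code in gpu_ops_test.py: the class, the `on_rocm` placeholder, and the tuple layout `(batch, seq, heads, dim, block_q, block_k, causal, fwd, seg)` are assumptions made for illustration, with the two tuples copied from the fwd0 and fwd5 rows above.

```python
import unittest

# Two forward tuples from the table above, in an assumed field order:
# (batch, seq, heads, dim, block_q, block_k, causal, fwd, seg)
LDS_EXCEEDING_FWD = {
    (1, 384, 2, 72, 128, 128, False, True, True),
    (1, 384, 1, 72, 64, 64, False, True, True),
}


def on_rocm() -> bool:
    """Placeholder platform check; a real test would query the JAX backend."""
    return True


class FusedAttentionSketch(unittest.TestCase):
    def check(self, params):
        # Skip by parameter values, so the decision does not depend on test index.
        if on_rocm() and params in LDS_EXCEEDING_FWD:
            self.skipTest("tile config exceeds the 64 KiB LDS limit on gfx942")
        # The real kernel-vs-reference comparison would run here.

    def test_listed_combo(self):
        self.check((1, 384, 2, 72, 128, 128, False, True, True))

    def test_unlisted_combo(self):
        self.check((2, 384, 2, 64, 128, 128, False, True, False))


# Run the two tests in-process and report how many were skipped.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(FusedAttentionSketch)
result = unittest.TestResult()
suite.run(result)
print(f"ran={result.testsRun} skipped={len(result.skipped)}")
```

A guard like this only skips the combinations listed in the set, so it is only as accurate as the list of tuples maintained for that branch.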

v0.9.1 failures (this PR deselects these)

| Variant | LDS Requested (bytes) | LDS Available (bytes) |
| --- | --- | --- |
| fwd0 | 98,304 | 65,536 |
| fwd1 | 98,304 | 65,536 |
| fwd4 | 81,920 | 65,536 |
| fwd7 | 81,920 | 65,536 |
| bwd7 | 98,304 | 65,536 |
| bwd8 | 81,920 | 65,536 |

Summary

| | v0.9.2 | v0.9.1 (before) | v0.9.1 (this PR) |
| --- | --- | --- | --- |
| Mechanism | In-test `skipTest` by parameter tuple | None | `--deselect` by test ID |
| fwd skips | fwd0, fwd5, fwd7, fwd8 | none | fwd0, fwd1, fwd4, fwd7 |
| bwd skips | bwd1, bwd2, bwd7, bwd9 | none | bwd7, bwd8 |
| CI failures | 0 | 6 | 0 (expected) |

Why not cherry-pick v0.9.2's `skipTest` guards?

jtu.sample_product generates a deterministic Cartesian product, but the test index pytest assigns (fwd0, fwd1, ...) can shift between branches due to differences in Python version (3.12 vs 3.14), pytest version, or sample_product internals. The same parameter combos that exceed 64KB LDS land on different indices:

v0.9.2: fwd{0,5,7,8} + bwd{1,2,7,9}
v0.9.1: fwd{0,1,4,7} + bwd{7,8}

Copying v0.9.2's parameter tuples verbatim into v0.9.1 would skip some passing tests and miss some failing tests. The --deselect approach used here is keyed on the exact test IDs confirmed from two independent CI runs (ROCM-24925 build jax-ml#1391, ROCM-25777 build jax-ml#1451) and is stable for this release branch.
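
To make the index-shift argument concrete, the sketch below shows in general terms how the position of one parameter combination in a Cartesian product depends on how the product is enumerated. It does not reproduce `jtu.sample_product` or the real test parameters: the two "branches", their axes, and `indexed_cases` are hypothetical and exist only to show the effect.

```python
import itertools


def indexed_cases(axes: dict) -> dict:
    """Map each parameter combination to its position in the Cartesian product."""
    names = list(axes)
    return {
        tuple(sorted(zip(names, combo))): i
        for i, combo in enumerate(itertools.product(*axes.values()))
    }


# Two hypothetical branches: same parameter values, different axis order.
branch_a = {"block_q": [64, 128], "causal": [False, True]}
branch_b = {"causal": [False, True], "block_q": [64, 128]}

# The same combination, looked up in each enumeration.
target = tuple(sorted({"block_q": 128, "causal": False}.items()))
idx_a = indexed_cases(branch_a)[target]
idx_b = indexed_cases(branch_b)[target]
print(idx_a, idx_b)
```

The same combination gets a different index in each enumeration, which is why a test ID that is valid on one branch cannot be assumed to name the same parameters on another.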

magaonka-amd · 2026-06-16T16:14:37Z

Hi Srinivas, thanks for the PR. I think we have a better alternative for this:
jax-ml#38372
I agree we will have to skip these tests; let me cherry-pick it to the 9.1 branch.

magaonka-amd · 2026-06-16T16:22:01Z

Keep an eye on #802.

magaonka-amd · 2026-06-16T16:34:18Z

Update: you should be unblocked now on the 9.1 branch, and you can close this PR.

srinivamd added 4 commits June 9, 2026 03:47

fix: restore missing -m filter and multi-GPU block after snippet error

88fa29c

fix: restore multi-GPU pytest block corrupted by prior snippet

c21ec7a

fix: restore second_cmd_retval and else/fi for multi-GPU block

dd1c9a1

srinivamd requested review from Ruturaj4, Shahzeb-AMD and charleshofer June 9, 2026 14:35

srinivamd mentioned this pull request Jun 9, 2026

[ROCm] Deselect crashing FusedAttention bwd8 + add pytest-timeout (ROCM-25626) #798

Open

4 tasks

srinivamd wants to merge 4 commits into rocm-jaxlib-v0.9.1 from skip-fused-attention-lds-v0.9.1
