
Conversation

@VeeraRajasekhar (Contributor)

Description

  • Enable gfx950 (MI350) CI by addressing the specific failures we saw: FP8 GEMM coverage gaps in hipBLASLt, RMSNorm misalignment on odd strides (e.g., N=17389), fused optimizer tolerances, and unsupported quantized/activation-recompute test cases on ROCm.
  • Prevent the JAX GEMM/grouped-GEMM FFI from being marked cudaGraph-safe on ROCm to avoid graph-capture hangs; keep gfx950 FP8 layout support disabled until hipBLASLt coverage is validated.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Disable cudaGraph registration for the JAX gemm and grouped_gemm FFI on ROCm to stop graph-capture hangs on gfx950 (transformer_engine/jax/csrc/extensions/gemm.cpp).

  • Keep is_fp8_gemm_with_all_layouts_supported false on gfx950 until hipBLASLt FP8 layout coverage is validated (transformer_engine/jax/quantize/device_utils.py); a sketch of this gating follows the list.

  • Fix the RMSNorm Triton kernel for misaligned row strides by applying 16B alignment hints only when the pointers/strides are actually aligned; this resolves the test_norms dgamma mismatches and the test_transformer_layer_hidden_states_format numerics issues (see the alignment sketch after this list). Also relax fused-optimizer FP8 tolerances on MI350 (transformer_engine/pytorch/triton_kernels/rmsnorm.py, tests/pytorch/test_numerics.py, tests/pytorch/test_fused_optimizer.py).

  • Skip unsupported FP8 quantized-linear combinations on gfx950 where hipBLASLt lacks algorithms (tests/pytorch/test_fusible_ops.py); the skip pattern is shown in the sketch after this list.

  • Add a gfx950 detection helper and skip test_gpt_full_activation_recompute on MI350 configs that hipBLASLt cannot serve (transformer_engine/pytorch/utils.py, tests/pytorch/test_numerics.py).
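As a rough illustration of the gfx950 gating described above (a minimal sketch, not the code as merged; all helper names except is_fp8_gemm_with_all_layouts_supported are hypothetical), a PyTorch-side check might key off the GCN arch string that ROCm builds of PyTorch expose:

import functools

import torch


@functools.lru_cache(maxsize=None)
def _is_gfx950() -> bool:
    # Hypothetical helper: detect MI350-class (gfx950) devices. ROCm builds
    # of PyTorch report the arch as e.g. "gfx950:sramecc+:xnack-", so only
    # the base name before the first ":" is compared.
    if torch.version.hip is None or not torch.cuda.is_available():
        return False
    arch = torch.cuda.get_device_properties(0).gcnArchName
    return arch.split(":")[0] == "gfx950"


def is_fp8_gemm_with_all_layouts_supported() -> bool:
    # Report all-layout FP8 GEMM as unsupported on gfx950 until
    # hipBLASLt layout coverage is validated.
    return not _is_gfx950()


def maybe_skip_fp8_on_gfx950(quantized_compute: bool) -> None:
    # Test-side skip pattern for FP8 cases hipBLASLt has no algorithm for.
    import pytest
    if quantized_compute and _is_gfx950():
        pytest.skip("hipBLASLt lacks FP8 algorithms for this case on gfx950")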
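The RMSNorm fix hinges on a host-side alignment check before the Triton kernel emits 16-byte vectorization hints; a minimal sketch of that condition (helper name hypothetical):

import torch


def can_use_16b_alignment_hints(x: torch.Tensor, row_stride: int) -> bool:
    # 16B hints are only safe when both the base pointer and the row stride,
    # converted to bytes, are multiples of 16. For N=17389 rows of 2-byte
    # elements the stride is 34778 bytes (34778 % 16 == 10), so the kernel
    # must fall back to unhinted loads rather than assume alignment.
    stride_bytes = row_stride * x.element_size()
    return x.data_ptr() % 16 == 0 and stride_bytes % 16 == 0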

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes


#ifdef __HIP_PLATFORM_AMD__

// Temporary skip: gfx950 TN kernels for (M,K,N)=(2304,768,4096) are unstable.
Collaborator:

What does unstable mean?

Contributor Author:

6192 - OperatorTest/GEMMTestSuite.Testfp8xfp8xbf16xbf16xbf16/2304x768x4096x0x0xTNxM  # GetParam() = ((2304, 768, 4096), false, false, (true, false), 1) (Failed)
6768 - OperatorTest/GEMMTestSuite.Testfp8xbf8xbf16xbf16xfp16/2304x768x4096x0x0xTNxM  # GetParam() = ((2304, 768, 4096), false, false, (true, false), 1) (Failed)
7344 - OperatorTest/GEMMTestSuite.Testbf8xfp8xbf16xbf16xfp32/2304x768x4096x0x0xTNxM  # GetParam() = ((2304, 768, 4096), false, false, (true, false), 1) (Failed)
7488 - OperatorTest/GEMMTestSuite.Testbf8xfp8xbf16xbf16xfp16/2304x768x4096x0x0xTNxM  # GetParam() = ((2304, 768, 4096), false, false, (true, false), 1) (Failed)

These test cases fail at random, so we decided to skip them for this MI350 bring-up. When I tested on ROCm 7.2 there was no issue.

Collaborator:

Guard it with #if HIP_VERSION < 70200000 then; the comments about the temporary disable/re-enable and the mention of ROCm 7.2 can then be removed.


// Re-enable after ROCm 7.2 once hipBLASLt fixes land.
if (prop.major == 9 && prop.minor == 5 &&
    params.transa && !params.transb &&
    params.m == 2304 && params.k == 768 && params.n == 4096) {
Collaborator:

There is only one size for DqTest. Instead of skipping the test, just use a different size in test_case_sizes_mxfp8, for example 768, 3072, 4096.
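A sketch of that suggestion (test_case_sizes_mxfp8 is the name from the thread; the surrounding test file layout is assumed):

# Use a size hipBLASLt handles on gfx950 instead of skipping DqTest:
test_case_sizes_mxfp8 = [
    (768, 3072, 4096),  # replaces the gfx950-unstable (2304, 768, 4096) case
]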

Contributor Author:

Updated.

@VeeraRajasekhar (Contributor Author)

Rebased onto dev.

@VeeraRajasekhar (Contributor Author)

Test report for MI355 with Level=3:

  • No issues reported in the single-GPU tests.
  • PyTorch multi-GPU tests had no issues.
  • The JAX test [auto] test_distributed_fused_attn.py hit its timeout due to a known hang; the other JAX tests passed.

@VeeraRajasekhar VeeraRajasekhar merged commit f141f34 into dev Jan 9, 2026
2 of 4 checks passed