
Commit 7d2ed36

Merge dev f141f34
2 parents b95717e + f141f34 commit 7d2ed36

File tree: 29 files changed, +415 -923 lines

.github/workflows/rocm-ci.yml

Lines changed: 7 additions & 3 deletions
@@ -40,9 +40,13 @@ concurrency:
 
 jobs:
   build_and_test:
-    name: Build and Test on GPU
+    name: Build and Test on GPU (${{ matrix.runner }})
     timeout-minutes: 720
-    runs-on: linux-mi325-8
+    runs-on: ${{ matrix.runner }}
+    strategy:
+      fail-fast: false
+      matrix:
+        runner: [linux-mi325-8, linux-mi355-8]
     steps:
       - name: Checkout repository
        uses: actions/checkout@v4
@@ -422,7 +426,7 @@ jobs:
        if: always()
        uses: actions/upload-artifact@v4
        with:
-          name: logs-and-reports
+          name: logs-and-reports-${{ matrix.runner }}
          path: |
            *.log
          if-no-files-found: ignore

README.rst

Lines changed: 52 additions & 21 deletions
@@ -28,8 +28,60 @@ Feature Support Status
 Installation
 ============
 
+Install from manylinux wheels
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Starting from ROCm 7.0, we provide manylinux wheels for Transformer Engine releases at `https://repo.radeon.com/rocm/manylinux`. For example, the wheels for ROCm 7.1.1 are at `https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/`. On that page you will find four files related to Transformer Engine:
+
+* transformer_engine_rocm-*-py3-none-manylinux_2_28_x86_64.whl - the wheel for the common library; it should not be installed by itself.
+* transformer_engine-*-py3-none-any.whl - the wheel for the common TE Python package.
+* transformer_engine_jax-*.tar.gz - the source tarball for the JAX extension.
+* transformer_engine_torch-*.tar.gz - the source tarball for the PyTorch extension.
+
+Below are example commands to download and install the wheels. They install both the PyTorch and JAX extensions on a system where both frameworks are installed.
+
+.. code-block:: bash
+
+    wget https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/transformer_engine_rocm-2.2.0-py3-none-manylinux_2_28_x86_64.whl
+    wget https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/transformer_engine-2.2.0-py3-none-any.whl
+    wget https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/transformer_engine_jax-2.2.0.tar.gz
+    wget https://repo.radeon.com/rocm/manylinux/rocm-rel-7.1.1/transformer_engine_torch-2.2.0.tar.gz
+
+    pip install ./transformer_engine* --no-build-isolation
+
+Install TE from source
+^^^^^^^^^^^^^^^^^^^^^^
+
 Execute the following commands to install ROCm Transformer Engine from source on AMD GPUs:
 
+.. code-block:: bash
+
+    # Clone the TE repo and its submodules
+    git clone --recursive https://github.com/ROCm/TransformerEngine.git
+
+    cd TransformerEngine
+    export NVTE_FRAMEWORK=pytorch,jax   # optional; only pytorch and jax are supported; if unset, installed frameworks are detected automatically
+    export NVTE_ROCM_ARCH=gfx942,gfx950 # gfx942 for MI300/MI325, gfx950 for MI350
+
+    # Build platform selection (optional)
+    # Note: useful when both ROCm and CUDA platforms are present in the Docker image
+    export NVTE_USE_ROCM=1              # 1 for ROCm, 0 for CUDA; if unset, the installed platform is detected automatically, prioritizing ROCm
+
+    pip install . --no-build-isolation
+
+It is also possible to build wheels for later installation with "pip wheel .", although those wheels will not be portable to systems with
+different libraries installed. If the build still fails with the "--no-build-isolation" flag, try installing setuptools<80.0.0.
+
+Note on Switching between Installation from Source and Installation from Wheels
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+Issues can occur when installing from source on a system that previously had a wheel installation, or vice versa. It is safest to uninstall TE before
+switching between installing from source and installing from wheels. Here is an example command:
+
+.. code-block:: bash
+
+    # The package name pattern may be transformer_engine or transformer-engine depending on the setuptools version
+    pip list | grep transformer.engine | xargs pip uninstall -y
+
 Known Issue with ROCm 6.4 PyTorch Release
 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
 
@@ -57,27 +109,6 @@ Re-install PyTorch
     ./tools/amd_build/build_amd.py
     BUILD_TEST=0 python3 setup.py install
 
-Install TE
-^^^^^^^^^^^^^^^^^^
-
-.. code-block:: bash
-
-    # Clone TE repo and submodules
-    git clone --recursive https://github.com/ROCm/TransformerEngine.git
-
-    cd TransformerEngine
-    export NVTE_FRAMEWORK=pytorch,jax #optionally set framework, currently only support pytorch and jax; if not set will try to detect installed frameworks
-    export NVTE_ROCM_ARCH=gfx942 # CK fused attn only support MI200 and MI300 and fp8 features are only supported on MI300
-
-    # Build Platform Selection (optional)
-    # Note: Useful when both ROCm and CUDA platforms are present in the Docker
-    export NVTE_USE_ROCM=1 #Use 1 for ROCm, or set to 0 to use CUDA; If not set will try to detect installed platform, prioritizing ROCm
-
-    pip install --no-build-isolation .
-
-It is also possible to build wheels for later installation with "pip wheel ." although those wheels will not be portable to systems with
-different libraries installed. This build may also require "--no-build-isolation" and if the build still fails with this flag try installing setuptools<80.0.0
-
 Test
 ====
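Whichever installation path is used, a quick import check confirms that the common package and both framework extensions are in place. This is a minimal illustrative sketch, not part of the diff; it assumes the standard transformer_engine module layout and that both PyTorch and JAX are installed.

    # Minimal post-install sanity check (illustrative; not part of this commit).
    import transformer_engine          # common TE Python package
    import transformer_engine.pytorch  # PyTorch extension
    import transformer_engine.jax      # JAX extension

    print("Transformer Engine", transformer_engine.__version__)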
ci/ci_config.json

Lines changed: 1 addition & 1 deletion
@@ -1,6 +1,6 @@
 {
   "docker_images": {
-    "default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.0.2_ubuntu22.04_py3.10_pytorch_release-2.7_9015dfdf_jax_v0.6.0_fa-v2.8.0",
+    "default": "registry-sc-harbor.amd.com/framework/te-ci:rocm-7.1.1_ubuntu22.04_py3.11_pytorch_release_2.8_63e525b2_jax_0.7.1_fa-2.8.0",
     "release_v1.13": "compute-artifactory.amd.com:5000/rocm-plus-docker/framework/private/te-ci:rocm-6.4_0_ubuntu22_py310_torch25_jax0435qa_fa273",
     "release_v1.14": "compute-artifactory.amd.com:5000/rocm-plus-docker/framework/private/te-ci:rocm-6.4_0_ubuntu22_py310_torch25_jax0435qa_fa273"
   }

ci/pytorch.sh

Lines changed: 6 additions & 3 deletions
@@ -75,10 +75,13 @@ run_test_config(){
     NVTE_TEST_TRITON_AUTOTUNE=1 run_default_fa_lbl "autotune" 3 triton_kernels/test_norms.py
     run_default_fa 1 test_parallel_cross_entropy.py
     NVTE_USE_DEQUANTIZE_TRITON=1 NVTE_USE_CAST_TRANSPOSE_TRITON=1 NVTE_USE_RMSNORM_TRITON=1 NVTE_USE_LAYERNORM_TRITON=1 run_default_fa_lbl "triton" 3 test_numerics.py
-    NVTE_USE_RMSNORM_TRITON=1 run_default_fa_lbl "triton" 1 test_fusible_ops.py
+    NVTE_USE_CAST_TRANSPOSE_TRITON=1 NVTE_USE_RMSNORM_TRITON=1 run_default_fa_lbl "triton" 1 test_fusible_ops.py
     NVTE_USE_CAST_TRANSPOSE_TRITON=1 run_default_fa_lbl "triton" 1 test_float8_current_scaling_exact.py
-    NVTE_USE_ATOMIC_AMAX=1 run_default_fa 3 test_numerics.py
-    NVTE_USE_ATOMIC_AMAX=1 run_default_fa 3 test_fusible_ops.py
+    NVTE_USE_ATOMIC_AMAX=1 run_default_fa_lbl "amax" 3 test_numerics.py
+    NVTE_USE_ATOMIC_AMAX=1 run_default_fa_lbl "amax" 3 test_fusible_ops.py
+    NVTE_USE_ATOMIC_AMAX=1 NVTE_USE_CAST_TRANSPOSE_TRITON=1 run_default_fa_lbl "amax+triton" 3 test_numerics.py
+    NVTE_USE_ATOMIC_AMAX=1 NVTE_USE_CAST_TRANSPOSE_TRITON=1 run_default_fa_lbl "amax+triton" 3 test_fusible_ops.py
+    NVTE_USE_ATOMIC_AMAX=1 run_default_fa_lbl "amax" 3 triton_kernels/test_cast.py
 }
 
 run_test_config_mgpu(){
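The new configs are driven purely by environment variables, so a single case can be reproduced outside the CI wrapper. The sketch below is an assumption about how run_default_fa_lbl ultimately invokes pytest (its exact arguments are internal to the CI scripts); only the environment-variable names come from the diff.

    import os
    import subprocess

    # Rough local reproduction of the new "amax+triton" config: enable atomic
    # amax and the Triton cast-transpose path, then run the targeted test file.
    env = dict(os.environ)
    env["NVTE_USE_ATOMIC_AMAX"] = "1"
    env["NVTE_USE_CAST_TRANSPOSE_TRITON"] = "1"

    subprocess.run(
        ["python", "-m", "pytest", "-v", "test_numerics.py"],
        cwd="tests/pytorch",  # assumed location of the PyTorch test suite
        env=env,
        check=True,
    )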

tests/cpp/operator/test_cublaslt_gemm.cu

Lines changed: 16 additions & 4 deletions
@@ -29,8 +29,8 @@ std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes = {
 };
 
 std::vector<std::tuple<size_t, size_t, size_t>> test_case_sizes_mxfp8 = {
-  {2304, 768, 4096},
-};
+  {768, 3072, 4096},
+};
 
 // A, B, Bias, Gelu, D
 // Bias type choose as bf16 in use_fp8, D_type otherwise
@@ -228,6 +228,14 @@ void performTest(const TestParams& params) {
 
 #ifdef __HIP_PLATFORM_AMD__
 
+#if HIP_VERSION < 70200000
+  if (prop.major == 9 && prop.minor == 5 &&
+      params.transa && !params.transb &&
+      params.m == 2304 && params.k == 768 && params.n == 4096) {
+    GTEST_SKIP() << "Skip TN 2304x768x4096 on gfx950 for ROCm < 7.2";
+  }
+#endif
+
  // Enable FP8 GEMM + GELU fusion tests only on MI300 (gfx942) with ROCm > 7.0.
  // hipBLASLt currently supports this config only
  bool fp8_gelu_fusion_config = false;
@@ -287,11 +295,15 @@ void performTest(const TestParams& params) {
   }
   if (prop.major == 9 && prop.minor == 4) //gfx942 specific hipblasLt limitations
   {
+#if HIP_VERSION < 70100000
     if (params.use_gelu && dtype == DType::kBFloat16 && !params.transa) {
       GTEST_SKIP() << "BF16 GEMM with GELU is not supported in current config";
     }
-    if (has_fp8 && params.use_bias && dtype == DType::kFloat8E4M3 && !fp8_gelu_fusion_config) {
-      GTEST_SKIP() << "FP8 GEMM with bias and FP8 output is not supported in current config";
+#endif
+    if constexpr (std::is_same<D_Type, fp8>::value && std::is_same<Bias_Type, bf16>::value) {
+      if (params.use_bias && !fp8_gelu_fusion_config) {
+        GTEST_SKIP() << "GEMM with BF16 bias and FP8 output is not supported in current config";
+      }
     }
   }
 #endif

tests/cpp/test_common.h

Lines changed: 4 additions & 6 deletions
@@ -344,20 +344,18 @@ struct Numeric_Traits<fp8e4m3> {
   static constexpr double minSubnorm = 1.0 / static_cast<double>(1 << 9);   // std::pow(2.0, -9.0);
   static constexpr double maxSubnorm = 0.875 / static_cast<double>(1 << 6); // std::pow(2.0, -6.0);
   static constexpr double minNorm = 1.0 / static_cast<double>(1 << 6);      // std::pow(2.0, -6.0);
-#ifndef USE_ROCM
+#ifndef USE_ROCM
   static constexpr double maxNorm = 448.0;
-#elif HIP_VERSION >= 60300000
+#else
   static const double maxNorm;
-#else
-  static constexpr double maxNorm = 240.0;
-#endif //USE_ROCM
+#endif //USE_ROCM
   static const double artifInf;  // artificial Infinity
   static constexpr int maxBiasedExponentAsFP32 = 8 + FP32_EXPONENT_BIAS;
   static constexpr int maxUnbiasedExponentAsFP32 = 8;
   static constexpr int maxExpNorm = 1 << maxUnbiasedExponentAsFP32;
 };
 
-#if defined(USE_ROCM) && (HIP_VERSION >= 60300000)
+#ifdef USE_ROCM
 inline const double Numeric_Traits<fp8e4m3>::maxNorm = te_fp8_fnuz() ? 240.0 : 448.0;
 #endif
 
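The runtime selection between 240.0 and 448.0 reflects the two FP8 E4M3 encodings: the FNUZ variant used on gfx942 and the OCP variant used on gfx950 and CUDA devices. As an illustration only (not part of the diff, and assuming a PyTorch build recent enough to expose both FP8 dtypes), the two maxima can be checked directly:

    import torch

    # OCP E4M3 ("e4m3fn"): finite max 448.0 -- the non-ROCm / gfx950 value.
    print(torch.finfo(torch.float8_e4m3fn).max)    # 448.0
    # FNUZ E4M3 ("e4m3fnuz"): finite max 240.0 -- the gfx942 value that
    # te_fp8_fnuz() selects at runtime in the test header.
    print(torch.finfo(torch.float8_e4m3fnuz).max)  # 240.0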
tests/pytorch/test_fused_optimizer.py

Lines changed: 7 additions & 0 deletions
@@ -1,3 +1,5 @@
+# This file was modified for portability to AMDGPU
+# Copyright (c) 2025, Advanced Micro Devices, Inc. All rights reserved.
 # Copyright (c) 2022-2025, NVIDIA CORPORATION & AFFILIATES. All rights reserved.
 #
 # See LICENSE for license information.
@@ -8,6 +10,7 @@
 import pytest
 import torch
 from torch import nn
+from torch.utils.cpp_extension import IS_HIP_EXTENSION
 from torch.testing._internal.common_device_type import largeTensorTest
 import transformer_engine.pytorch as te
 from transformer_engine.common.recipe import DelayedScaling
@@ -16,6 +19,7 @@
 from transformer_engine.pytorch.utils import is_bf16_compatible
 from transformer_engine.pytorch.fp8 import FP8GlobalStateManager
 from transformer_engine.pytorch.utils import gpu_autocast_ctx
+from transformer_engine.pytorch.utils import get_device_compute_capability
 
 # Check if FP8 is supported
 fp8_available, reason_for_no_fp8 = FP8GlobalStateManager.is_fp8_available()
@@ -370,6 +374,7 @@ def test_bf16_exp_avg(self):
     @pytest.mark.skipif(not is_bf16_compatible(), reason="bf16 if not supported")
     @pytest.mark.skipif(not fp8_available, reason=reason_for_no_fp8)
     def test_fp8_exp_avg(self):
+        model_tol = 3e-2 if IS_HIP_EXTENSION and get_device_compute_capability() == (9, 5) else None
         self.gen_precision_aware_test(
             use_fp8_params=False,
             param_dtype=torch.bfloat16,
@@ -380,6 +385,8 @@ def test_fp8_exp_avg(self):
             exp_avg_sq_dtype=torch.float32,
             master_rtol=1e-2,
             master_atol=1e-2,
+            model_rtol=model_tol,
+            model_atol=model_tol,
         )
 
     @pytest.mark.skipif(not is_bf16_compatible(), reason="bf16 if not supported")
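The same gfx950 gating idiom recurs across the PyTorch test changes in this commit: IS_HIP_EXTENSION distinguishes ROCm builds, and TE's get_device_compute_capability() reports (9, 5) for gfx950. A minimal sketch of the pattern; the helper name below is illustrative, not part of the diff.

    from torch.utils.cpp_extension import IS_HIP_EXTENSION
    from transformer_engine.pytorch.utils import get_device_compute_capability

    def on_gfx950() -> bool:
        # On CUDA builds IS_HIP_EXTENSION is False and the check short-circuits,
        # so the capability query only runs on ROCm.
        return IS_HIP_EXTENSION and get_device_compute_capability() == (9, 5)

    # Loosen tolerances only where gfx950 numerics differ; None keeps defaults.
    model_tol = 3e-2 if on_gfx950() else None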

tests/pytorch/test_fusible_ops.py

Lines changed: 12 additions & 0 deletions
@@ -38,7 +38,9 @@
 )
 from transformer_engine.pytorch.tensor.mxfp8_tensor import MXFP8Tensor, MXFP8Quantizer
 from transformer_engine.pytorch.utils import is_bf16_compatible
+from transformer_engine.pytorch.utils import get_device_compute_capability
 import transformer_engine_torch as tex
+from torch.utils.cpp_extension import IS_HIP_EXTENSION
 
 # Import utility functions
 from utils import dtype_tols, make_recipe, reset_rng_states
@@ -971,6 +973,16 @@ def test_basic_linear_quantized(
         quantized_grad_input: bool,
     ) -> None:
         """GEMM with FP8 inputs and outputs"""
+        if IS_HIP_EXTENSION and get_device_compute_capability() == (9, 5):
+            if (
+                quantization
+                and quantization.startswith("fp8")
+                and quantized_compute
+                and (quantized_grad_input or quantized_output)
+            ):
+                pytest.skip(
+                    "hipBLASLt does not provide suitable algorithms on gfx950 for this config."
+                )
         if quantization is None:
             pytest.skip("Skipping case without quantization")
         self._test_basic_linear(

tests/pytorch/test_numerics.py

Lines changed: 22 additions & 4 deletions
@@ -768,6 +768,13 @@ def test_gpt_full_activation_recompute(
 ):
     if fp8_model_params and NVTE_TEST_NVINSPECT_ENABLED:
         pytest.skip("FP8 parameters are not supported in debug mode.")
+    if IS_HIP_EXTENSION and get_device_compute_capability() == (9, 5):
+        if (dtype == torch.bfloat16
+            and not fp8
+            and not use_reentrant
+            and recipe.float8_per_tensor_scaling()
+        ):
+            pytest.skip("hipBLASLt does not provide suitable algorithms on GFX950 for this config.")
 
     config = model_configs[model]
     torch.compiler.reset()  # avoid cache size limit overflow
@@ -2829,10 +2836,21 @@ def test_transformer_layer_hidden_states_format(dtype, bs, model):
         max_seqlen_kv=config.max_seqlen_kv,
     )
 
-    torch.testing.assert_close(
-        y_bshd,
-        y_thd.reshape(bs, config.max_seqlen_q, config.hidden_size).contiguous(),
-    )
+    if IS_HIP_EXTENSION and get_device_compute_capability() == (9, 5):
+        tols_thd = dtype_tols(dtype)
+        # On gfx950 the THD results differ slightly,
+        # which lowers the precision of the final result
+        tols_thd["atol"] = 2e-3
+        torch.testing.assert_close(
+            y_bshd,
+            y_thd.reshape(bs, config.max_seqlen_q, config.hidden_size).contiguous(),
+            **tols_thd,
+        )
+    else:
+        torch.testing.assert_close(
+            y_bshd,
+            y_thd.reshape(bs, config.max_seqlen_q, config.hidden_size).contiguous(),
+        )
 
 
 @pytest.mark.parametrize(
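The THD change keeps the dtype-derived relative tolerance and widens only the absolute tolerance. A small self-contained sketch of that comparison style, using stand-in tensors rather than the test's real outputs (1.6e-2 is torch.testing.assert_close's default bf16 rtol):

    import torch

    # Stand-ins for the BSHD output and the THD output reshaped back to BSHD.
    torch.manual_seed(0)
    y_bshd = torch.randn(2, 128, 256, dtype=torch.bfloat16)
    y_thd = (y_bshd.float() + 1e-3).to(torch.bfloat16)  # small, bounded offset

    # Keep the bf16 default rtol but widen atol, mirroring the gfx950 THD path.
    torch.testing.assert_close(y_thd, y_bshd, rtol=1.6e-2, atol=2e-3)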
