CI: GitHub Action migration from Jenkins CI by leo-amd · Pull Request #322 · ROCm/TransformerEngine

leo-amd · 2025-09-26T12:12:47Z

This PR ports the existing Jenkins CI pipeline to GitHub Actions, aiming for functional parity while consolidating CI on GitHub.

Approach

prepare job (build-only runner): Pulls the Docker image from the internal registry and saves it as a chunked .tar artifact to work around network restrictions on the GPU runners. This job will move to a GPU runner once we have some in the internal network.
build_and_test job (GPU runner): Downloads the artifact, loads the image into the node's Docker cache, and then starts a container. All subsequent build (pip wheel + pip install) and test steps run inside this container using docker exec, closely mirroring the single-agent Jenkins flow.
Test execution uses the same scripts and structure (including parallel sGPU tests) as the original Jenkinsfile.
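
The hand-off between the two jobs can be sketched in shell as follows. This is a minimal sketch under stated assumptions, not the workflow's actual steps: the function names, the `image.tar.part-` chunk prefix, and the default 1G chunk size are hypothetical and chosen only for illustration. `docker save`, `docker load`, `split`, and `cat` are the only real commands used.

```shell
# Sketch only: function names, the chunk prefix, and the chunk size are
# illustrative assumptions, not taken from the PR's workflow files.

# prepare job: stream the image out of Docker and cut it into fixed-size chunks.
save_image_chunks() {
  local image="$1" out_dir="$2" chunk_size="${3:-1G}"
  mkdir -p "$out_dir"
  # docker save without -o writes the tar stream to stdout; split reads it from "-".
  docker save "$image" | split -b "$chunk_size" - "$out_dir/image.tar.part-"
}

# build_and_test job: concatenate the chunks in order and load them into
# the node's Docker cache.
load_image_chunks() {
  local in_dir="$1"
  # split names chunks with suffixes aa, ab, ..., so glob order is chunk order.
  cat "$in_dir"/image.tar.part-* | docker load
}
```

In this sketch the chunk directory would be uploaded as the artifact by the prepare job and downloaded by the build_and_test job before `load_image_chunks` runs.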

This migration focuses on directly replicating the existing CI logic in GitHub Actions.

.github/workflows/rocm-ci.yml

.github/workflows/transformer-engine-ci.yml

ipanfilo · 2025-10-21T18:16:33Z

.github/workflows/rocm-ci.yml

+
+      - name: Cleanup container
+        if: always()
+        run: docker rm -f te-runner || true


Shouldn't the container be run with --rm instead?

Usually, yes, but this workflow relies on starting the container and then running commands inside it with docker exec. Having an always() step removes the container even if the job fails.

I think docker exec works only with a running container. The original docker run command starts the container with the default entry point (bash?), and it stays running until the entry point process exits or docker stop is called. In your case you just remove it without stopping it first. So yes, if it is run with --rm, it will require an explicit docker stop call.

BTW, if the container is started with something like sh -c "sleep XXX" and the actual tests run via docker exec, CI won't run longer than XXX seconds, which helps mitigate test hangs.

This is an interesting point: essentially, we'd create a timeout for the container. There is a GHA-native way, timeout-minutes on a step, which I am planning to implement, but we can use this approach in the future when we need multiple timeouts inside one step.
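
The container lifecycle discussed in this thread can be sketched in shell as below. It is a hedged illustration, not the PR's workflow: the helper function names and the `max_seconds` parameter are hypothetical, while `te-runner` is the container name from the cleanup step above. The container's main process is a bounded `sleep`, so the container, and any `docker exec` command still running inside it, ends when the sleep expires; `--rm` then removes the container once it has stopped.

```shell
# Sketch only: helper names and the max_seconds parameter are hypothetical.

# Start a detached container whose main process is a bounded sleep.
# When the sleep ends the container stops, and --rm removes it.
start_container() {
  local image="$1" name="$2" max_seconds="$3"
  docker run -d --rm --name "$name" "$image" sh -c "sleep $max_seconds"
}

# Run a build or test command inside the already-running container.
run_in_container() {
  local name="$1"
  shift
  docker exec "$name" "$@"
}

# With --rm an explicit stop is enough: the container is removed once stopped.
# "|| true" keeps cleanup from failing if the container is already gone.
stop_container() {
  local name="$1"
  docker stop "$name" || true
}
```

A job would call `start_container` once, `run_in_container te-runner ...` for each build or test step, and `stop_container te-runner` from a step guarded by `if: always()`.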

.github/workflows/rocm-ci.yml

* Implement prebuilt AITER binary download system via NVTE_AITER_PREBUILT_BASE_URL
* Add caching based on ROCm version and AITER commit SHA
* Implement SHA256 verification for downloaded prebuilt files
* Add automatic fallback to source build when prebuilt files unavailable
* Add cache validation to detect and clean outdated/invalid builds
* Implement upload file generation script for creating prebuilt files
* Addressed Reviews
* Fix git safe directory issues for AITER submodule in Docker environments

Update 3rdparty/aiter commit which adds FMHAv3 backward support for MLA head_dim_qk=192, head_dim_v=128 configuration.
- Add MLA test configs mla_4_0 and mla_4_1 for HD192_HD128
- Fix is_training logic to enable backward pass testing for HD192_HD128
* Added Jax fused_attn tests for MLA HD192+HD128 for bhsd
* Addressed reviews

* Initial commit
* Removed Print statements, added keep_fp8_transpose cache integration with fsdp2
* Added use_fsdp flag to Linear module, added profile code, added test code, added all reduce for amax
* Fixed unit test
* Removing all reduce code for amax since by default TE does all reduce when torch.distributed is initialized.
* reverting case where out is already present
* Added unit test with regualr sgpu training
* Modified unit test to compare FSDP2 with DDP
* bug fixes
* Code cleaning up
* Initial commit to add MXFP8
* Added fp8 current scaling.
* Added MXFP8, Modified unit test to run based on recipes
* Extended use_fsdp to layernorm linear and layernorm mlp
* Moved amax reduce from forward to backward for fsdp2
* Added automatic detection of use fsdp from base module
* Use SKIP_FP8_REDUCTION_FOR_FSDP2 in backward for check if need to do forward reduce
* Added memory profile code, added a check before setting SKIP_FP8_REDUCTION_FOR_FSDP2
* Fix for fused optimizer, changed _elem to _data, code clean up
* Fixed layernorm mlp
* Code cleanup and added test to pytorch.sh
* Removed whitespaces
* Fixed comments and license
* Added guards
* Added reduce for forward in cuda graph backward, added code to remove test artifacts, reverted upstream test file

---------
Co-authored-by: sudhu2k <sugovind@amd.com>

leo-amd requested review from ipanfilo, wangye805 and wenchenvincent as code owners September 26, 2025 12:12

ipanfilo requested changes Oct 21, 2025

leo-amd added 26 commits October 23, 2025 16:24

First drawt of the workflow

87d437e

Update transformer-engine-ci.yml

e5c86af

Fixes

335b09f

Update transformer-engine-ci.yml

8136739

Update transformer-engine-ci.yml

440a781

toLower

578ddce

CI fixes

99863b1

Added build-only runners

b984353

Debug docker credentials

63d5046

Debug docker credentials

370ff19

Node js is missing on the node

4981dc1

Remove devcontainer

c96d76f

More fixes

82ea4e5

Typo

bae63ab

Permissions inside the container

29955e9

sudo

227ca35

Update transformer-engine-ci.yml

9fc2342

Update transformer-engine-ci.yml

6d3a025

Update transformer-engine-ci.yml

19a78cd

Update transformer-engine-ci.yml

93a42ee

Update transformer-engine-ci.yml

40f8d08

Update transformer-engine-ci.yml

d050431

Update transformer-engine-ci.yml

5aeaf38

Update transformer-engine-ci.yml

53633e5

Update transformer-engine-ci.yml

8989cd5

Update transformer-engine-ci.yml

06c3c30

ipanfilo approved these changes Nov 1, 2025

leo-amd and others added 27 commits November 3, 2025 17:18

Run CI

22815fa

Runner update

2eb4714

Update runners

dabb812

Debug

d8beee1

Debug

5a7c74d

Remove HIP macros around std:: math functions (#343)

83735d8

* Remove HIP macros around std:: math functions
* hipify-torch commit

Run CI

a007240

Update runners label

b4da0d5

std::max type mismatch hotfix (#361)

96d48ce

CI: allow numpy 2.0 (#366)

6b8a47d

Relax tolerance to pass 29x29x17389NT GEMM on MI350 (#365)

9a987f8

FIX Occasional import error when only building for a single framework (…

90c04bc

…#355)
* Expanded catch categories
* Improved errors
* Converted assertion error to FileNotFound error for clarity

Use .info/version for ROCm verison (#368)

e9c7361

Enable aligned vectorized memory ops for MXFP8 cast (#342)

87fece2

* Enable aligned vectorized memory ops for MXFP8 cast
* Optimized vector sizes and alignment conditions

[ROCm] align the softmax aux shape with NVTE upstream (#371)

32e2d1d

Test CI plus small changes

3a6c5b6

Run CI

ba31fd8

Update runners label

956fa26

Update runners

cd612d7

[TE] Implement Triton current scaling (#341)

6bbd03c

Update benchmark script to support fwd_v3 and a16 (#373)

c95f9db

* Restructured benchmark script, added new test
* Added tflops calculation
* Added casual factor in tflops calculation
* Added fwd_v3 argument to ci script

Enable AITER V3 kernels by default (#372)

9eaaf4c

Add new logic from Jenkins and continue-on-error: true for tests

031d73b

Merge branch 'leo/migrate-ci-to-gha' of https://github.com/ROCm/Trans…

94174a4

…formerEngine into leo/migrate-ci-to-gha

leo-amd closed this Nov 19, 2025

leo-amd deleted the leo/migrate-ci-to-gha branch November 19, 2025 15:29

leo-amd wants to merge 120 commits into dev from leo/migrate-ci-to-gha

ipanfilo Oct 21, 2025

leo-amd Oct 22, 2025

ipanfilo Oct 22, 2025

ipanfilo Oct 26, 2025

leo-amd Oct 29, 2025

9 participants
