CI: GitHub Action migration from Jenkins CI #322
.github/workflows/rocm-ci.yml
    - name: Cleanup container
      if: always()
      run: docker rm -f te-runner || true
Shouldn't the container be run with --rm instead?
Usually, yes, but this workflow starts the container and then runs commands inside it with docker exec. The always() step removes the container even if the job fails.
I think docker exec works only with a running container. The original docker run command starts the container with its default entry point (bash?), and it stays running until the entry point process exits or docker stop is called. In your case you just remove it without stopping it. So yes, running it with --rm would require an explicit docker stop call.
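The lifecycle discussed in this thread can be sketched as a workflow fragment. This is an illustration, not the PR's actual workflow: the `IMAGE` value, the `sleep infinity` keep-alive command, and the `run_tests.sh` script are assumptions, and only the container name `te-runner` and the cleanup step come from the diff.

```yaml
# Fragment of a job definition. IMAGE is a hypothetical placeholder.
env:
  IMAGE: registry.example.com/te-ci:latest

steps:
  - name: Start container
    # -d detaches. The long-running command keeps the container alive so
    # that later steps can attach with docker exec. `sleep infinity`
    # assumes GNU coreutils inside the image.
    run: |
      docker run -d --name te-runner "$IMAGE" sleep infinity

  - name: Run tests
    # run_tests.sh is a hypothetical script inside the container.
    run: |
      docker exec te-runner bash -c "./run_tests.sh"

  - name: Cleanup container
    if: always()   # runs even when an earlier step failed
    # `docker rm -f` kills a running container and removes it in one call,
    # so no separate `docker stop` is needed. `|| true` ignores the error
    # raised when the container was never created.
    run: |
      docker rm -f te-runner || true
```

If the container were started with `--rm` instead, Docker would remove it automatically once it stops, and the cleanup step would reduce to `docker stop te-runner || true`.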
BTW, if you run the container with something like sh -c "sleep XXX" and run the actual tests via docker exec, CI won't run longer than XXX seconds, which helps mitigate test hangs.
This is an interesting point: essentially, we'd create a timeout for the container. There is a GHA-native way, though, using timeout-minutes on a step, which I am planning to implement. We can use the container-level approach in the future if we need multiple timeouts inside one step.
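The two timeout mechanisms mentioned here could look roughly like this. It is a sketch under assumptions: the 3600-second and 60-minute limits, the `IMAGE` variable, and `run_tests.sh` are placeholders that do not come from the PR.

```yaml
steps:
  # Option A (container-level): the container's main process is a bounded
  # sleep. When it exits, the container stops and processes started in it
  # via docker exec are terminated with it.
  - name: Start container with a lifetime cap
    run: |
      docker run -d --name te-runner "$IMAGE" sh -c "sleep 3600"

  # Option B (GHA-native): the runner cancels this step after 60 minutes.
  - name: Run tests
    timeout-minutes: 60
    run: |
      docker exec te-runner bash -c "./run_tests.sh"
```

One caveat with option B: cancelling the step stops the `docker exec` client on the runner, but the process inside the container may keep running until the container itself is removed, so an `if: always()` cleanup step is still useful.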
* Remove HIP macros around std:: math functions
* hipify-torch commit
* Implement prebuilt AITER binary download system via NVTE_AITER_PREBUILT_BASE_URL
* Add caching based on ROCm version and AITER commit SHA
* Implement SHA256 verification for downloaded prebuilt files
* Add automatic fallback to source build when prebuilt files unavailable
* Add cache validation to detect and clean outdated/invalid builds
* Implement upload file generation script for creating prebuilt files
* Addressed Reviews
* Fix git safe directory issues for AITER submodule in Docker environments
…#355)
* Expanded catch categories
* Improved errors
* Converted assertion error to FileNotFound error for clarity
Update 3rdparty/aiter commit which adds FMHAv3 backward support for MLA head_dim_qk=192, head_dim_v=128 configuration.
- Add MLA test configs mla_4_0 and mla_4_1 for HD192_HD128
- Fix is_training logic to enable backward pass testing for HD192_HD128
* Added Jax fused_attn tests for MLA HD192+HD128 for bhsd
* Addressed reviews
* Enable aligned vectorized memory ops for MXFP8 cast
* Optimized vector sizes and alignment conditions
* Initial commit
* Removed Print statements, added keep_fp8_transpose cache integration with fsdp2
* Added use_fsdp flag to Linear module, added profile code, added test code, added all reduce for amax
* Fixed unit test
* Removing all reduce code for amax since by default TE does all reduce when torch.distributed is initialized.
* reverting case where out is already present
* Added unit test with regular sgpu training
* Modified unit test to compare FSDP2 with DDP
* bug fixes
* Code cleaning up
* Initial commit to add MXFP8
* Added fp8 current scaling.
* Added MXFP8, Modified unit test to run based on recipes
* Extended use_fsdp to layernorm linear and layernorm mlp
* Moved amax reduce from forward to backward for fsdp2
* Added automatic detection of use fsdp from base module
* Use SKIP_FP8_REDUCTION_FOR_FSDP2 in backward for check if need to do forward reduce
* Added memory profile code, added a check before setting SKIP_FP8_REDUCTION_FOR_FSDP2
* Fix for fused optimizer, changed _elem to _data, code clean up
* Fixed layernorm mlp
* Code cleanup and added test to pytorch.sh
* Removed whitespaces
* Fixed comments and license
* Added guards
* Added reduce for forward in cuda graph backward, added code to remove test artifacts, reverted upstream test file
---------
Co-authored-by: sudhu2k <sugovind@amd.com>
* Restructured benchmark script, added new test
* Added tflops calculation
* Added causal factor in tflops calculation
* Added fwd_v3 argument to ci script
…formerEngine into leo/migrate-ci-to-gha
This PR ports the existing Jenkins CI pipeline to GitHub Actions, aiming for functional parity while consolidating CI on GitHub.
Approach
* prepare job (build-only runner): Pulls the Docker image from the internal registry and saves it as a chunked .tar artifact, because network restrictions prevent GPU runners from pulling it directly. This will be moved to a GPU runner once some are available in the internal network.
* build_and_test job (GPU runner): Downloads the artifact, loads the image into the node's Docker cache, and then starts a container. All subsequent build (pip wheel + pip install) and test steps run inside this container using docker exec, closely mirroring the single-agent Jenkins flow.
* … sGPU tests) as the original Jenkinsfile.

This migration focuses on directly replicating the existing CI logic in GitHub Actions.
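The hand-off between the two jobs described above might be sketched as follows. Only the job names `prepare` and `build_and_test` come from the PR description. The runner labels, the `IMAGE` value, the artifact name, the 1 GiB chunk size, and the use of `split` for chunking are assumptions made for illustration.

```yaml
env:
  IMAGE: registry.example.com/te-ci:latest   # hypothetical image name

jobs:
  prepare:
    runs-on: [self-hosted, build-only]       # hypothetical runner labels
    steps:
      - name: Pull image and save it in chunks
        run: |
          docker pull "$IMAGE"
          docker save -o image.tar "$IMAGE"
          # Split into 1 GiB parts named image.tar.part-aa, -ab, ...
          split -b 1G image.tar image.tar.part-
          rm image.tar
      - uses: actions/upload-artifact@v4
        with:
          name: docker-image
          path: image.tar.part-*

  build_and_test:
    needs: prepare
    runs-on: [self-hosted, gpu]              # hypothetical runner labels
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: docker-image
      - name: Load image into the local Docker cache
        # The shell expands the glob in sorted order, so the parts are
        # concatenated in the order split produced them.
        run: |
          cat image.tar.part-* | docker load
```

After `docker load`, the GPU job can start the container and run its build and test steps through `docker exec` without contacting the registry.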