@ko3n1g ko3n1g commented Aug 29, 2025

Description

This PR adds CI workflows that build PyTorch wheels and attach them to the TE release page. The workflows are triggered on release creation; manual runs are also possible via a dispatch event.

By default, a hardcoded torch version and the latest 5 NGC PyT versions are chosen.

It also fixes an issue where the existing build workflows ran out of storage.

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

@timmoon10 timmoon10 self-requested a review October 6, 2025 18:02
@timmoon10 timmoon10 self-requested a review October 10, 2025 23:29

ko3n1g commented Oct 13, 2025

Thanks for your review, @timmoon10. The suggestions all make sense to me. I'm still syncing with @ksivaman on some details and hope we can hash it out by the end of the week.


greptile-apps bot commented Nov 13, 2025

Greptile Overview

Greptile Summary

This PR introduces automated CI infrastructure to build and distribute PyTorch wheels for TransformerEngine releases. The changes implement a new GitHub Actions workflow that can be triggered on release creation or manually via dispatch events. The solution builds wheels against multiple PyTorch/CUDA/Python version combinations using both standard Docker containers and NGC (NVIDIA GPU Cloud) images to cover the last 5 months of available versions.

The implementation includes a new composite GitHub Action (build-pytorch-wheel) that handles containerized wheel building with disk space optimization, a shell script to dynamically discover available NGC images, and updates to the existing build workflow to address storage limitations. The PyTorch setup script was modified to support NGC container environments and drop CUDA 11 support in favor of CUDA 12/13 only. Additionally, the changes include standard Python development practices like adding .venv to .gitignore.
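
As a rough illustration of the "last 5 months" selection described above, the sketch below lists candidate monthly tags counting back from a given month. It is not the PR's check_for_ngc_images.sh: the function name `recent_monthly_tags` and the `YY.MM` tag layout are assumptions made for this example, and a real script would still have to check which of these tags actually exist in the registry.

```shell
#!/bin/sh
# Hypothetical helper: print the N most recent monthly tags (YY.MM), newest first.
# The reference year and month are passed in explicitly, so no GNU-specific
# `date -d` arithmetic is needed (which also keeps it usable on macOS).
recent_monthly_tags() (
    year=$1
    month=${2#0}   # drop a leading zero so "08" is not read as octal
    count=$3
    while [ "$count" -gt 0 ]; do
        printf '%02d.%02d\n' "$((year % 100))" "$month"
        month=$((month - 1))
        if [ "$month" -eq 0 ]; then
            # Wrap around to December of the previous year.
            month=12
            year=$((year - 1))
        fi
        count=$((count - 1))
    done
)

# Example: five candidate tags counting back from November 2025.
recent_monthly_tags 2025 11 5
```

Passing the reference date in as arguments, rather than calling `date` inside the function, is a design choice that makes the helper easy to test with fixed inputs.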

Important Files Changed

Filename | Score | Overview
--- | --- | ---
.github/workflows/attach-wheels-to-release.yml | 1/5 | New workflow for automated wheel building with critical issues including undefined job references, deprecated actions, and logical errors
.github/actions/build-pytorch-wheel/action.yml | 1/5 | New composite action for wheel building with exit code masking, missing outputs, and unreachable cleanup code
.github/actions/build-pytorch-wheel/build.sh | 1/5 | Build script that captures but ignores exit codes, causing failed builds to appear successful
.github/scripts/check_for_ngc_images.sh | 2/5 | NGC image discovery script with macOS compatibility issues and error handling problems
.github/workflows/build.yml | 2/5 | Refactored build workflow addressing disk space issues but introducing manual container management complexity
.github/actions/build-pytorch-wheel/Dockerfile | 3/5 | Docker container setup with complex version mapping logic that may be fragile
transformer_engine/pytorch/setup.py | 4/5 | Updated wheel URL construction with improved NGC support and cleaner code structure
.gitignore | 5/5 | Standard addition of .venv to exclude Python virtual environments

Confidence score: 1/5

  • This PR requires extensive review and testing before merging due to multiple critical implementation issues that will prevent proper functionality
  • Score reflects serious problems with error handling, deprecated action usage, undefined references, and logical inconsistencies that would cause CI failures
  • Pay close attention to the workflow files, composite action, and build scripts which contain fundamental flaws that need correction
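
The exit-code masking flagged above for build.sh and the composite action usually comes down to a script that records `$?` and then finishes with status 0. A minimal sketch of the common remedy, assuming a hypothetical `run_build` wrapper (not a function from this PR), is to save the status and return it after any cleanup:

```shell
#!/bin/sh
# Hypothetical wrapper: run a build command, leave room for cleanup, then
# return the build's own exit code so the CI step fails when the build fails.
run_build() {
    "$@"
    status=$?
    # Cleanup or log collection would go here; it must not overwrite $status.
    echo "build exited with status $status"
    return "$status"
}

# Usage: `true` and `false` stand in for a real build command.
run_build true
run_build false || echo "failure was propagated to the caller"
```

In a workflow step, the non-zero return value is what marks the step as failed; printing the status alone would not.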

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 6 comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

4 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 5 comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

5 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

6 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

6 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

Additional Comments (1)

  1. .github/workflows/attach-wheels-to-release.yml, lines 152-170

    logic: Incomplete job. It builds wheels for NGC images but never uploads them to the release, so the wheels are discarded when the job ends. Add an upload step similar to the one at lines 141-150.

    Should this job also upload the NGC-built wheels to the release, or are they intended for a different purpose?

7 files reviewed, 2 comments

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

7 files reviewed, 1 comment

@greptile-apps greptile-apps bot left a comment

7 files reviewed, no comments

Signed-off-by: oliver könig <okoenig@nvidia.com>

@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments

@ko3n1g ko3n1g requested a review from ksivaman November 18, 2025 14:46
Signed-off-by: oliver könig <okoenig@nvidia.com>

@greptile-apps greptile-apps bot left a comment

8 files reviewed, 2 comments

Comment on lines +64 to +66
  - name: Start named container
    run: |
      docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity

logic: Container name 'builder' could conflict if multiple jobs run simultaneously - consider using unique names per job

Suggested change
-  - name: Start named container
-    run: |
-      docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity
+  - name: Start named container
+    run: |
+      docker run -v $(pwd):$(pwd) -w $(pwd) --name builder-pytorch -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity

Are these jobs guaranteed to run sequentially, or could they run in parallel causing container name conflicts?

Comment on lines +133 to +135
  - name: Start named container
    run: |
      docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d ghcr.io/nvidia/jax:jax sleep infinity

logic: Same container name 'builder' used in multiple jobs - will cause conflicts if jobs run in parallel

Suggested change
-  - name: Start named container
-    run: |
-      docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d ghcr.io/nvidia/jax:jax sleep infinity
+  - name: Start named container
+    run: |
+      docker run -v $(pwd):$(pwd) -w $(pwd) --name builder-all -d ghcr.io/nvidia/jax:jax sleep infinity
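
One possible way to address both comments at once, sketched under the assumption that the jobs can share a Docker daemon (for example on a self-hosted runner), is to derive the container name from GitHub Actions' default `GITHUB_JOB` and `GITHUB_RUN_ID` environment variables instead of hardcoding a suffix per job. The `container_name` helper and the `builder-` prefix are hypothetical and not part of this PR:

```shell
#!/bin/sh
# Hypothetical helper: build a per-job container name from GitHub Actions'
# default environment variables, falling back to placeholders when run locally.
container_name() {
    printf 'builder-%s-%s\n' "${GITHUB_JOB:-local}" "${GITHUB_RUN_ID:-0}"
}

name=$(container_name)
# Only print the command here; a workflow step would execute it instead.
echo "would run: docker run --name $name -d <image> sleep infinity"
```

Later steps that call `docker exec` or `docker rm` would need to compute the same name, for instance by calling the helper again within the same job.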

Signed-off-by: oliver könig <okoenig@nvidia.com>

@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments

@ksivaman

/te-ci pytorch

ksivaman previously approved these changes Nov 21, 2025

@ksivaman ksivaman left a comment

Most of the review was conducted offline. For release v2.10, we will ship this as is. For release v2.11, we plan some generalizations of the PyTorch wheel.

@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
@ksivaman ksivaman merged commit 632c4c3 into NVIDIA:main Nov 21, 2025
10 of 12 checks passed

@greptile-apps greptile-apps bot left a comment

8 files reviewed, no comments

ko3n1g added a commit to ko3n1g/TransformerEngine that referenced this pull request Nov 24, 2025
* ci: Build and attach bdist wheels to release page

Signed-off-by: oliver könig <okoenig@nvidia.com>

* free up space

Signed-off-by: oliver könig <okoenig@nvidia.com>

* cleanup

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* test

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* fix

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* c28619d8999a147d5e09c1199f84ff6af6ad5794

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Reduce months to check from 7 to 5

Signed-off-by: oliver könig <okoenig@nvidia.com>

* Update .github/scripts/check_for_ngc_images.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

* Update .github/actions/build-pytorch-wheel/build.sh

Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>

---------

Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
KshitijLakhani pushed a commit that referenced this pull request Nov 27, 2025