ci: Build and attach bdist wheels to release page #2138
Thanks for your review, @timmoon10. The suggestions all make sense to me. I'm still syncing with @ksivaman on some details and hope we can hash them out by the end of the week.
Greptile Overview
Greptile Summary
This PR introduces automated CI infrastructure to build and distribute PyTorch wheels for TransformerEngine releases. The changes implement a new GitHub Actions workflow that can be triggered on release creation or manually via dispatch events. The solution builds wheels against multiple PyTorch/CUDA/Python version combinations using both standard Docker containers and NGC (NVIDIA GPU Cloud) images to cover the last 5 months of available versions. The implementation includes a new composite GitHub Action (…
Important Files Changed
Confidence score: 1/5
4 files reviewed, 6 comments
4 files reviewed, 1 comment
4 files reviewed, 1 comment
4 files reviewed, 1 comment
4 files reviewed, no comments
4 files reviewed, no comments
4 files reviewed, 1 comment
6 files reviewed, 5 comments
6 files reviewed, 2 comments
6 files reviewed, no comments
6 files reviewed, 2 comments
6 files reviewed, no comments
5 files reviewed, no comments
6 files reviewed, 2 comments
6 files reviewed, 1 comment
6 files reviewed, no comments
Additional Comments (1)
- .github/workflows/attach-wheels-to-release.yml, lines 152-170
  logic: Incomplete job. It builds wheels for NGC images but never uploads them to the release, so the wheels are built and then discarded. Consider adding an upload step similar to lines 141-150.
  Should this job also upload the NGC-built wheels to the release, or are they intended for a different purpose?
7 files reviewed, 2 comments
7 files reviewed, 1 comment
7 files reviewed, no comments
7 files reviewed, 1 comment
7 files reviewed, no comments
7 files reviewed, 1 comment
7 files reviewed, 1 comment
7 files reviewed, no comments
8 files reviewed, no comments
Signed-off-by: oliver könig <okoenig@nvidia.com>
8 files reviewed, 2 comments
- name: Start named container
  run: |
    docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity
logic: Container name 'builder' could conflict if multiple jobs run simultaneously - consider using unique names per job
Suggested change:
  - name: Start named container
    run: |
-     docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity
+     docker run -v $(pwd):$(pwd) -w $(pwd) --name builder-pytorch -d nvcr.io/nvidia/cuda:12.8.0-devel-ubuntu22.04 sleep infinity
Are these jobs guaranteed to run sequentially, or could they run in parallel causing container name conflicts?
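One way to sidestep the question is to derive the container name from the run itself rather than hardcoding it. The sketch below is only an illustration and is not part of this PR: `unique_container_name` is a hypothetical helper, and it assumes the default GitHub Actions variables `GITHUB_JOB`, `GITHUB_RUN_ID`, and `GITHUB_RUN_ATTEMPT` are available in the step. Name collisions should only matter when several jobs share one Docker daemon (for example on a self-hosted runner), since GitHub-hosted runners start each job on a fresh machine.

```shell
#!/bin/sh
# Hypothetical helper (not part of this PR): build a container name that is
# unique per job, workflow run, and run attempt.
unique_container_name() (
  prefix=${1:-builder}
  # GITHUB_JOB, GITHUB_RUN_ID and GITHUB_RUN_ATTEMPT are default variables set
  # by GitHub Actions; the fallbacks only exist so the function runs locally.
  printf '%s-%s-%s-%s\n' \
    "$prefix" "${GITHUB_JOB:-local}" "${GITHUB_RUN_ID:-0}" "${GITHUB_RUN_ATTEMPT:-0}"
)
```

A step could then start the container with `docker run --name "$(unique_container_name builder)" ...` and reuse the same expression in later `docker exec` calls. Matrix entries of one job share the same `GITHUB_JOB`, so a matrix value would have to be appended as well if those entries can land on the same host.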
- name: Start named container
  run: |
    docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d ghcr.io/nvidia/jax:jax sleep infinity
logic: Same container name 'builder' used in multiple jobs - will cause conflicts if jobs run in parallel
Suggested change:
  - name: Start named container
    run: |
-     docker run -v $(pwd):$(pwd) -w $(pwd) --name builder -d ghcr.io/nvidia/jax:jax sleep infinity
+     docker run -v $(pwd):$(pwd) -w $(pwd) --name builder-all -d ghcr.io/nvidia/jax:jax sleep infinity
Signed-off-by: oliver könig <okoenig@nvidia.com>
8 files reviewed, no comments
/te-ci pytorch
ksivaman left a comment
Most of the review was conducted offline. For release v2.10 we will release as is. For release v2.11, we plan some generalizations of the PyTorch wheel.
8 files reviewed, no comments
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
8 files reviewed, no comments
* ci: Build and attach bdist wheels to release page Signed-off-by: oliver könig <okoenig@nvidia.com>
* free up space Signed-off-by: oliver könig <okoenig@nvidia.com>
* cleanup Signed-off-by: oliver könig <okoenig@nvidia.com>
* test Signed-off-by: oliver könig <okoenig@nvidia.com>
* test Signed-off-by: oliver könig <okoenig@nvidia.com>
* test Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix Signed-off-by: oliver könig <okoenig@nvidia.com>
* test Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix Signed-off-by: oliver könig <okoenig@nvidia.com>
* fix Signed-off-by: oliver könig <okoenig@nvidia.com>
* c28619d8999a147d5e09c1199f84ff6af6ad5794 Signed-off-by: oliver könig <okoenig@nvidia.com>
* c28619d8999a147d5e09c1199f84ff6af6ad5794 Signed-off-by: oliver könig <okoenig@nvidia.com>
* Reduce months to check from 7 to 5 Signed-off-by: oliver könig <okoenig@nvidia.com>
* Update .github/scripts/check_for_ngc_images.sh Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
* Update .github/actions/build-pytorch-wheel/build.sh Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
---------
Signed-off-by: oliver könig <okoenig@nvidia.com>
Signed-off-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Co-authored-by: Kirthi Shankar Sivamani <ksivamani@nvidia.com>
Description
This PR adds CI workflows that build PyTorch wheels and attach them to the TE release page. The workflows are triggered on release creation; manual runs are also possible via a dispatch event.
By default, wheels are built for one hardcoded torch version and the latest 5 NGC PyT versions.
It also fixes an issue where the existing build workflows ran out of storage.
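As a rough illustration of how a list of recent NGC PyT versions could be enumerated, the sketch below prints candidate image tags for a window of months. It is not the PR's `.github/scripts/check_for_ngc_images.sh`: the helper name `ngc_pytorch_tags` is hypothetical, and the code assumes the monthly `YY.MM-py3` tag scheme used for NGC PyTorch images. It only generates candidates and does not check that an image was actually published for a given month.

```shell
#!/bin/sh
# Hypothetical helper (not the PR's check_for_ngc_images.sh): print candidate
# NGC PyTorch tags for the COUNT months ending at YEAR-MONTH, newest first.
# Assumes the monthly YY.MM-py3 tag scheme.
ngc_pytorch_tags() (
  year=$1
  month=${2#0}   # drop a leading zero so "08" is not read as octal
  count=$3
  i=0
  while [ "$i" -lt "$count" ]; do
    # Months elapsed since year 0, stepped back by i months.
    idx=$(( year * 12 + month - 1 - i ))
    printf '%02d.%02d-py3\n' $(( idx / 12 % 100 )) $(( idx % 12 + 1 ))
    i=$(( i + 1 ))
  done
)
```

For example, `ngc_pytorch_tags "$(date +%Y)" "$(date +%m)" 5` would list five candidates ending at the current month. Each candidate would still need to be probed against the registry before a build is scheduled for it, because a month may not have a published image.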