Fix Github workflows issues #2636
base: main
Signed-off-by: Pawel Gadzinski <[email protected]>
Greptile Overview
Greptile Summary
This PR resolves three critical CI/CD issues:
All four build jobs (core, pytorch, jax, all) now use consistent CUDA 13.0.0 containers with properly configured dependency installations.
Confidence Score: 4/5
Important Files Changed
Sequence Diagram
sequenceDiagram
participant GHA as GitHub Actions
participant OLD as Old Build (Docker-in-Docker)
participant NEW as New Build (Direct Container)
participant Build as Build Process
Note over GHA,Build: Old Architecture (Docker-in-Docker)
GHA->>OLD: Start ubuntu-latest runner
OLD->>OLD: maximize-build-space action
OLD->>OLD: docker run builder container
OLD->>OLD: docker exec (MAX_JOBS not propagated)
OLD->>Build: pip install (OOM due to no MAX_JOBS limit)
Note over GHA,Build: New Architecture (Direct Container)
GHA->>NEW: Start with CUDA 13.0 container
NEW->>NEW: apt-get install dependencies
NEW->>NEW: git config safe.directory
NEW->>NEW: MAX_JOBS=1 in env (properly propagated)
NEW->>Build: pip install (memory controlled)
Build-->>NEW: Success (no OOM)
Note over GHA,Build: Documentation Deployment
GHA->>GHA: upload-pages-artifact@v3 (was v1)
GHA->>GHA: deploy-pages@v4 (was v2)
GHA->>GHA: Fixes compatibility issues
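A minimal sketch of the "direct container" layout the diagram describes, in which the job runs inside the CUDA image so that step-level `env` values such as `MAX_JOBS` reach the build. The job name, the exact image tag, the apt package list, and the checkout version are assumptions for illustration, not the PR's actual diff.

```yaml
# Hypothetical job; names, image tag, and packages are illustrative assumptions.
jobs:
  core:
    runs-on: ubuntu-latest
    container:
      image: nvcr.io/nvidia/cuda:13.0.0-devel-ubuntu22.04   # assumed tag
      options: --user root
    steps:
      - name: 'Dependencies'
        run: |
          apt-get update
          apt-get install -y git python3 python3-pip
      - name: 'Checkout'
        uses: actions/checkout@v4
      - name: 'Mark workspace as safe'
        # The checkout may be owned by a different user than the container's root.
        run: git config --global --add safe.directory "$GITHUB_WORKSPACE"
      - name: 'Build'
        run: pip install --no-build-isolation . -v
        env:
          NVTE_FRAMEWORK: none
          MAX_JOBS: 1   # applied directly to the build process, no docker exec in between
```

With the old layout, the same variable would have had to be forwarded explicitly into `docker exec`, which is the gap the diagram points at.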
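The documentation deployment change can be sketched as follows. Only the two action versions come from the summary; the job names, the docs build command, and the output path are hypothetical.

```yaml
# Hypothetical docs workflow fragment; only the @v3 / @v4 versions come from the summary.
jobs:
  build-docs:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: 'Build docs'
        run: make -C docs html            # assumed build command
      - uses: actions/upload-pages-artifact@v3
        with:
          path: docs/_build/html          # assumed output directory
  deploy:
    needs: build-docs
    runs-on: ubuntu-latest
    permissions:
      pages: write
      id-token: write
    environment:
      name: github-pages
      url: ${{ steps.deployment.outputs.page_url }}
    steps:
      - id: deployment
        uses: actions/deploy-pages@v4
```

The two actions are bumped together because the deploy step consumes the artifact produced by the upload step.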
1 file reviewed, no comments
Signed-off-by: Pawel Gadzinski <[email protected]>
1 file reviewed, no comments
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 1 comment
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 1 comment
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 1 comment
.github/workflows/build.yml
Outdated
      root-reserve-mb: 4096
      temp-reserve-mb: 32
    - swap-size-mb: 10240
    + swap-size-mb: 4096
Verify that reduced memory allocation (root: 5120→4096 MB, swap: 10240→4096 MB) is sufficient for PyTorch builds to avoid OOM issues.
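One way to check the concern above is to print the resulting memory and swap right after the space-maximizing step. This is a sketch only: the action owner and ref (`easimon/maximize-build-space@master`) are an assumption, since the page names the action only as "maximize-build-space", and the reporting step is hypothetical.

```yaml
# Assumed action reference; the input values are the ones shown in the diff above.
- name: 'Maximize build space'
  uses: easimon/maximize-build-space@master
  with:
    root-reserve-mb: 4096
    temp-reserve-mb: 32
    swap-size-mb: 4096
- name: 'Report memory and swap'   # hypothetical diagnostic step
  run: |
    head -n 5 /proc/meminfo
    cat /proc/swaps
    df -h /
```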
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 2 comments
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
        pip install torch --no-cache-dir
Missing --index-url https://download.pytorch.org/whl/cu130 which was present before. This may install CPU-only PyTorch instead of the CUDA version needed for testing.
Suggested change:
    - pip install torch --no-cache-dir
    + pip install torch --no-cache-dir --index-url https://download.pytorch.org/whl/cu130
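A sketch of how the suggested install could be verified in the same step, assuming `python3` is on the path. The check relies on `torch.version.cuda`, which is `None` for CPU-only builds. The quoting of `importlib-metadata>=1.0` is an addition not raised in the review: unquoted, bash parses `>=1.0` as a redirect to a file named `=1.0`, so the version constraint is silently dropped.

```yaml
# Hypothetical step; package list shortened for illustration.
- name: 'Dependencies'
  run: |
    # Quote version specifiers so the shell does not treat ">" as a redirect.
    pip install --no-cache-dir "importlib-metadata>=1.0" packaging numpy
    # Pull the CUDA 13.0 wheel from the PyTorch index, as the suggestion proposes.
    pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
    # Fail fast if a CPU-only wheel was installed.
    python3 -c "import torch; assert torch.version.cuda, 'CPU-only torch'; print(torch.version.cuda)"
```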
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install cmake==3.21.0 pybind11[global] einops onnxscript
        pip install torch --no-cache-dir
Missing --index-url https://download.pytorch.org/whl/cu130 which was present before. This may install CPU-only PyTorch instead of the CUDA version.
Suggested change:
    - pip install torch --no-cache-dir
    + pip install torch --no-cache-dir --index-url https://download.pytorch.org/whl/cu130
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 2 comments
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 when installing PyTorch. Without this, pip will install the default PyTorch from PyPI, which may be CPU-only or have an incompatible CUDA version. This was present in the base commit for the all job and is needed here too.
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 when installing PyTorch. The base commit explicitly used this index URL to ensure CUDA 13.0 support matching the JAX container (see commit 4cf2f12). Without it, the default PyPI version will be installed, which may be CPU-only.
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 2 comments
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. The JAX container has CUDA 13.0, and commit 4cf2f12 explicitly added this index URL to ensure compatibility. Without it, pip may install a CPU-only or incompatible CUDA version from PyPI.
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was explicitly added in commit 4cf2f12 to match the CUDA 13.0 version in the JAX container. Without it, the default PyPI version will be installed, which may be CPU-only or have an incompatible CUDA version.
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 2 comments
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was explicitly added in commit 4cf2f12 to match CUDA 13.0 in the JAX container. Without it, pip installs the default PyPI version (likely CPU-only or wrong CUDA version).
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was present in the base commit and is required to match CUDA 13.0 in the JAX container (see commit 4cf2f12).
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 2 comments
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch. The JAX container has CUDA 13.0, but without this flag pip installs the default PyPI version (likely CPU-only or wrong CUDA version). This was explicitly added in commit 4cf2f12 (#2308) for this exact reason.
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
.github/workflows/build.yml
Outdated
    - name: 'Dependencies'
      run: |
        pip install --no-cache-dir cmake==3.21.0 pybind11[global] einops onnxscript
        pip install --no-cache-dir torch
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch. Without this, pip installs the default PyPI version which may not match the JAX container's CUDA 13.0. This was present in the base commit for this exact job.
Suggested change:
    - pip install --no-cache-dir torch
    + pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 1 comment
.github/workflows/build.yml
Outdated
    run: NVTE_USE_CCACHE=1 NVTE_CCACHE_BIN=sccache pip install --no-build-isolation . -v
    env:
      NVTE_FRAMEWORK: none
      MAX_JOBS: 1
      SCCACHE_GHA_ENABLED: "true"
      SCCACHE_CACHE_SIZE: "5G"
MAX_JOBS=1 removed from Core job. Check that the build completes successfully without this limit to prevent OOM issues.
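To judge whether the limit can be dropped safely, a diagnostic step after the build could report cache hits and memory. This is a sketch under assumptions: the diagnostic step is hypothetical and not part of the PR, and it presumes `sccache` is already installed and on the path.

```yaml
- name: 'Build'
  run: NVTE_USE_CCACHE=1 NVTE_CCACHE_BIN=sccache pip install --no-build-isolation . -v
  env:
    NVTE_FRAMEWORK: none
    MAX_JOBS: 1                  # keep, or raise gradually, until peak memory use is known
    SCCACHE_GHA_ENABLED: "true"
    SCCACHE_CACHE_SIZE: "5G"
- name: 'Build diagnostics'      # hypothetical step
  if: always()                   # run even when the build step fails, e.g. after an OOM kill
  run: |
    sccache --show-stats
    head -n 5 /proc/meminfo
```

A warm cache lowers the number of real compiler invocations, so a build that passes without the limit on a cache hit may still run out of memory on a cold cache.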
Signed-off-by: Pawel Gadzinski <[email protected]>
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, no comments
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, no comments
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 2 comments
.github/workflows/build.yml
Outdated
    image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
    options: --user root
Check that CUDA 12.1.0 (downgraded from 12.8.0) does not break PyTorch compatibility or cause runtime issues.
      container:
    -   image: ghcr.io/nvidia/jax:jax
    +   image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
        options: --user root
Switched from ghcr.io/nvidia/jax:jax to a base CUDA container. Verify that the jax[cuda12] install is compatible with CUDA 12.1 and includes all necessary dependencies.
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, 1 comment
.github/workflows/build.yml
Outdated
    apt-get update
    apt-get install -y git python3.9 pip cudnn9-cuda-12
    pip install cmake==3.21.0 pybind11[global] ninja packaging
    pip install jax
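pip install jax installs CPU-only JAX by default. A CUDA-enabled extra such as jax[cuda12] is needed, chosen to match the container's CUDA major version (this step installs cudnn9-cuda-12, which points to CUDA 12 rather than the CUDA 13.0 named in the summary).
Suggested change:
    - pip install jax
    + pip install jax[cuda12]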
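A sketch of the suggested install with a quick check that CUDA-related JAX packages were actually pulled in. The apt package names and the `pip list` check are illustrative assumptions. The check only lists installed distributions, so it also works on a runner without a GPU.

```yaml
# Hypothetical step; package names other than jax[cuda12] are assumptions.
- name: 'Dependencies'
  run: |
    apt-get update
    apt-get install -y git python3 python3-pip
    # Quoted so the shell never treats the brackets as a glob pattern.
    pip install "jax[cuda12]"
    python3 -c "import jax; print(jax.__version__)"
    # CUDA plugin packages should show up here next to jax and jaxlib.
    pip list | grep -i jax
```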
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, no comments
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, no comments
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, no comments
Signed-off-by: Pawel Gadzinski <[email protected]>
2 files reviewed, no comments
Description
This PR fixes the following issues:
Fixes # (issue)
Type of change
Checklist: