Fix Github workflows issues #2636
base: main
Changes from 9 commits
ee0cca1
df6a81b
daa3bf3
d47399a
d2091f2
4dc9323
4171efe
b44ec74
23ea443
a0a528f
95c333f
cb3fa26
cc3c5b1
52f6cb2
4ebbe22
89d2985
b5f554e
30cd354
ba96b42
7ad602f
d3bfeeb
03488d9
@@ -10,7 +10,7 @@ on:
jobs:
  core:
    name: 'Core'
    runs-on: ubuntu-latest
    runs-on: ubuntu-latest-8-cores
    container:
      image: nvcr.io/nvidia/cuda:12.1.0-devel-ubuntu22.04
      options: --user root
@@ -30,63 +30,39 @@ jobs:
        run: NVTE_USE_CCACHE=1 NVTE_CCACHE_BIN=sccache pip install --no-build-isolation . -v
        env:
          NVTE_FRAMEWORK: none
          MAX_JOBS: 1
          SCCACHE_GHA_ENABLED: "true"
          SCCACHE_CACHE_SIZE: "5G"
      - name: 'Sanity check'
        run: python3 -c "import transformer_engine"
        working-directory: /
  pytorch:
    name: 'PyTorch'
    runs-on: ubuntu-latest
    runs-on: ubuntu-latest-8-cores
    container:
      image: ghcr.io/nvidia/jax:jax
      options: --user root
    steps:
      - name: Move /var/lib/docker/
        shell: bash -euxo pipefail {0}
        run: sudo mv /var/lib/docker/ "${GITHUB_WORKSPACE}/docker"

      - name: Maximize build space
        uses: easimon/maximize-build-space@c28619d8999a147d5e09c1199f84ff6af6ad5794
        with:
          root-reserve-mb: 5120
          temp-reserve-mb: 32
          swap-size-mb: 10240
          remove-dotnet: 'true'
          remove-android: 'true'
          remove-haskell: 'true'
          remove-codeql: 'true'
          build-mount-path: '/var/lib/docker/'

      - name: Restore /var/lib/docker/
        shell: bash -euxo pipefail {0}
        run: sudo sh -c "mv ${GITHUB_WORKSPACE}/docker/* /var/lib/docker"

      - name: 'Dependencies'
        run: |
          pip install --no-cache-dir cmake==3.21.0 pybind11[global] ninja pydantic importlib-metadata>=1.0 packaging numpy einops onnxscript
          pip install --no-cache-dir torch
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Outdated
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. The JAX container has CUDA 13.0, and commit 4cf2f12 explicitly added this index URL to ensure compatibility. Without it, pip may install a CPU-only or incompatible CUDA version from PyPI.
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Outdated
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was explicitly added in commit 4cf2f12 to match CUDA 13.0 in the JAX container. Without it, pip installs the default PyPI version (likely CPU-only or wrong CUDA version).
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Outdated
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch. The JAX container has CUDA 13.0, but without this flag pip installs the default PyPI version (likely CPU-only or wrong CUDA version). This was explicitly added in commit 4cf2f12 (#2308) for this exact reason.
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Switched from ghcr.io/nvidia/jax:jax to the base CUDA container. Verify that the JAX[cuda12] install is compatible with CUDA 12.1 and includes all necessary dependencies.
Outdated
Missing --index-url https://download.pytorch.org/whl/cu130 when installing PyTorch. The base commit explicitly used this index URL to ensure CUDA 13.0 support matching the JAX container (see commit 4cf2f12). Without it, the default PyPI version will be installed, which may be CPU-only.
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Outdated
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was explicitly added in commit 4cf2f12 to match the CUDA 13.0 version in the JAX container. Without it, the default PyPI version will be installed, which may be CPU-only or have incompatible CUDA version.
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Outdated
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch installation. This was present in the base commit and is required to match CUDA 13.0 in the JAX container (see commit 4cf2f12).
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
Outdated
Missing --index-url https://download.pytorch.org/whl/cu130 for PyTorch. Without this, pip installs the default PyPI version which may not match the JAX container's CUDA 13.0. This was present in the base commit for this exact job.
Suggested change:
- pip install --no-cache-dir torch
+ pip install --no-cache-dir torch --index-url https://download.pytorch.org/whl/cu130
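The review comments on the PyTorch install all raise the same concern: without the cu130 index URL, pip may pull a CPU-only wheel or one built for a different CUDA version. A minimal sketch of a check that could run after the install step is shown here. The helper names (`cuda_major`, `torch_matches_cuda`, `check_installed_torch`) are hypothetical and not part of this PR; the expected major version of 13 is taken from the comments, and `torch.version.cuda` is the string PyTorch reports for the CUDA toolkit the wheel was built against (`None` for CPU-only builds).

```python
from typing import Optional


def cuda_major(version: Optional[str]) -> Optional[int]:
    """Return the major CUDA version from a string such as '13.0', or None."""
    if not version:
        return None
    return int(version.split(".")[0])


def torch_matches_cuda(torch_cuda: Optional[str], expected_major: int) -> bool:
    """True when the wheel's CUDA major version equals the expected one."""
    return cuda_major(torch_cuda) == expected_major


def check_installed_torch(expected_major: int = 13) -> str:
    """Raise if the installed torch wheel targets a different CUDA major version.

    The default of 13 is an assumption based on the review comments.
    """
    import torch  # imported lazily so the helpers above work without torch

    found = torch.version.cuda  # None for CPU-only builds
    if not torch_matches_cuda(found, expected_major):
        raise RuntimeError(
            f"torch was built for CUDA {found}, expected {expected_major}.x"
        )
    return f"torch {torch.__version__} built for CUDA {found}"
```

In a workflow this could be called from a step placed after 'Dependencies', for example by running a script that calls `check_installed_torch()`, so that a wrong wheel fails the job early rather than during the build.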
MAX_JOBS=1 was removed from the Core job. Check that the build still completes successfully without this limit, since it guarded against OOM issues.
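One way to address the OOM concern without pinning the build to a single job is to derive the job count from the runner's cores and available memory. The following is a rough sketch under stated assumptions: the function names are hypothetical, the 4 GiB-per-job budget is an illustrative guess rather than a measured figure for this build, and `/proc/meminfo` is only available on Linux runners.

```python
import os
from typing import Optional


def read_mem_available_gib(path: str = "/proc/meminfo") -> Optional[float]:
    """Return MemAvailable in GiB on Linux, or None if it cannot be read."""
    try:
        with open(path) as f:
            for line in f:
                if line.startswith("MemAvailable:"):
                    # The value is reported in kB.
                    return int(line.split()[1]) / (1024 * 1024)
    except OSError:
        return None
    return None


def pick_max_jobs(cpu_count: int, mem_gib: float, gib_per_job: float = 4.0) -> int:
    """Cap parallel jobs by both core count and a per-job memory budget.

    gib_per_job is an assumed budget; tune it for the real build.
    """
    by_memory = int(mem_gib // gib_per_job)
    return max(1, min(cpu_count, by_memory))


def suggested_max_jobs() -> int:
    """Combine the two helpers; fall back to 1 job if memory is unknown."""
    mem = read_mem_available_gib()
    if mem is None:
        return 1
    return pick_max_jobs(os.cpu_count() or 1, mem)
```

The result of `suggested_max_jobs()` could be exported as `MAX_JOBS` before the `pip install` step. It always returns at least 1, so the previous behaviour remains the lower bound.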