Commit 8426b66

ci: Migrate to using Nvidia Github Runners (#694)
* Test nv runner
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix testing
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Add Azure login
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Add Azure CLI
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Testing
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Ensure Azure CLI exists
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update id-token permissions
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Test login
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Use environment for Azure login
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Login to Azure nemoci
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Debug runner docker
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Use nv-gh-runner for building
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix the build container step
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix passing secrets to build step
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix build
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Pass has-azure-credentials to build step
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Pass in environment to build container step
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix build environment
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Do not use inline cache
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Ensure we use PR number for build cache
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Test GPU runner
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix GPU test
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* debug test failures
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Debug test
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Test build cache
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix build template
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Use hash of Dockerfile and pyproject.toml for tag
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Echo image tag hash
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Mount repo code to the test container
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix the checkout step when running tests
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Truncate to 12 characters for the image tag hash
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Revert "debug test failures"
  This reverts commit ea68000.
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update build-container template ref
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Force build for cache
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Skip build if possible
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update comments in gpuci.yml
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* debug test
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Use run_id as image tag
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Remove generate-image-tag as needed step
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* debug test_classifier
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Revert test change
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Set auto_sync_ready to true
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update build container template ref
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Remove unused skip-build-if-exists
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update to only run gpuci if certain files are changed
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Fix changed files step
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update build-container template ref
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* debug
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update changed files ref
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Update build contianer ref
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Increase gpu timeout to 40m
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>
* Add tests directory to track changed files for running gpu tests
  Signed-off-by: Charlie Truong <chtruong@nvidia.com>

---------

Signed-off-by: Charlie Truong <chtruong@nvidia.com>
1 parent 20fd237 commit 8426b66

3 files changed: +98 -67 lines changed

.github/copy-pr-bot.yaml

Lines changed: 1 addition & 1 deletion
@@ -26,4 +26,4 @@ additional_vetters:
   - VibhuJawa
   - arhamm1
 auto_sync_draft: false
-auto_sync_ready: false
+auto_sync_ready: true

.github/workflows/gpuci.yml

Lines changed: 90 additions & 56 deletions
@@ -4,41 +4,62 @@ on:
   push:
     branches:
       - main
-  pull_request:
-    branches:
-      # We can run gpuCI on any PR targeting these branches
-      - "main"
-      - "[rv][0-9].[0-9].[0-9]"
-      - "[rv][0-9].[0-9].[0-9]rc[0-9]"
-    # PR has to be labeled with "gpuCI" label
-    types: [labeled, synchronize]
+      - "pull-request/[0-9]+"

 concurrency:
   group: ${{ github.workflow }}-${{ github.event.pull_request.number || github.ref }}
   cancel-in-progress: true

+permissions:
+  id-token: write
+  contents: read
+
 jobs:
+  changed-files:
+    runs-on: ubuntu-latest
+    outputs:
+      any_changed: ${{ steps.changed-files.outputs.any_changed }}
+      changed_files: ${{ steps.changed-files.outputs.all_changed_files }}
+    steps:
+      - name: Checkout repository
+        uses: actions/checkout@v4
+        with:
+          fetch-depth: 0
+
+      - name: Get PR info
+        id: get-pr-info
+        if: startsWith(github.ref, 'refs/heads/pull-request/')
+        uses: nv-gha-runners/get-pr-info@main
+
+      - name: Determine base reference
+        id: base-ref
+        run: |
+          if [[ "${{ github.ref }}" == refs/heads/pull-request/* ]]; then
+            # For PR branches, use the base branch from PR info
+            echo "base=${{ fromJSON(steps.get-pr-info.outputs.pr-info).base.ref }}" >> $GITHUB_OUTPUT
+          else
+            # For other branches, use the last commit
+            echo "base=HEAD~1" >> $GITHUB_OUTPUT
+          fi
+
+      - name: Get changed files
+        id: changed-files
+        uses: step-security/changed-files@v45.0.1
+        with:
+          files: |
+            nemo_curator/**
+            config/**
+            .github/**
+            pyproject.toml
+            Dockerfile
+            tests/**
+          base_sha: ${{ steps.base-ref.outputs.base }}
+
   # First, we build and push a NeMo Curator container
   build-container:
-    # This block covers 3 cases when gpuCI should be triggered:
-    # 1. The PR has the "gpuCI" label and is opened by a maintainer.
-    #    In this case, gpuCI will autorun on any subsequent pushes to the PR,
-    #    as long as the "gpuCI" label is not removed.
-    # 2. The "gpuCI" label is added to the PR. If a non-maintainer opened the PR,
-    #    then subsequent pushes to the PR will not autorun gpuCI
-    #    unless the "gpuCI" label is removed and re-added again.
-    # 3. PR is merged to main.
-    if: >-
-      (
-        contains(github.event.pull_request.labels.*.name, 'gpuci') &&
-        contains(
-          '["ayushdg", "ko3n1g", "praateekmahajan", "ryantwolf", "sarahyurick", "VibhuJawa"]',
-          github.event.pull_request.user.login
-        )
-      ) ||
-      (github.event.label.name == 'gpuci') ||
-      (github.ref == 'refs/heads/main')
-    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_build_container.yml@v0.18.0
+    uses: NVIDIA/NeMo-FW-CI-templates/.github/workflows/_build_container.yml@v0.29.0
+    needs: [changed-files]
+    if: ${{ needs.changed-files.outputs.any_changed == 'true' }}
     with:
       image-name: nemo_curator_container
       dockerfile: Dockerfile
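
The `Determine base reference` step in this hunk chooses what the changed-files check compares against: the PR's base branch for `pull-request/*` refs, otherwise the previous commit. A minimal sketch of that branching as a standalone POSIX shell function follows. The function name `determine_base` and its two positional arguments are hypothetical, introduced only for illustration; the workflow itself takes the ref and the PR info from GitHub Actions expressions and writes the result to `$GITHUB_OUTPUT`.

```shell
#!/bin/sh
# Hypothetical helper mirroring the branching in the "Determine base reference" step.
#   $1 = full git ref (for example refs/heads/pull-request/694)
#   $2 = base branch of the PR (only consulted for pull-request/* refs)
determine_base() {
  ref="$1"
  pr_base="${2:-}"
  case "$ref" in
    # Mirrored PR branches: compare against the PR's base branch
    refs/heads/pull-request/*) printf '%s\n' "$pr_base" ;;
    # Any other branch: compare against the previous commit
    *) printf '%s\n' "HEAD~1" ;;
  esac
}

# Usage: print the base for a mirrored PR branch and for main
determine_base "refs/heads/pull-request/694" "main"
determine_base "refs/heads/main"
```

The sketch uses `case` rather than the `[[ ... == pattern ]]` test from the workflow so that it also runs under a plain POSIX `sh`; the matching rule (a prefix check on `refs/heads/pull-request/`) is the same.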
@@ -48,36 +69,44 @@ jobs:
         REPO_URL=https://github.com/${{ github.repository }}.git
         CURATOR_COMMIT=${{ github.sha }}
       prune-filter-timerange: 24h
+      runner: linux-amd64-cpu8
+      has-azure-credentials: true
+      use-inline-cache: false
+    secrets:
+      AZURE_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }}
+      AZURE_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }}
+      AZURE_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

   # Then, we run our PyTests in the container we just built
   run-gpu-tests:
-    needs: build-container
-    # This is the tag on our Azure runner found in Actions -> Runners -> Self-hosted runners
-    # It has 2 A100 GPUs
-    runs-on: self-hosted-azure
-    # Unit tests should not take longer than 30 minutes
-    timeout-minutes: 30
-    # This block covers 3 cases when gpuCI should be triggered:
-    # 1. The PR has the "gpuCI" label and is opened by a maintainer.
-    #    In this case, gpuCI will autorun on any subsequent pushes to the PR,
-    #    as long as the "gpuCI" label is not removed.
-    # 2. The "gpuCI" label is added to the PR. If a non-maintainer opened the PR,
-    #    then subsequent pushes to the PR will not autorun gpuCI
-    #    unless the "gpuCI" label is removed and re-added again.
-    # 3. PR is merged to main.
-    if: >-
-      (
-        contains(github.event.pull_request.labels.*.name, 'gpuci') &&
-        contains(
-          '["ayushdg", "ko3n1g", "praateekmahajan", "ryantwolf", "sarahyurick", "VibhuJawa"]',
-          github.event.pull_request.user.login
-        )
-      ) ||
-      (github.event.label.name == 'gpuci') ||
-      (github.ref == 'refs/heads/main')
+    needs: [build-container]
+    if: ${{ needs.changed-files.outputs.any_changed == 'true' }}
+    runs-on: linux-amd64-gpu-rtxa6000-latest-1
+    environment: nemo-ci
+    # Unit tests should not take longer than 40 minutes including docker pull and startup time
+    timeout-minutes: 40
     env:
       DIR: ${{ github.run_id }}
     steps:
+      - name: Install Azure CLI
+        run: |
+          curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
+
+      - name: Azure Login
+        uses: azure/login@v2
+        with:
+          client-id: ${{ secrets.AZURE_CLIENT_ID }}
+          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
+          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}
+
+      - name: Azure ACR Login
+        run: |
+          az acr login --name nemoci
+          which docker
+
+      - name: Checkout NeMo-Curator
+        uses: actions/checkout@v4
+
       # If something went wrong during the last cleanup, this step ensures any existing container is removed
       - name: Remove existing container if it exists
         run: |
@@ -86,15 +115,19 @@ jobs:
           fi

       # This runs the container which was pushed by build-container, which we call "nemo-curator-container"
-      # `--gpus all` ensures that all of the GPUs from our self-hosted-azure runner are available in the container
-      # We use "github.run_id" to identify the PR with the commits we want to run the PyTests with
+      # `--gpus all` ensures that all of the GPUs from our runner are available in the container
       # `bash -c "sleep infinity"` keeps the container running indefinitely without exiting
       - name: Run Docker container
         run: |
-          docker run --gpus all --name nemo-curator-container -d nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }} bash -c "sleep infinity"
+          docker run \
+            --gpus all \
+            --name nemo-curator-container \
+            -d \
+            --volume ${{ github.workspace }}:/opt/NeMo-Curator \
+            nemoci.azurecr.io/nemo_curator_container:${{ github.run_id }} \
+            bash -c "sleep infinity"

-      # Expect `whoami` to be "azureuser"
-      # Expect `nvidia-smi` to show our 2 A100 GPUs
+      # Expect `nvidia-smi` to show available GPUs
       - name: Check GPUs
         run: |
           whoami
@@ -127,6 +160,7 @@ jobs:

           docker exec nemo-curator-container coverage xml

+          mkdir -p $DIR
           docker cp nemo-curator-container:/opt/.coverage $DIR/.coverage
           docker cp nemo-curator-container:/opt/coverage.xml $DIR/coverage.xml
           coverage_report="codecov"
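
The added `mkdir -p $DIR` creates the destination directory before the two `docker cp` calls write into it. A small sketch of the same pattern follows, using plain `cp` and a temporary directory as stand-ins for `docker cp` and the runner workspace. Both stand-ins, the run id `12345`, and the variable names are assumptions for illustration; the sketch shows why creating the directory first is a safe habit and does not claim how `docker cp` itself reports a missing directory.

```shell
#!/bin/sh
# Stand-in for the runner workspace (assumption: the real job uses its working directory).
workdir="$(mktemp -d)"
DIR="$workdir/12345"            # hypothetical run id, playing the role of github.run_id
src="$workdir/coverage.xml"
printf '<coverage/>\n' > "$src"

# Copying into a directory that does not exist yet fails.
if cp "$src" "$DIR/coverage.xml" 2>/dev/null; then
  first_attempt="ok"
else
  first_attempt="failed"
fi

# mkdir -p is idempotent: it succeeds whether or not the directory already exists,
# so it is safe on both fresh and reused workspaces.
mkdir -p "$DIR"
mkdir -p "$DIR"
cp "$src" "$DIR/coverage.xml"

echo "first attempt: $first_attempt"
```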

Dockerfile

Lines changed: 7 additions & 10 deletions
@@ -7,7 +7,7 @@ ARG IMAGE_LABEL
 ARG REPO_URL
 ARG CURATOR_COMMIT

-FROM rapidsai/ci-conda:cuda${CUDA_VER}-${LINUX_VER}-py${PYTHON_VER} as curator-update
+FROM rapidsai/ci-conda:cuda${CUDA_VER}-${LINUX_VER}-py${PYTHON_VER} AS curator-update
 # Needed to navigate to and pull the forked repository's changes
 ARG REPO_URL
 ARG CURATOR_COMMIT
@@ -24,7 +24,7 @@ RUN bash -exu <<EOF
 EOF


-FROM rapidsai/ci-conda:cuda${CUDA_VER}-${LINUX_VER}-py${PYTHON_VER}
+FROM rapidsai/ci-conda:cuda${CUDA_VER}-${LINUX_VER}-py${PYTHON_VER} AS deps
 LABEL "nemo.library"=${IMAGE_LABEL}
 WORKDIR /opt

@@ -50,15 +50,12 @@ RUN \
     --mount=type=bind,source=/opt/NeMo-Curator/pyproject.toml,target=/opt/NeMo-Curator/pyproject.toml,from=curator-update \
     cd /opt/NeMo-Curator && \
     source activate curator && \
-    pip install ".[all]"
+    pip install --extra-index-url https://pypi.nvidia.com -e ".[all]"

-COPY --from=curator-update /opt/NeMo-Curator/ /opt/NeMo-Curator/

-# Clone the user's repository, find the relevant commit, and install everything we need
-RUN bash -exu <<EOF
-  source activate curator
-  cd /opt/NeMo-Curator/
-  pip install --extra-index-url https://pypi.nvidia.com ".[all]"
-EOF
+FROM rapidsai/ci-conda:cuda${CUDA_VER}-${LINUX_VER}-py${PYTHON_VER} AS final

 ENV PATH /opt/conda/envs/curator/bin:$PATH
+LABEL "nemo.library"=${IMAGE_LABEL}
+WORKDIR /opt
+COPY --from=deps /opt/conda/envs/curator /opt/conda/envs/curator
