ci: test gpu on self-hosted runners #108

Open
wants to merge 4 commits into base: master
37 changes: 29 additions & 8 deletions .github/workflows/ci.yml
@@ -1,8 +1,12 @@
name: CI

on: [pull_request, push]
on:
pull_request:
push:
branches:
- master

# Cancel a job if there's a new on on the same branch started.
# Cancel a job if there's a new one on the same branch started.
# Based on https://stackoverflow.com/questions/58895283/stop-already-running-workflow-job-in-github-actions/67223051#67223051
concurrency:
group: ${{ github.ref }}
@@ -14,8 +18,7 @@ env:
# Faster crates.io index checkout.
CARGO_REGISTRIES_CRATES_IO_PROTOCOL: sparse
RUST_LOG: debug
# Build the kernel only for the single architecture . This should reduce
# the overall compile-time significantly.
# Build the kernel only for the single architecture. This should reduce the overall compile-time significantly.
EC_GPU_CUDA_NVCC_ARGS: --fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75
BELLMAN_CUDA_NVCC_ARGS: --fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75
NEPTUNE_CUDA_NVCC_ARGS: --fatbin --gpu-architecture=sm_75 --generate-code=arch=compute_75,code=sm_75
@@ -27,7 +30,9 @@ jobs:
steps:
- uses: actions/checkout@v4
- name: Install required packages
run: sudo apt install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
run: |
sudo apt-get update
sudo apt-get install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
- name: Install cargo clippy
run: rustup component add clippy
- name: Run cargo clippy
@@ -44,13 +49,29 @@ jobs:
run: cargo fmt --all -- --check

test:
runs-on: ubuntu-24.04
runs-on: ['self-hosted', 'linux', 'x64', '2xlarge+gpu']
name: Test
steps:
- uses: actions/checkout@v4
# TODO: Move the driver installation to the AMI.
# https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/install-nvidia-driver.html
# https://www.nvidia.com/en-us/drivers/
- name: Install CUDA drivers
run: |
curl -L -o nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb https://us.download.nvidia.com/tesla/570.148.08/nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
Copilot AI Aug 10, 2025

Downloading executable content without integrity verification is a security risk. Consider adding SHA256 checksum verification after the curl command to ensure the downloaded file hasn't been tampered with.

Suggested change
curl -L -o nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb https://us.download.nvidia.com/tesla/570.148.08/nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
curl -L -o nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb https://us.download.nvidia.com/tesla/570.148.08/nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
# Verify SHA256 checksum
echo "b1e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2e2 nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb" > nvidia-driver.sha256
sha256sum -c nvidia-driver.sha256
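
A minimal sketch of the verification step suggested above, assuming the expected digest is obtained from a trusted source and pinned in the workflow. `verify_sha256` is a hypothetical helper name, not something in this PR, and the digest in the suggestion looks like a placeholder that would need to be replaced with the real value.

```shell
# Hypothetical helper (not part of this PR): succeeds only if FILE matches
# the expected SHA-256 digest.
#   $1 = path to the downloaded file
#   $2 = expected digest as a hex string
verify_sha256() {
  # sha256sum --check reads "<digest>  <filename>" lines from stdin;
  # --status suppresses output so only the exit code reports the result.
  printf '%s  %s\n' "$2" "$1" | sha256sum --check --status
}

# Usage inside a `run:` block (EXPECTED_SHA256 is an assumed variable that
# holds the pinned digest):
#   verify_sha256 "$deb_file" "$EXPECTED_SHA256" || exit 1
```

Failing the step on a mismatch keeps `dpkg -i` from ever running on an unexpected file.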

sudo dpkg -i nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
sudo cp /var/nvidia-driver-local-repo-ubuntu2404-570.148.08/nvidia-driver-local-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get install --no-install-recommends --yes cuda-drivers
rm nvidia-driver-local-repo-ubuntu2404-570.148.08_1.0-1_amd64.deb
- name: Install required packages
run: sudo apt install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
# In case no GPUs are available, it's using the CPU fallback.
run: |
sudo apt-get update
sudo apt-get install --no-install-recommends --yes libhwloc-dev nvidia-cuda-toolkit ocl-icd-opencl-dev
# TODO: Remove this and other rust installation directives from jobs running
Copilot AI Aug 10, 2025

The TODO comment is incomplete and unclear. It should specify what should be removed and why, or what the complete sentence should be.

Suggested change
# TODO: Remove this and other rust installation directives from jobs running
# TODO: Once the AMI includes the Rust toolchain, remove this step and any other Rust installation directives from CI jobs to avoid redundant installations and speed up workflow execution.

Copilot AI Aug 10, 2025

[nitpick] Using a specific commit SHA for the action is good for security, but consider adding a comment explaining why this specific version is pinned, especially since the TODO above mentions removing rust installation directives.

Suggested change
# TODO: Remove this and other rust installation directives from jobs running
# TODO: Remove this and other rust installation directives from jobs running
# Pinned to a specific commit SHA for security and reproducibility.
# This version was chosen to ensure compatibility with the workflow; update only after verifying changes.

- uses: dtolnay/rust-toolchain@21dc36fb71dd22e3317045c0c31a3f4249868b17
with:
toolchain: 1.83
Comment on lines +72 to +74
Member

this kind of sucks that we can't just use the rust-toolchain file for versioning, but note from https://github.com/dtolnay/rust-toolchain?tab=readme-ov-file#inputs about versioning:

Rustup toolchain specifier e.g. stable, nightly, 1.42.0, nightly-2022-01-01. Important: the default is to match the @rev as described above. When passing an explicit toolchain as an input instead of @rev, you'll want to use "dtolnay/rust-toolchain@master" as the revision of the action.

i.e. it wants you to use dtolnay/[email protected] instead.

(I also notice other people are annoyed by this gap).
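
One possible way to close that gap, sketched under assumptions: read the pinned channel from the repository's toolchain file in a `run:` step and hand it to `rustup`. `toolchain_channel` is a hypothetical helper, and the file name `rust-toolchain.toml` in the usage note is an assumption about this repository. The sketch only handles a double-quoted `channel = "..."` entry or the legacy single-line file format.

```shell
# Hypothetical helper (not part of this PR): print the toolchain channel
# pinned in a rust-toolchain or rust-toolchain.toml file.
#   $1 = path to the toolchain file
toolchain_channel() {
  if grep -q '^[[:space:]]*channel[[:space:]]*=' "$1"; then
    # TOML form, e.g.: channel = "1.83"
    sed -n 's/^[[:space:]]*channel[[:space:]]*=[[:space:]]*"\([^"]*\)".*/\1/p' "$1" | head -n 1
  else
    # Legacy form: the file contains only the toolchain name.
    head -n 1 "$1" | tr -d '[:space:]'
  fi
}

# Usage inside a `run:` block (file name is an assumption):
#   rustup toolchain install "$(toolchain_channel rust-toolchain.toml)"
```

This keeps the version in one place at the cost of a small parsing step in CI.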

- name: Test
run: cargo test --verbose
