Welcome to ISQ Skynet, an iSquared Solutions project dedicated to training machine learning models using a GitOps approach with GitHub Actions on a self-hosted runner.

In this case the self-hosted runner is a desktop running Ubuntu 20.04 with a GTX 1070 GPU. The runner is configured to use the nvidia-docker runtime to enable GPU support for Docker containers.

To configure your own self-hosted runner, set up a machine with a GPU to use the nvidia-docker runtime:
```bash
curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && \
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu20.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list && \
sudo apt update && sudo apt install -y ubuntu-drivers-common && \
sudo ubuntu-drivers autoinstall && \
sudo apt install -y nvidia-container-toolkit && \
sudo systemctl restart docker
```

Verify that containers can see the GPU:

```bash
docker run --gpus all iterativeai/cml:0-dvc2-base1-gpu nvidia-smi
```
You should see output similar to the following:
```
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   44C    P5     8W / 151W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
```

The base iterativeai/cml:0-dvc2-base1-gpu image comes with Python 3.8. However, the peft module, along with other dependencies of the train.py workflow, requires Python 3.9. The following custom Dockerfile adds Python 3.9 to the base image.
```dockerfile
# Use iterativeai/cml:0-dvc2-base1-gpu as base image
FROM iterativeai/cml:0-dvc2-base1-gpu

# Install Python 3.9
RUN apt-get update && apt-get install -y \
    software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update \
    && apt-get install -y python3.9 python3.9-distutils python3.9-venv \
    && rm -rf /var/lib/apt/lists/*

# Make Python 3.9 the default version
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1

# Verify that Python 3.9 is installed
RUN python3 --version
```

Build the image:

```bash
docker build -t skynet-llama-qlora .
```

Then start the runner container:

```bash
docker run -d \
  --name skynet_llama_qlora \
  --gpus all \
  -e RUNNER_IDLE_TIMEOUT=259200 \
  -e RUNNER_LABELS=cml,gpu,llama,qlora \
  -e RUNNER_REPO="https://github.com/FlipsideCrypto/skynet.git" \
  -e repo_token=<REDACTED> \
  skynet-llama-qlora runner --log debug --driver github
```

- RUNNER_IDLE_TIMEOUT is the maximum time in seconds the runner will sit idle waiting for jobs to arrive; if none arrive, the runner shuts down and unregisters from your repo. We set it to 3 days (259200 seconds).
- RUNNER_LABELS is a comma-delimited list of labels that jobs in our workflow will wait for.
- RUNNER_REPO is the URL of your GitHub or GitLab repo.
- repo_token is the personal access token generated for your GitHub or GitLab repo. Note that for GitHub you must check workflow along with repo.
Once you have the Docker container up and running, you can check the status of the runner in your repository under Settings -> Actions -> Runners.
- GitHub Actions can't run a workflow longer than 72 hours.
- The GTX 1070 only has 8GB of VRAM. This limits the size of the models that can be trained on the GPU; quantization techniques can help reduce the memory footprint of the model. We use bitsandbytes along with the QLoRA method via the peft module to quantize Llama v2.
- In addition to quantizing Llama v2, we also need to ensure maximum utilization of the GPU.
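A quick back-of-envelope calculation (weights only; activations, optimizer state, and CUDA overhead add more) shows why 4-bit quantization matters on this card:

```python
# Rough VRAM estimate for holding a 7B-parameter model's weights at
# different precisions. Illustrative arithmetic only, not measured usage.
PARAMS = 7_000_000_000

def vram_gib(num_params: int, bits_per_param: float) -> float:
    """Approximate GiB needed just to store the weights."""
    return num_params * bits_per_param / 8 / 2**30

fp16 = vram_gib(PARAMS, 16)  # ~13.0 GiB: does not fit in 8 GiB
int8 = vram_gib(PARAMS, 8)   # ~6.5 GiB: tight
nf4 = vram_gib(PARAMS, 4)    # ~3.3 GiB: leaves room for LoRA adapters

print(f"fp16: {fp16:.1f} GiB, int8: {int8:.1f} GiB, 4-bit: {nf4:.1f} GiB")
```

With the base weights frozen in 4-bit, only the small LoRA adapter weights are trained, which is what makes QLoRA feasible on the 1070.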
Xorg and GNOME Shell are related to the graphical user interface. If you're comfortable working without a GUI, you can stop the display manager to free up some GPU memory. However, this will terminate your graphical session.

```bash
sudo systemctl stop display-manager
```

```
├── Dockerfile
├── README.md
├── requirements.txt
├── trainer.py
└── train.py
```

- Dockerfile: Custom Dockerfile that adds Python 3.9 to the iterativeai/cml:0-dvc2-base1-gpu image.
- requirements.txt: List of Python dependencies required for the training workflow.
- trainer.py: Script to train the model and collect GPU metrics in parallel.
- train.py: Main training script that initializes and trains the model.
trainer.py manages the training process: it collects GPU metrics in parallel while invoking train.py, which sets up the training environment. The integration with GitHub Actions ensures that the training process is automated and results are reported back to the repository.
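As an illustration of the metrics side, GPU utilization can be sampled in a background thread with nvidia-smi's CSV query mode. The helpers below are a hypothetical sketch, not the actual trainer.py code:

```python
import subprocess
import threading

def parse_smi_line(line: str) -> tuple:
    """Parse one CSV line from `nvidia-smi --query-gpu=utilization.gpu,memory.used
    --format=csv,noheader`, e.g. '37 %, 2048 MiB' -> (37, 2048)."""
    util, mem = line.split(",")
    return int(util.strip().split()[0]), int(mem.strip().split()[0])

def sample_gpu(stop: threading.Event, samples: list, interval: float = 5.0) -> None:
    """Append (utilization %, memory MiB) tuples to `samples` until `stop` is set."""
    while not stop.is_set():
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=utilization.gpu,memory.used",
             "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        ).stdout.strip()
        samples.append(parse_smi_line(out.splitlines()[0]))
        stop.wait(interval)

# Usage sketch: sample in the background while training runs, then report.
# stop, samples = threading.Event(), []
# t = threading.Thread(target=sample_gpu, args=(stop, samples)); t.start()
# subprocess.run(["python3", "train.py"], check=True)
# stop.set(); t.join()
```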
The llama_qlora_cml.yaml GitHub Actions workflow is triggered on every push. It runs on a self-hosted GitHub Actions runner with the labels cml, gpu, llama, and qlora.
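A minimal sketch of what such a workflow could look like; the job name and step commands here are illustrative, not the exact contents of llama_qlora_cml.yaml:

```yaml
# Hypothetical sketch of a workflow targeting the labeled self-hosted runner.
name: llama-qlora
on: [push]
jobs:
  train:
    runs-on: [self-hosted, cml, gpu, llama, qlora]
    timeout-minutes: 4320  # stay within GitHub's 72-hour limit
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train
        run: python3 trainer.py
```

The `runs-on` label list is what ties jobs to the runner started with RUNNER_LABELS=cml,gpu,llama,qlora above.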

