Train machine learning models using a `GitOps` approach with GitHub Actions on a self-hosted runner.


ISQ SKYNET

Welcome to ISQ Skynet, an iSquared Solutions project dedicated to training machine learning models using a GitOps approach with GitHub Actions on a self-hosted runner.

[Figure: fsc skynet gitops overview]

In this case the self-hosted runner is a desktop running Ubuntu 20.04 with a GTX 1070 GPU. The runner is configured to use the nvidia-docker runtime to enable GPU support for Docker containers.

Configure your own self-hosted runner

To configure your own self-hosted runner, configure a machine with a GPU to use the nvidia-docker runtime.

1) Install NVIDIA drivers and nvidia-docker on your machine (Ubuntu 20.04 in this example):

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && \
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu20.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list && \
sudo apt update && sudo apt install -y ubuntu-drivers-common  && \
sudo ubuntu-drivers autoinstall  && \
sudo apt install -y nvidia-container-toolkit && \
sudo systemctl restart docker

2) Test that your GPUs are up and running with the following command:

docker run --gpus all iterativeai/cml:0-dvc2-base1-gpu nvidia-smi

You should see output similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   44C    P5     8W / 151W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3) Start your self-hosted runner

The base iterativeai/cml:0-dvc2-base1-gpu image comes with Python 3.8. However, the peft module, along with other dependencies of the train.py workflow, requires Python 3.9. The following custom Dockerfile adds Python 3.9 to the base image.

Dockerfile

# Use iterativeai/cml:0-dvc2-base1-gpu as base image
FROM iterativeai/cml:0-dvc2-base1-gpu

# Install Python 3.9
RUN apt-get update && apt-get install -y \
    software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update \
    && apt-get install -y python3.9 \
    && apt-get install -y python3.9-distutils \
    && apt-get install -y python3.9-venv \
    && rm -rf /var/lib/apt/lists/*

# Make Python 3.9 the default version
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1

# Verify that Python 3.9 is installed
RUN python3 --version
Build the custom image:

docker build -t skynet-llama-qlora .

Run the self-hosted runner container:

docker run -d \
  --name skynet_llama_qlora \
  --gpus all \
  -e RUNNER_IDLE_TIMEOUT=259200 \
  -e RUNNER_LABELS=cml,gpu,llama,qlora \
  -e RUNNER_REPO="https://github.com/FlipsideCrypto/skynet.git" \
  -e repo_token=<REDACTED> \
  skynet-llama-qlora runner --log debug --driver github
Docker environment variables

RUNNER_IDLE_TIMEOUT is the maximum time in seconds the runner will sit idle waiting for jobs to arrive; if none arrive, the runner shuts down and unregisters from your repo. We set it to 3 days (259200 seconds).

RUNNER_LABELS is a comma-delimited list of labels; jobs in our workflow target runners carrying these labels.

RUNNER_REPO is the URL of your GitHub or GitLab repo. repo_token is the personal access token generated for your GitHub or GitLab repo. Note that for GitHub the token must have the workflow scope along with repo.
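As a quick sanity check of the RUNNER_IDLE_TIMEOUT figure above, 3 days expressed in seconds:

```python
# Idle timeout of 3 days, expressed in seconds for RUNNER_IDLE_TIMEOUT
idle_timeout = 3 * 24 * 60 * 60  # days * hours/day * minutes/hour * seconds/minute
print(idle_timeout)  # 259200
```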

Once you have the Docker container up and running, you can check the status of the runner in your repository under Settings -> Actions -> Runners.

[Screenshot: actions runner]

Caveats for self-hosted runners
  • GitHub Actions can’t run a workflow longer than 72 hours.
  • The GTX 1070 only has 8GB of VRAM, which limits the size of the models that can be trained on the GPU; quantization techniques help reduce the memory footprint of the model. We use bitsandbytes with the QLoRA method via the peft module to quantize Llama v2.
  • In addition to quantizing Llama v2, we also need to ensure maximum utilization of the GPU.
Remove Unnecessary Processes from the GPU:

Xorg and GNOME Shell: these are related to the graphical user interface. If you're comfortable working without a GUI, you can stop the display manager to free up some GPU memory; however, this will terminate your graphical session.

sudo systemctl stop display-manager
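To see why quantization matters on an 8GB card, here is a rough back-of-envelope sketch of the VRAM needed just to hold model weights at different precisions (the 7B parameter count is an assumption about the Llama v2 variant being trained; optimizer state and activations add further memory on top of this):

```python
# Back-of-envelope VRAM for model weights alone, ignoring optimizer
# state, gradients, and activations (which add substantially more).
PARAMS = 7e9  # assumed 7B-parameter Llama v2 variant

def weight_gib(bytes_per_param: float) -> float:
    """GiB required to hold the weights at a given precision."""
    return PARAMS * bytes_per_param / 2**30

fp16 = weight_gib(2)        # ~13 GiB -- already exceeds the GTX 1070's 8 GiB
four_bit = weight_gib(0.5)  # ~3.3 GiB -- leaves headroom for LoRA adapters
print(f"fp16: {fp16:.1f} GiB, 4-bit: {four_bit:.1f} GiB")
```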

Key Files In The Repository

├── Dockerfile
├── README.md
├── requirements.txt
├── trainer.py
└── train.py

Overview

  • Dockerfile: Custom Dockerfile to add Python 3.9 to the iterativeai/cml:0-dvc2-base1-gpu image.
  • requirements.txt: List of Python dependencies required for the training workflow.
  • trainer.py: Script to train the model and collect GPU metrics in parallel.
  • train.py: Main training script that initializes and trains the model.

Training Workflow

trainer.py manages the training process: it invokes train.py, which sets up the training environment, while collecting GPU metrics in parallel. The integration with GitHub Actions ensures that the training process is automated and that results are reported back to the repository.
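The actual polling logic lives in trainer.py; as a rough sketch of the pattern (function names here are illustrative, not the repo's API), training runs in the main thread while a background thread samples GPU metrics at a fixed interval:

```python
import threading
import time

def collect_while_running(train_fn, sample_fn, interval=1.0):
    """Run train_fn to completion while calling sample_fn every `interval`
    seconds in a background thread; return (train_result, samples)."""
    samples = []
    done = threading.Event()

    def poll():
        while not done.is_set():
            samples.append(sample_fn())
            done.wait(interval)  # wakes early once training finishes

    t = threading.Thread(target=poll, daemon=True)
    t.start()
    try:
        result = train_fn()
    finally:
        done.set()
        t.join()
    return result, samples

# Stand-in training step; in the real workflow sample_fn would shell out to
# nvidia-smi to read GPU utilization and memory usage.
result, metrics = collect_while_running(
    train_fn=lambda: time.sleep(0.3) or "trained",
    sample_fn=lambda: 0,  # placeholder metric
    interval=0.1,
)
print(result, len(metrics))
```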

GitHub Actions Workflow

The llama_qlora_cml.yaml GitHub Actions workflow is triggered on every push. It runs on a self-hosted runner with the labels cml, gpu, llama, and qlora.
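The workflow file itself is not reproduced in this README; a minimal sketch of what such a workflow might look like is below. Only the push trigger, the runner labels, and the script names come from this repo; the job name and individual steps are illustrative assumptions:

```yaml
name: llama_qlora_cml
on: [push]

jobs:
  train:
    # Target the self-hosted runner registered with these labels
    runs-on: [self-hosted, cml, gpu, llama, qlora]
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train
        run: python trainer.py
```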
