Train machine learning models using a `GitOps` approach with GitHub Actions on a self-hosted runner.


ISQ SKYNET

Welcome to ISQ Skynet, an iSquared Solutions project dedicated to training machine learning models using a GitOps approach with GitHub Actions on a self-hosted runner.

[Figure: fsc skynet gitops overview]

In this case the self-hosted runner is a desktop running Ubuntu 20.04 with a GTX 1070 GPU. The runner is configured to use the nvidia-docker runtime to enable GPU support for Docker containers.

Configure your own self-hosted runner

To configure your own self-hosted runner, configure a machine with a GPU to use the nvidia-docker runtime.

1) Install NVIDIA drivers and nvidia-docker on your machine (Ubuntu 20.04 in this example):

curl -s -L https://nvidia.github.io/nvidia-docker/gpgkey | sudo apt-key add - && \
curl -s -L https://nvidia.github.io/nvidia-docker/ubuntu20.04/nvidia-docker.list | sudo tee /etc/apt/sources.list.d/nvidia-docker.list && \
sudo apt update && sudo apt install -y ubuntu-drivers-common  && \
sudo ubuntu-drivers autoinstall  && \
sudo apt install -y nvidia-container-toolkit && \
sudo systemctl restart docker

2) Test that your GPUs are up and running with the following command:

docker run --gpus all iterativeai/cml:0-dvc2-base1-gpu nvidia-smi

You should see output similar to the following:

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.199.02   Driver Version: 470.199.02   CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:02:00.0 Off |                  N/A |
|  0%   44C    P5     8W / 151W |      0MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
                                                                               
+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+

3) Start your self-hosted runner

The base iterativeai/cml:0-dvc2-base1-gpu image comes with Python 3.8. However, the peft module, along with other dependencies of the train.py workflow, requires Python 3.9. The following custom Dockerfile adds Python 3.9 to the base image.

Dockerfile

# Use iterativeai/cml:0-dvc2-base1-gpu as base image
FROM iterativeai/cml:0-dvc2-base1-gpu

# Install Python 3.9
RUN apt-get update && apt-get install -y \
    software-properties-common \
    && add-apt-repository ppa:deadsnakes/ppa \
    && apt-get update \
    && apt-get install -y python3.9 \
    && apt-get install -y python3.9-distutils \
    && apt-get install -y python3.9-venv \
    && rm -rf /var/lib/apt/lists/*

# Make Python 3.9 the default version
RUN update-alternatives --install /usr/bin/python3 python3 /usr/bin/python3.9 1

# Verify that Python 3.9 is installed
RUN python3 --version
Build the custom image:

docker build -t skynet-llama-qlora .

Run the self-hosted runner container:

docker run -d \
  --name skynet_llama_qlora \
  --gpus all \
  -e RUNNER_IDLE_TIMEOUT=259200 \
  -e RUNNER_LABELS=cml,gpu,llama,qlora \
  -e RUNNER_REPO="https://github.com/FlipsideCrypto/skynet.git" \
  -e repo_token=<REDACTED> \
  skynet-llama-qlora runner --log debug --driver github
Docker environment variables

RUNNER_IDLE_TIMEOUT is the maximum time in seconds the runner will sit idle waiting for jobs to arrive; if none arrive, the runner shuts down and unregisters from your repo. We set it to 3 days (259200 seconds).

RUNNER_LABELS is a comma-delimited list of labels; jobs in our workflow target runners carrying these labels.

RUNNER_REPO is the URL of your GitHub or GitLab repo. repo_token is the personal access token generated for your GitHub or GitLab repo. Note that for GitHub the token must have the workflow scope along with repo.
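As a quick sanity check of the RUNNER_IDLE_TIMEOUT figure above, 3 days expressed in seconds:

```python
# Idle timeout of 3 days, expressed in seconds for RUNNER_IDLE_TIMEOUT
idle_timeout = 3 * 24 * 60 * 60  # days * hours/day * minutes/hour * seconds/minute
print(idle_timeout)  # 259200
```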

Once you have the Docker container up and running, you can check the status of the runner in your repository under Settings -> Actions -> Runners.

[Screenshot: actions runner]

Caveats for self-hosted runners
  • GitHub Actions can’t run a workflow longer than 72 hours.
  • The GTX 1070 only has 8GB of VRAM, which limits the size of the models that can be trained on the GPU; quantization techniques help reduce the memory footprint of the model. We use bitsandbytes with the QLoRA method via the peft module to quantize Llama v2.
  • In addition to quantizing Llama v2, we also need to ensure maximum utilization of the GPU.
Remove Unnecessary Processes from the GPU:

Xorg and GNOME Shell: these are related to the graphical user interface. If you're comfortable working without a GUI, you can stop the display manager to free up some GPU memory; however, this will terminate your graphical session.

sudo systemctl stop display-manager
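To see why quantization matters on an 8GB card, here is a rough back-of-envelope sketch of the VRAM needed just to hold model weights at different precisions (the 7B parameter count is an assumption about the Llama v2 variant being trained; optimizer state and activations add further memory on top of this):

```python
# Back-of-envelope VRAM for model weights alone, ignoring optimizer
# state, gradients, and activations (which add substantially more).
PARAMS = 7e9  # assumed 7B-parameter Llama v2 variant

def weight_gib(bytes_per_param: float) -> float:
    """GiB required to hold the weights at a given precision."""
    return PARAMS * bytes_per_param / 2**30

fp16 = weight_gib(2)        # ~13 GiB -- already exceeds the GTX 1070's 8 GiB
four_bit = weight_gib(0.5)  # ~3.3 GiB -- leaves headroom for LoRA adapters
print(f"fp16: {fp16:.1f} GiB, 4-bit: {four_bit:.1f} GiB")
```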

Key Files In The Repository

├── Dockerfile
├── README.md
├── requirements.txt
├── trainer.py
└── train.py

Overview

  • Dockerfile: Custom Dockerfile to add Python 3.9 to the iterativeai/cml:0-dvc2-base1-gpu image.
  • requirements.txt: List of Python dependencies required for the training workflow.
  • trainer.py: Script to train the model and collect GPU metrics in parallel.
  • train.py: Main training script that initializes and trains the model.

Training Workflow

trainer.py manages the training process: it invokes train.py, which sets up the training environment, while collecting GPU metrics in parallel. The integration with GitHub Actions ensures that the training process is automated and that results are reported back to the repository.
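The actual polling logic lives in trainer.py; as a rough sketch of the pattern (function names here are illustrative, not the repo's API), training runs in the main thread while a background thread samples GPU metrics at a fixed interval:

```python
import threading
import time

def collect_while_running(train_fn, sample_fn, interval=1.0):
    """Run train_fn to completion while calling sample_fn every `interval`
    seconds in a background thread; return (train_result, samples)."""
    samples = []
    done = threading.Event()

    def poll():
        while not done.is_set():
            samples.append(sample_fn())
            done.wait(interval)  # wakes early once training finishes

    t = threading.Thread(target=poll, daemon=True)
    t.start()
    try:
        result = train_fn()
    finally:
        done.set()
        t.join()
    return result, samples

# Stand-in training step; in the real workflow sample_fn would shell out to
# nvidia-smi to read GPU utilization and memory usage.
result, metrics = collect_while_running(
    train_fn=lambda: time.sleep(0.3) or "trained",
    sample_fn=lambda: 0,  # placeholder metric
    interval=0.1,
)
print(result, len(metrics))
```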

GitHub Actions Workflow

The llama_qlora_cml.yaml GitHub Actions workflow is triggered on every push. It runs on a self-hosted runner with the labels cml, gpu, llama, and qlora.
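The workflow file itself is not reproduced in this README; a minimal sketch of what such a workflow might look like is below. Only the push trigger, the runner labels, and the script names come from this repo; the job name and individual steps are illustrative assumptions:

```yaml
name: llama_qlora_cml
on: [push]

jobs:
  train:
    # Target the self-hosted runner registered with these labels
    runs-on: [self-hosted, cml, gpu, llama, qlora]
    steps:
      - uses: actions/checkout@v3
      - name: Install dependencies
        run: pip install -r requirements.txt
      - name: Train
        run: python trainer.py
```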
