
Install CUDA Toolkit 12.8

Installation link: https://developer.nvidia.com/cuda-12-8-0-download-archive

Base Installer

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2404/x86_64/cuda-ubuntu2404.pin
sudo mv cuda-ubuntu2404.pin /etc/apt/preferences.d/cuda-repository-pin-600
wget https://developer.download.nvidia.com/compute/cuda/12.8.0/local_installers/cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo dpkg -i cuda-repo-ubuntu2404-12-8-local_12.8.0-570.86.10-1_amd64.deb
sudo cp /var/cuda-repo-ubuntu2404-12-8-local/cuda-*-keyring.gpg /usr/share/keyrings/
sudo apt-get update
sudo apt-get -y install cuda-toolkit-12-8
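As a quick sanity check, nvcc should report release 12.8 once the toolkit is installed. A small guarded sketch (assuming the toolkit's default /usr/local/cuda layout, and falling back gracefully if the "Add to Path" step has not been done yet):

```shell
# Sanity check: nvcc should report "release 12.8" once the toolkit is installed
# and on the PATH; otherwise try the default install location.
if command -v nvcc >/dev/null 2>&1; then
  nvcc --version | grep release
elif [ -x /usr/local/cuda/bin/nvcc ]; then
  /usr/local/cuda/bin/nvcc --version | grep release
else
  echo "nvcc not found -- toolkit not installed or not on PATH yet"
fi
```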

Driver Installer

sudo apt-get install -y cuda-drivers

Add to Path

CUDA_PATH=/usr/local/cuda
CUDA_PATH_LINE="export PATH=\$CUDA_PATH/bin:\$PATH"
CUDA_LD_LIBRARY_PATH_LINE="export LD_LIBRARY_PATH=\$CUDA_PATH/lib64:\$LD_LIBRARY_PATH"
echo "CUDA_PATH=${CUDA_PATH}" >> ~/.bashrc
echo "$CUDA_PATH_LINE" >> ~/.bashrc
echo "$CUDA_LD_LIBRARY_PATH_LINE" >> ~/.bashrc
source ~/.bashrc
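Note that the snippet above appends to ~/.bashrc unconditionally, so re-running it adds duplicate lines. A minimal idempotent variant (same paths; `add_line` is a hypothetical helper name, not part of any tool):

```shell
# Append each export to ~/.bashrc only if an identical line is not already there.
RC="$HOME/.bashrc"
add_line() {
  grep -qxF "$1" "$RC" 2>/dev/null || echo "$1" >> "$RC"
}
add_line 'export CUDA_PATH=/usr/local/cuda'
add_line 'export PATH=$CUDA_PATH/bin:$PATH'
add_line 'export LD_LIBRARY_PATH=$CUDA_PATH/lib64:$LD_LIBRARY_PATH'
```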

Create a Kubernetes Cluster using Containerd Runtime

Source: Step-by-Step Guide to Creating a Kubernetes Cluster on Ubuntu 22.04 Using Containerd Runtime

CP: apply on the control-plane node only
W: apply on the worker node only
CP-W: apply on both control-plane and worker nodes

1. Disable Ubuntu Firewall/ CP-W

sudo ufw disable

Verify the status with:

sudo ufw status

2. Update and upgrade the system/ CP-W

sudo apt update
sudo apt -y full-upgrade

3. Enable time-sync with an NTP server/ CP-W

sudo apt install systemd-timesyncd
sudo timedatectl set-ntp true

Check status with

sudo timedatectl status

and verify that the NTP service is marked as active: "NTP service: active"

4. Turn off the swap/ CP-W

sudo swapoff -a
sudo sed -i.bak -r 's/(.+ swap .+)/#\1/' /etc/fstab

Check the status with the free -m command.

free -m

Swap values should now read 0. Check the fstab file as well; otherwise swap will be turned on again automatically on reboot.

grep swap /etc/fstab

The 'swapfile' entry should be commented out.
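The same check can be scripted: /proc/meminfo reports SwapTotal in kB, and it should read 0 once swap is fully off. A small sketch, no root required:

```shell
# SwapTotal from /proc/meminfo is 0 kB when swap is fully disabled.
swap_kb=$(awk '/^SwapTotal:/ {print $2}' /proc/meminfo)
if [ "$swap_kb" -eq 0 ]; then
  echo "swap is off"
else
  echo "swap still reports ${swap_kb} kB -- re-check /etc/fstab"
fi
```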

5. Configure required kernel modules/ CP-W

sudo nano /etc/modules-load.d/k8s.conf

Add the following content to the file, save and close it

overlay
br_netfilter

Load the modules above into the current session

sudo modprobe overlay
sudo modprobe br_netfilter

Check the status

lsmod | grep "overlay\|br_netfilter"
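For a scripted check, /proc/modules can be read without root (note: modules built directly into the kernel do not appear there, so a miss here is not always fatal):

```shell
# Report load status for each required module by reading /proc/modules.
for m in overlay br_netfilter; do
  if grep -qw "^${m}" /proc/modules; then
    echo "${m}: loaded"
  else
    echo "${m}: NOT loaded"
  fi
done
```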

6. Configure network parameters/ CP-W

sudo nano /etc/sysctl.d/k8s.conf

Add the following content to the file, save and close it

net.bridge.bridge-nf-call-ip6tables = 1
net.bridge.bridge-nf-call-iptables = 1
net.ipv4.ip_forward = 1

Apply the newly added network params

sudo sysctl --system
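To confirm the values took effect, the live settings can be read back from /proc/sys without root (the bridge keys only exist once br_netfilter is loaded, hence the fallback):

```shell
# Each value should print 1 once the new settings are applied.
cat /proc/sys/net/ipv4/ip_forward
cat /proc/sys/net/bridge/bridge-nf-call-iptables 2>/dev/null \
  || echo "bridge sysctls absent -- load br_netfilter first"
```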

7. Install necessary software tools to continue/ CP-W

sudo apt-get install -y apt-transport-https ca-certificates curl gpg gnupg2 software-properties-common

Now both nodes are ready to install the Kubernetes tools and runtime.

8. Install Kubernetes Tools/ CP-W

8.1 Add Kubernetes repository and keys

First, check whether the /etc/apt/keyrings directory is present on your nodes. If not, create the directory using the command below

sudo mkdir -p -m 755 /etc/apt/keyrings

8.2 Download and add the k8s repository key

curl -fsSL https://pkgs.k8s.io/core:/stable:/v1.32/deb/Release.key | sudo gpg --dearmor -o /etc/apt/keyrings/kubernetes-apt-keyring.gpg

8.3 Add the Kubernetes repository in the source list

echo 'deb [signed-by=/etc/apt/keyrings/kubernetes-apt-keyring.gpg] https://pkgs.k8s.io/core:/stable:/v1.32/deb/ /' | sudo tee /etc/apt/sources.list.d/kubernetes.list

8.4 Update the package manager and install Kubernetes tools

sudo apt update
sudo apt-get install -y kubelet kubeadm kubectl
sudo apt-mark hold kubelet kubeadm kubectl
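You can confirm the hold took effect with apt-mark showhold; a guarded sketch that degrades gracefully on systems without apt:

```shell
# All three packages should be listed as held so routine upgrades skip them.
if command -v apt-mark >/dev/null 2>&1; then
  apt-mark showhold | grep -E '^(kubelet|kubeadm|kubectl)$' \
    || echo "kubernetes packages are not held yet"
else
  echo "apt-mark not available on this system"
fi
```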

9. Install containerd runtime/ CP-W

9.1 Add the containerd repository key and add the repository to the source list

curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo gpg --dearmor -o /etc/apt/keyrings/docker.gpg
sudo chmod a+r /etc/apt/keyrings/docker.gpg
echo "deb [arch="$(dpkg --print-architecture)" signed-by=/etc/apt/keyrings/docker.gpg] https://download.docker.com/linux/ubuntu "$(. /etc/os-release && echo "$VERSION_CODENAME")" stable" | sudo tee /etc/apt/sources.list.d/docker.list > /dev/null

9.2 Install containerd

sudo apt update
sudo apt install -y containerd.io

9.3 Configure containerd

sudo mkdir -p /etc/containerd

Generate the default config toml file

sudo containerd config default | sudo tee /etc/containerd/config.toml

Open the generated file in any text editor and verify the following settings:

[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc]
  runtime_type = "io.containerd.runc.v2"  # <- note that this line might have been missed
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.runc.options]
    SystemdCgroup = true # <- note that this could be set as false in the default configuration, please set it to true

Additionally, at the beginning of this config file, verify that "disabled_plugins" is set to an empty list ([]); if "cri" appears in that list, remove it, since Kubernetes needs the CRI plugin enabled. Save the config file, close it, and restart the containerd service

sudo systemctl restart containerd
sudo systemctl enable containerd
systemctl status containerd

9.4 Set up crictl for inspecting containers

sudo apt install cri-tools
sudo nano /etc/crictl.yaml

Add the following content to the file, save and exit

runtime-endpoint: unix:///run/containerd/containerd.sock
image-endpoint: unix:///run/containerd/containerd.sock
timeout: 2
debug: true
pull-image-on-create: false

9.5 Enable kubelet service

sudo systemctl enable kubelet

10. Initialising the Control-Plane Node/ CP

10.1 Pull necessary Kubernetes images

sudo kubeadm config images pull --cri-socket unix:///var/run/containerd/containerd.sock

10.2 Initialise the Control-Plane

sudo kubeadm init \
  --pod-network-cidr=10.244.0.0/16 \
  --cri-socket unix:///var/run/containerd/containerd.sock \
  --v=5

Do not skip this step!

 mkdir -p $HOME/.kube
 sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
 sudo chown $(id -u):$(id -g) $HOME/.kube/config

Optional: Remove taints from the node

kubectl taint nodes --all node-role.kubernetes.io/control-plane-

This enables us to schedule workloads on the node, even though it's part of the control-plane.

10.3 Add network add-on

kubectl apply -f infrastructure/install/kube-flannel.yaml

11. Join the Worker Node to the Kubernetes Control-Plane

11.1 Generate the join command/ CP

kubeadm token create --print-join-command

11.2 Run the generated command on the Worker Node/ W

11.3 Verify using the following command/ CP

kubectl get nodes -o wide

Install the NVIDIA Container Toolkit

curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit

Install K8s Device Plugin

1. Preparing GPU Nodes

This must be done on all GPU nodes in the cluster!

sudo nvidia-ctk runtime configure --runtime=containerd --set-as-default
sudo systemctl restart containerd

2. Enabling GPU Support in Kubernetes

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.17.0/deployments/static/nvidia-device-plugin.yml

3. Verify by running a GPU job

kubectl apply -f infrastructure/install/gpu-pod.yaml
kubectl logs gpu-pod

Expected result:

[Vector addition of 50000 elements]
Copy input data from the host memory to the CUDA device
CUDA kernel launch with 196 blocks of 256 threads
Copy output data from the CUDA device to the host memory
Test PASSED
Done

Install Helm

curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
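Verify the install (guarded, in case helm did not land on the PATH):

```shell
# helm version --short prints the installed client version, e.g. v3.x.y.
if command -v helm >/dev/null 2>&1; then
  helm version --short
else
  echo "helm not found on PATH"
fi
```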

Install KEDA

1. Add Helm repo

helm repo add kedacore https://kedacore.github.io/charts

2. Update Helm repo

helm repo update

3. Install the KEDA Helm chart

helm install keda kedacore/keda --namespace keda --create-namespace