Kubernetes GPU Cluster Setup for Debian 12

This repository contains automated setup scripts for configuring Debian 12 servers with NVIDIA GPU support for Kubernetes clusters.

Prerequisites

  • Debian 12 (Bookworm) fresh installation
  • NVIDIA GPU hardware
  • Root or sudo access
  • Internet connectivity
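
The prerequisites above can be checked before running any of the setup scripts. The following is a minimal sketch, assuming the standard /etc/os-release format and that lspci (from the pciutils package) is installed; the helper names is_debian12 and has_nvidia_gpu are illustrative and not part of this repository.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeeds if the given os-release file describes Debian 12.
is_debian12() {
  local file="${1:-/etc/os-release}"
  # Source the file in a subshell so ID and VERSION_ID do not leak to the caller.
  ( . "$file" && [ "$ID" = "debian" ] && [ "$VERSION_ID" = "12" ] )
}

# Hypothetical helper: succeeds if lspci is available and lists an NVIDIA device.
has_nvidia_gpu() {
  command -v lspci >/dev/null 2>&1 && lspci | grep -qi nvidia
}
```

Running is_debian12 && has_nvidia_gpu before the installation gives an early, cheap failure on unsupported hosts.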

What Gets Installed

  1. NVIDIA Drivers - Proprietary NVIDIA drivers as packaged in the Debian non-free repository
  2. containerd - Container runtime configured for Kubernetes
  3. NVIDIA Container Toolkit - GPU support for containers
  4. Kubernetes Tools - kubectl, kubeadm, and kubelet (v1.31)

Quick Start

Option 1: Run All Setup Scripts at Once

# Clone the repository
git clone https://github.com/tzhukov/k8s_gpu_nvidia_debian.git
cd k8s_gpu_nvidia_debian

# Make scripts executable
chmod +x *.sh

# Run complete setup (requires root)
sudo ./install-all.sh

# Reboot the system
sudo reboot

Option 2: Run Scripts Individually

If you prefer to run scripts step by step or only need specific components:

# 1. Install NVIDIA Drivers
sudo ./setup-nvidia-drivers.sh

# 2. Setup containerd
sudo ./setup-containerd.sh

# 3. Install NVIDIA Container Toolkit
sudo ./setup-nvidia-container-toolkit.sh

# 4. Install Kubernetes packages
sudo ./setup-kubernetes.sh

# Reboot after all installations
sudo reboot

Post-Installation Verification

After rebooting, verify your installation:

Check NVIDIA Drivers

nvidia-smi

Check containerd

sudo systemctl status containerd

Check Kubernetes Tools

kubectl version --client
kubeadm version
kubelet --version

Test GPU in Container

sudo ctr image pull docker.io/nvidia/cuda:12.3.0-base-ubuntu22.04
sudo ctr run --rm --gpus 0 docker.io/nvidia/cuda:12.3.0-base-ubuntu22.04 test nvidia-smi
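
The individual checks above can be gathered into a single pass. This is a sketch that only tests whether each binary is on PATH; it does not replace running nvidia-smi or the container test. The helper names check_cmd and check_all are hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: report whether a command is available on PATH.
check_cmd() {
  if command -v "$1" >/dev/null 2>&1; then
    echo "ok: $1"
  else
    echo "missing: $1"
    return 1
  fi
}

# Check every tool used in this README; returns non-zero if any is missing.
check_all() {
  local rc=0 cmd
  for cmd in nvidia-smi containerd ctr kubectl kubeadm kubelet; do
    check_cmd "$cmd" || rc=1
  done
  return "$rc"
}
```

Calling check_all prints one line per tool and returns a non-zero status if anything is missing, which makes it usable in provisioning scripts.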

Initializing Kubernetes Cluster

Master Node Setup

# Initialize the cluster (adjust pod network CIDR as needed)
sudo kubeadm init --pod-network-cidr=10.244.0.0/16

# Configure kubectl for your user
mkdir -p $HOME/.kube
sudo cp -i /etc/kubernetes/admin.conf $HOME/.kube/config
sudo chown $(id -u):$(id -g) $HOME/.kube/config

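# Install a CNI plugin (example: Calico)
# This is a legacy manifest URL; if it no longer resolves, use the versioned
# manifest linked from the current Calico installation documentation.
kubectl apply -f https://docs.projectcalico.org/manifests/calico.yaml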

Worker Node Setup

On each worker node, run the join command printed at the end of kubeadm init on the master node:

sudo kubeadm join <master-ip>:6443 --token <token> --discovery-token-ca-cert-hash sha256:<hash>
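
If the kubeadm init output is no longer available, or the bootstrap token has expired (tokens last 24 hours by default), a fresh join command can be generated on the master node. A small sketch follows; the KUBEADM variable is a hypothetical override, added only so the function can be exercised with a stand-in binary.

```shell
#!/usr/bin/env bash
# Print a fresh "kubeadm join ..." line. Run as root on the master node.
# KUBEADM is a hypothetical override for testing; it defaults to kubeadm.
print_join_command() {
  "${KUBEADM:-kubeadm}" token create --print-join-command
}
```

The printed line can be pasted on a worker node as-is, prefixed with sudo.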

Enabling GPU Support in Kubernetes

Install the NVIDIA GPU device plugin:

kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.5/nvidia-device-plugin.yml

Verify that the nodes advertise GPUs (nodes without allocatable GPUs show <none>):

kubectl get nodes "-o=custom-columns=NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu"
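
For a quick cluster-wide total, the output of the command above can be piped through a small filter. This sketch assumes the two-column NAME/GPU layout shown above; count_gpus is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sum the GPU column of the custom-columns output, read from stdin.
# The header row is skipped; rows showing "<none>" count as zero.
count_gpus() {
  awk 'NR > 1 && $2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}
```

A result of 0 on a cluster with GPU nodes usually means the device plugin pods are not running yet or the container runtime is not exposing the GPUs.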

Testing GPU Workload

Save the following manifest as gpu-test.yaml to create a test pod that requests one GPU:

apiVersion: v1
kind: Pod
metadata:
  name: gpu-test
spec:
  containers:
  - name: cuda-container
    image: nvidia/cuda:12.3.0-base-ubuntu22.04
    command: ["nvidia-smi"]
    resources:
      limits:
        nvidia.com/gpu: 1
  restartPolicy: Never

Apply the manifest, then read the pod's output once it has completed:

kubectl apply -f gpu-test.yaml
kubectl logs gpu-test
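
kubectl logs returns an error while the pod is still pending or pulling its image, so it can help to wait for completion first. The sketch below uses a generic polling helper; retry and pod_succeeded are hypothetical names, and the attempt count and delay are arbitrary.

```shell
#!/usr/bin/env bash
# retry <attempts> <delay-seconds> <command...>
# Re-run the command until it succeeds or the attempts are used up.
retry() {
  local attempts="$1" delay="$2" i=1
  shift 2
  while [ "$i" -le "$attempts" ]; do
    if "$@"; then
      return 0
    fi
    i=$((i + 1))
    if [ "$i" -le "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Succeeds once the named pod has run to completion.
pod_succeeded() {
  [ "$(kubectl get pod "$1" -o jsonpath='{.status.phase}')" = "Succeeded" ]
}
```

For example, retry 30 5 pod_succeeded gpu-test && kubectl logs gpu-test polls for up to about two and a half minutes before printing the nvidia-smi output.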

Troubleshooting

NVIDIA Driver Issues

# Check if driver is loaded
lsmod | grep nvidia

# Reinstall if needed
sudo apt-get install --reinstall nvidia-driver

containerd Issues

# Check containerd logs
sudo journalctl -u containerd -f

# Restart containerd
sudo systemctl restart containerd

Kubernetes Issues

# Check kubelet status
sudo systemctl status kubelet

# View kubelet logs
sudo journalctl -u kubelet -f

# Reset cluster (WARNING: destroys cluster)
sudo kubeadm reset

Script Details

install-all.sh

Wrapper script that runs the four setup scripts below in dependency order: drivers, containerd, NVIDIA Container Toolkit, then Kubernetes packages.
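
The ordering logic can be pictured as a loop that stops at the first failure. This is a sketch of the idea only, not the contents of the repository's install-all.sh; run_in_order is a hypothetical name.

```shell
#!/usr/bin/env bash
# run_in_order <dir> <script...>
# Run each script from <dir> in the order given and stop at the first one
# that exits with a non-zero status.
run_in_order() {
  local dir="$1" script
  shift
  for script in "$@"; do
    echo "==> $script"
    if ! "$dir/$script"; then
      echo "FAILED: $script" >&2
      return 1
    fi
  done
}
```

From a root shell, run_in_order . setup-nvidia-drivers.sh setup-containerd.sh setup-nvidia-container-toolkit.sh setup-kubernetes.sh would mirror the sequence listed under Option 2.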

setup-nvidia-drivers.sh

  • Enables non-free repositories
  • Installs kernel headers and DKMS
  • Installs NVIDIA proprietary drivers
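
Whether the non-free component is already enabled can be checked before or after running the script. The sketch below assumes the classic one-line sources.list format; systems that use deb822 files under /etc/apt/sources.list.d need a different check. has_non_free is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeeds if any "deb" line in the given sources file lists the non-free
# component. "non-free-firmware" on its own does not count.
has_non_free() {
  awk '$1 == "deb" { for (i = 4; i <= NF; i++) if ($i == "non-free") found = 1 }
       END { exit !found }' "${1:-/etc/apt/sources.list}"
}
```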

setup-containerd.sh

  • Loads required kernel modules
  • Configures sysctl parameters
  • Installs and configures containerd with systemd cgroup driver
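
The cgroup driver setting is the item most often worth double-checking, because kubelet and containerd need to agree on it. A minimal sketch, assuming the configuration lives at containerd's default path /etc/containerd/config.toml; uses_systemd_cgroup is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeeds if the given containerd config enables the systemd cgroup driver.
uses_systemd_cgroup() {
  grep -Eq '^[[:space:]]*SystemdCgroup[[:space:]]*=[[:space:]]*true' \
    "${1:-/etc/containerd/config.toml}"
}
```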

setup-nvidia-container-toolkit.sh

  • Adds NVIDIA Container Toolkit repository
  • Installs nvidia-container-toolkit
  • Configures containerd for NVIDIA runtime
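
The result of the last step can be inspected in the containerd configuration. The sketch below assumes the toolkit wrote its runtime entry into the main config file rather than a drop-in file; both helper names are hypothetical. Pods that do not set a runtimeClassName, such as the test pod in this README, generally only receive the NVIDIA runtime when it is the default one, which is why the second check is included.

```shell
#!/usr/bin/env bash
# Succeeds if the given containerd config defines a runtime named "nvidia".
has_nvidia_runtime() {
  grep -Eq 'runtimes\.nvidia\]' "${1:-/etc/containerd/config.toml}"
}

# Succeeds if "nvidia" is configured as the default runtime.
nvidia_is_default() {
  grep -Eq '^[[:space:]]*default_runtime_name[[:space:]]*=[[:space:]]*"nvidia"' \
    "${1:-/etc/containerd/config.toml}"
}
```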

setup-kubernetes.sh

  • Disables swap
  • Adds Kubernetes repository (v1.31)
  • Installs kubectl, kubeadm, and kubelet
  • Holds packages to prevent automatic updates
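
Because kubelet by default refuses to start while swap is active, it is worth confirming that swap stayed off after the reboot. A sketch that reads the kernel's /proc/swaps table; swap_active is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Succeeds if the given file (in /proc/swaps format) lists any active swap.
# The first line is a header, so any further line is an active swap area.
swap_active() {
  awk 'NR > 1 { found = 1 } END { exit !found }' "${1:-/proc/swaps}"
}
```

If swap_active succeeds after a reboot, a swap entry is probably still present in /etc/fstab.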

System Requirements

  • OS: Debian 12 (Bookworm)
  • GPU: NVIDIA GPU supported by the Debian 12 nvidia-driver package
  • Memory: Minimum 2GB RAM (4GB+ recommended)
  • CPU: 2+ cores recommended
  • Disk: 20GB+ available space

Security Considerations

  • These scripts disable swap and modify kernel parameters
  • NVIDIA proprietary drivers are installed
  • Kubernetes packages are held at specific versions
  • Always review scripts before running with root privileges

Contributing

Contributions are welcome! Please feel free to submit issues or pull requests.

License

MIT License - See LICENSE file for details
