Hands-on AI/ML infrastructure on a single Debian server with Kubernetes and consumer GPUs. This repository documents everything I built, broke, and learned — from bare-metal GPU setup to training LoRA models and generating music with AI.
A learning lab for AI/ML on consumer hardware. Instead of cloud GPU rentals, everything runs on a single physical server: Kubernetes, multi-GPU management, model training pipelines, RL reasoning, and creative AI workflows. Each project is a self-contained module that documents the full journey — including the failures.
| Component | Spec |
|-----------|------|
| CPU | AMD Ryzen 9 5900X (12-core) |
| RAM | 16 GB DDR4 |
| Swap | 32 GB (btrfs swapfile) |
| Storage | 1.9 TB NVMe |
| GPU 1 | NVIDIA RTX 5090 — 32 GB VRAM |
| GPU 2 | NVIDIA RTX 3080 — 10 GB VRAM |
| GPU 3 | NVIDIA RTX 2070 SUPER — 8 GB VRAM |
| OS | Debian 13 (trixie) |
┌───────────────────────────────────────────────────────────────┐
│ Debian 13 — Kernel 6.12 — NVIDIA Driver 590.48 │
│ │
│ ┌─────────────────────────────────────────────────────────┐ │
│ │ Kubernetes v1.35.0 (kubeadm, single-node) │ │
│ │ CNI: Cilium 1.18.5 | GPU: NVIDIA Device Plugin │ │
│ │ │ │
│ │ ┌──────────────┐ ┌──────────────┐ ┌─────────────────┐ │ │
│ │ │ Training │ │ Inference │ │ Services │ │ │
│ │ │ │ │ │ │ │ │ │
│ │ │ - SFT/GRPO │ │ - llama-srv │ │ - ChatterBox │ │ │
│ │ │ - LoRA │ │ (Qwen3.5) │ │ - ComfyUI │ │ │
│ │ │ - MTP │ │ - Ollama │ │ - AI-Toolkit │ │ │
│ │ └──────────────┘ └──────────────┘ └─────────────────┘ │ │
│ └─────────────────────────────────────────────────────────┘ │
│ │
│ ┌─────────────────── GPU Pool (50 GB VRAM) ──────────────┐ │
│ │ │ │
│ │ ┌─────────────┐ ┌─────────────┐ ┌─────────────┐ │ │
│ │ │ RTX 5090 │ │ RTX 3080 │ │ RTX 2070S │ │ │
│ │ │ 32 GB │ │ 10 GB │ │ 8 GB │ │ │
│ │ └─────────────┘ └─────────────┘ └─────────────┘ │ │
│ │ │ │
│ │ GPUs are dynamically assigned to workloads via │ │
│ │ UUID pinning + NVIDIA Device Plugin. Typical configs: │ │
│ │ │ │
│ │ Training: 5090 (32GB) — SFT, GRPO, LoRA │ │
│ │ Inference: 5090 + 2070S (40GB) — llama-server │ │
│ │ Services: 3080 (10GB) — ChatterBox, Ollama │ │
│ │ Full pool: all 3 GPUs (50GB) — large model serving │ │
│ └─────────────────────────────────────────────────────────┘ │
└───────────────────────────────────────────────────────────────┘
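The UUID pinning described above can be sketched as a pod spec. This is a minimal illustration, not a manifest from this repo: the pod name, image, and UUID are placeholders, and depending on how the NVIDIA Device Plugin is configured, its own allocation may take precedence over the `NVIDIA_VISIBLE_DEVICES` environment variable.

```yaml
# Sketch: pin a workload to one physical GPU by UUID.
# List host GPU UUIDs with:  nvidia-smi -L
apiVersion: v1
kind: Pod
metadata:
  name: llama-server          # illustrative name
spec:
  runtimeClassName: nvidia    # containerd NVIDIA runtime class
  containers:
    - name: llama-server
      image: ghcr.io/ggml-org/llama.cpp:server-cuda   # illustrative image
      env:
        - name: NVIDIA_VISIBLE_DEVICES
          # Placeholder UUID — e.g. the RTX 5090 for training/inference configs
          value: "GPU-xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx"
      resources:
        limits:
          nvidia.com/gpu: 1   # advertised by the NVIDIA Device Plugin
```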
| # | Project | Status | Description |
|---|---------|--------|-------------|
| 01 | LoRA Training | Done | FLUX.1-dev OOM → SDXL pivot → 10k-step LoRA on custom character |
| 02 | Dataset Creation | Done | ComfyUI pipeline: Qwen Image Edit + Florence2 auto-captioning |
| 03 | Music Generation | Done | ACE-Step 1.5 music generation via ComfyUI |
| 04 | Multi-Token Prediction | Done | Reproduced Meta's MTP paper on single RTX 5090 (1.8x inference speedup) |
| 05 | GRPO Reasoning | Done | Taught Qwen3.5-0.8B to reason like DeepSeek-R1 (+5.9pp zero-shot GSM8K) |
- Orchestration: Kubernetes v1.35.0 (kubeadm)
- CNI: Cilium 1.18.5
- Container Runtime: containerd 1.7.28 with NVIDIA runtime
- GPU Management: NVIDIA Device Plugin 0.17.1
- NVIDIA Driver: 590.48.01 | CUDA 12.8
- Training: AI-Toolkit (ostris)
- Workflows: ComfyUI 1.38.13
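The glue between containerd and the GPU stack is a dedicated `nvidia` runtime handler, which the `runtimeClassName: nvidia` in pod specs resolves to. A minimal containerd excerpt, assuming the standard config path and the default install location of `nvidia-container-runtime` (paths may differ on your system):

```toml
# /etc/containerd/config.toml (excerpt, illustrative)
[plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
  runtime_type = "io.containerd.runc.v2"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
    BinaryName = "/usr/bin/nvidia-container-runtime"
```

After editing, restart containerd and create a matching Kubernetes `RuntimeClass` named `nvidia` so pods can select this handler.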
| Model | Description | Link |
|-------|-------------|------|
| Llama-3.2-1B-MTP-k8 | Multi-Token Prediction reproduction (1.8x speedup) | HuggingFace |
| Qwen3.5-0.8B-GRPO-Math | GRPO reasoning training (+5.9pp zero-shot GSM8K) | HuggingFace |
gpu-lab/
├── docs/ # Infrastructure setup guides
├── system/ # OS-level configs (sysctl, containerd, modprobe)
├── kubernetes/ # Helm install docs (Cilium, NVIDIA plugin)
├── workloads/ # Kubernetes manifests (AI-Toolkit, ComfyUI, llama-server)
├── projects/ # Self-contained learning modules
│ ├── 01-lora-training/ # LoRA fine-tuning on SDXL
│ ├── 02-dataset-creation/ # Training dataset pipeline
│ ├── 03-music-generation/ # ACE-Step music generation
│ ├── 04-multi-token-prediction/ # MTP paper reproduction
│ └── 05-grpo-reasoning/ # GRPO reasoning training (DeepSeek-R1 technique)
├── model-cards/ # HuggingFace model card templates
└── assets/ # Screenshots and diagrams