# AKS GPU Fine-tuning with LoRA

Fine-tune and deploy large language models on Azure Kubernetes Service (AKS) with GPU support using LoRA (Low-Rank Adaptation).

## Use Case

**Problem:** Organizations need AI models that perform internal reasoning in a specific language (e.g., for regulatory audits) while maintaining flexibility in input/output languages for end users.

**Solution:** LoRA fine-tuning to modify the model's reasoning behavior—something not achievable through RAG, prompt engineering, or agentic approaches.

**Example:** A Swiss bank requires all AI reasoning traces to be in German for audit compliance, but wants customers to interact in any language. The fine-tuned model:
- Receives a question in English, French, or any other language
- Performs all internal chain-of-thought reasoning in German
- Responds to the user in their original language

This repo demonstrates fine-tuning GPT-OSS 20B to achieve this behavior using LoRA on Azure Kubernetes Service with H100 GPUs.
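
To make the target behavior concrete, here is a sketch of what one supervised training example for such a fine-tune might look like. The field names (`prompt`, `reasoning`, `response`) and the content are illustrative assumptions, not taken from this repo's actual dataset:

```shell
# Write one hypothetical training example to disk. Field names and content are
# assumptions for illustration; the repo's real dataset format may differ.
cat > sample-example.jsonl <<'EOF'
{"prompt": "Quels sont les frais pour un virement international ?", "reasoning": "Der Kunde fragt auf Französisch nach den Gebühren für eine Auslandsüberweisung. Die Standardgebühr beträgt 25 CHF.", "response": "Les frais pour un virement international s'élèvent à 25 CHF."}
EOF
wc -l sample-example.jsonl
```

The key property: the `reasoning` field is always German, while `prompt` and `response` stay in the user's language (French in this example).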

## Architecture

## Demo

## Features

- **Automated Azure Infrastructure**: Creates AKS cluster, ACR, storage, and managed identities
- **GPU-Optimized**: Uses NVIDIA GPU Operator and NC-series VMs
- **GPU Monitoring**: Azure Managed Prometheus + Grafana with DCGM metrics
- **LoRA Fine-tuning**: Parameter-efficient training that updates small low-rank adapters instead of full model weights
- **Side-by-Side Inference**: Compare fine-tuned vs baseline models via Web UI

## Prerequisites

**Tools:**
- **Azure CLI** - [Install](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli)
- **kubectl** - [Install](https://kubernetes.io/docs/tasks/tools/) or `az aks install-cli`
- **Helm** - [Install](https://helm.sh/docs/intro/install/)
- **Bash shell** - WSL/WSL2 (Windows), Terminal (macOS/Linux), or Git Bash

**Azure:**
- Azure subscription with **Owner** or **Contributor + User Access Administrator** role
- GPU quota for `Standard_NC80adis_H100_v5` (80 vCPUs, i.e., 1 node) in your region
- Request quota at: [Azure Portal → Quotas](https://portal.azure.com/#view/Microsoft_Azure_Capacity/QuotaMenuBlade/~/myQuotas)
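
To see whether you already have the quota, you can inspect current usage and limits for NC-family vCPUs in your target region; `az vm list-usage` is a standard Azure CLI command, and the region below is only an example:

```shell
# Check GPU vCPU quota in the target region. Falls back gracefully when the
# Azure CLI is not installed or no NC entries are found.
LOCATION="${LOCATION:-swedencentral}"
if command -v az >/dev/null 2>&1; then
  quota=$(az vm list-usage --location "$LOCATION" -o table 2>/dev/null | grep -i "NC" || true)
fi
quota="${quota:-az CLI unavailable or no NC quota entries found}"
echo "$quota"
```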

## Quick Deploy

```bash
# 1. Configure your environment
cp config.sh.template config.sh
# Edit config.sh with your Azure subscription and settings.
# Requires GPU quota for NC80adis_H100_v5; adjust the region if needed.

# 2. Login to Azure
az login

# 3. Run deployment
bash ./scripts/quick-deploy.sh
```
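
A sketch of what a filled-in `config.sh` might contain. The variable names and values below are guesses based on typical templates, not confirmed from this repo, so check `config.sh.template` for the real ones:

```shell
# Illustrative config.sh values - names and defaults are assumptions,
# not confirmed from this repo's template.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"  # your Azure subscription
LOCATION="swedencentral"                                # region with H100 quota
RESOURCE_GROUP="rg-gpt-oss-finetune"
CLUSTER_NAME="aks-gpt-oss"
ACR_NAME="acrgptoss$RANDOM"                             # must be globally unique
GPU_VM_SIZE="Standard_NC80adis_H100_v5"
echo "Deploying $CLUSTER_NAME to $LOCATION"
```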

This will:
1. Create Azure resources (resource group, storage, ACR, managed identity)
2. Create AKS cluster with GPU node pool
3. Set up GPU monitoring (Prometheus + Grafana)
4. Build and push Docker images
5. Deploy fine-tuning job (~20 mins)
6. Deploy inference service with Web UI

## Monitor Fine-tuning

```bash
kubectl get jobs -n workloads
kubectl logs job/gpt-oss-finetune -n workloads -f
```

## GPU Monitoring

The monitoring script automatically sets up:
- Azure Monitor Workspace (Managed Prometheus)
- Azure Managed Grafana with DCGM dashboard
- NVIDIA DCGM metrics scraping

Access the Grafana URL from the script output or the Azure Portal.

**Test queries in Grafana Explore:**
- `DCGM_FI_DEV_GPU_UTIL` - GPU utilization (%)
- `DCGM_FI_DEV_GPU_TEMP` - GPU temperature (°C)
- `DCGM_FI_DEV_FB_USED` - GPU framebuffer memory used (MiB)

Alternatively, visit the DCGM Dashboard for GPU metrics visualization.

## Access Inference Service

```bash
kubectl get svc gpt-oss-inference -n workloads
# Use the EXTERNAL-IP to access the Web UI
```
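
Once the service has an external IP, you can script the lookup instead of copying it by hand. This sketch assumes the service and namespace names used above, and degrades to a message when kubectl or the cluster is unavailable:

```shell
# Resolve the LoadBalancer IP via jsonpath; guarded so it only queries the
# cluster when kubectl is present and the service exists.
if command -v kubectl >/dev/null 2>&1 && kubectl get svc gpt-oss-inference -n workloads >/dev/null 2>&1; then
  ip=$(kubectl get svc gpt-oss-inference -n workloads \
         -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  msg="Web UI should be reachable at http://${ip}/"
else
  msg="cluster not reachable from this shell; run against your AKS context"
fi
echo "$msg"
```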

> **⚠️ Security Note:** The inference service uses a public `LoadBalancer` without authentication—suitable for demos only. For production, use an internal load balancer, network policies, or an API gateway with authentication.

## Individual Steps

```bash
./scripts/01-setup-azure-resources.sh  # Azure resources
./scripts/02-create-aks-cluster.sh     # AKS cluster (GPU pool at 0 nodes)
./scripts/03-setup-gpu-monitoring.sh   # Prometheus + Grafana
./scripts/04-build-and-push-image.sh   # Docker build
./scripts/05-deploy-finetune.sh        # Scales up GPU pool, deploys job
./scripts/06-deploy-inference.sh       # Inference service
```

## Cost Optimization

**GPU nodes start at 0** - The GPU node pool is created with 0 nodes to avoid idle costs (~$20/hr).
Scripts 05 and 06 automatically scale the pool up when deploying workloads.

```bash
# Manual scale down after training (for an autoscaler-enabled pool)
az aks nodepool update \
  --resource-group <rg-name> --cluster-name <cluster-name> \
  --name gpupool --update-cluster-autoscaler --min-count 0 --max-count 0
```
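
To bring the pool back for another run, the inverse operation raises the autoscaler bounds again; if your pool does not use the cluster autoscaler, `az aks nodepool scale` sets an explicit node count instead. Both are standard `az aks nodepool` subcommands, and `<rg-name>`/`<cluster-name>` remain placeholders you must substitute, so this sketch prints the commands rather than running them:

```shell
# Print (rather than execute) the scale-up variants, since the resource group
# and cluster names are placeholders.
scale_up="az aks nodepool update --resource-group <rg-name> --cluster-name <cluster-name> --name gpupool --update-cluster-autoscaler --min-count 0 --max-count 1"
manual="az aks nodepool scale --resource-group <rg-name> --cluster-name <cluster-name> --name gpupool --node-count 1"
printf '%s\n' "$scale_up" "$manual"
```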

## kubectl Context

Creating a new cluster adds a new context to your kubeconfig:
```bash
kubectl config get-contexts        # List all contexts
kubectl config use-context <name>  # Switch to a different cluster
```

## Notes

- Fine-tuning typically takes ~20 minutes on an H100
- GPU nodes take 5-10 minutes to provision when scaling up
- GPU metrics take 3-5 minutes to appear in Grafana after setup
- Uses the NVIDIA PyTorch NGC container for optimized performance
- Requires an Azure subscription with GPU quota for NC80adis_H100_v5