# AKS GPU Fine-tuning with LoRA

Fine-tune and deploy large language models on Azure Kubernetes Service (AKS) with GPU support using LoRA (Low-Rank Adaptation).

## Use Case

**Problem:** Organizations need AI models that perform internal reasoning in a specific language (e.g., for regulatory audits) while maintaining flexibility in input/output languages for end users.

**Solution:** LoRA fine-tuning to modify the model's reasoning behavior—something not achievable through RAG, prompt engineering, or agentic approaches.

**Example:** A Swiss bank requires all AI reasoning traces to be in German for audit compliance, but wants customers to interact in any language. The fine-tuned model:
- Receives a question in English, French, or any other language
- Performs all internal chain-of-thought reasoning in German
- Responds to the user in their original language

This repo demonstrates fine-tuning GPT-OSS 20B to achieve this behavior using LoRA on Azure Kubernetes Service with H100 GPUs.
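
To make the target behavior concrete, here is a sketch of what one supervised training example for such a fine-tune might look like. The field names (`prompt`, `reasoning`, `response`) and the content are illustrative assumptions, not taken from this repo's actual dataset:

```shell
# Write one hypothetical training example to disk. Field names and content are
# assumptions for illustration; the repo's real dataset format may differ.
cat > sample-example.jsonl <<'EOF'
{"prompt": "Quels sont les frais pour un virement international ?", "reasoning": "Der Kunde fragt auf Französisch nach den Gebühren für eine Auslandsüberweisung. Die Standardgebühr beträgt 25 CHF.", "response": "Les frais pour un virement international s'élèvent à 25 CHF."}
EOF
wc -l sample-example.jsonl
```

The key property: the `reasoning` field is always German, while `prompt` and `response` stay in the user's language (French in this example).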

## Architecture

## Demo

## Features

- **Automated Azure Infrastructure**: Creates AKS cluster, ACR, storage, and managed identities
- **GPU-Optimized**: Uses NVIDIA GPU Operator and NC-series VMs
- **GPU Monitoring**: Azure Managed Prometheus + Grafana with DCGM metrics
- **LoRA Fine-tuning**: Parameter-efficient training that updates small low-rank adapters instead of full model weights
- **Side-by-Side Inference**: Compare fine-tuned vs baseline models via Web UI

## Prerequisites

**Tools:**
- **Azure CLI** - [Install](https://docs.microsoft.com/en-us/cli/azure/install-azure-cli)
- **kubectl** - [Install](https://kubernetes.io/docs/tasks/tools/) or `az aks install-cli`
- **Helm** - [Install](https://helm.sh/docs/intro/install/)
- **Bash shell** - WSL/WSL2 (Windows), Terminal (macOS/Linux), or Git Bash

**Azure:**
- Azure subscription with **Owner** or **Contributor + User Access Administrator** role
- GPU quota for `Standard_NC80adis_H100_v5` (80 vCPUs, i.e., 1 node) in your region
- Request quota at: [Azure Portal → Quotas](https://portal.azure.com/#view/Microsoft_Azure_Capacity/QuotaMenuBlade/~/myQuotas)
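
To see whether you already have the quota, you can inspect current usage and limits for NC-family vCPUs in your target region; `az vm list-usage` is a standard Azure CLI command, and the region below is only an example:

```shell
# Check GPU vCPU quota in the target region. Falls back gracefully when the
# Azure CLI is not installed or no NC entries are found.
LOCATION="${LOCATION:-swedencentral}"
if command -v az >/dev/null 2>&1; then
  quota=$(az vm list-usage --location "$LOCATION" -o table 2>/dev/null | grep -i "NC" || true)
fi
quota="${quota:-az CLI unavailable or no NC quota entries found}"
echo "$quota"
```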

## Quick Deploy

```bash
# 1. Configure your environment
cp config.sh.template config.sh
# Edit config.sh with your Azure subscription and settings.
# Requires GPU quota for NC80adis_H100_v5; adjust the region if needed.

# 2. Login to Azure
az login

# 3. Run deployment
bash ./scripts/quick-deploy.sh
```
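
A sketch of what a filled-in `config.sh` might contain. The variable names and values below are guesses based on typical templates, not confirmed from this repo, so check `config.sh.template` for the real ones:

```shell
# Illustrative config.sh values - names and defaults are assumptions,
# not confirmed from this repo's template.
SUBSCRIPTION_ID="00000000-0000-0000-0000-000000000000"  # your Azure subscription
LOCATION="swedencentral"                                # region with H100 quota
RESOURCE_GROUP="rg-gpt-oss-finetune"
CLUSTER_NAME="aks-gpt-oss"
ACR_NAME="acrgptoss$RANDOM"                             # must be globally unique
GPU_VM_SIZE="Standard_NC80adis_H100_v5"
echo "Deploying $CLUSTER_NAME to $LOCATION"
```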

This will:
1. Create Azure resources (resource group, storage, ACR, managed identity)
2. Create AKS cluster with GPU node pool
3. Set up GPU monitoring (Prometheus + Grafana)
4. Build and push Docker images
5. Deploy fine-tuning job (~20 mins)
6. Deploy inference service with Web UI

## Monitor Fine-tuning

```bash
kubectl get jobs -n workloads
kubectl logs job/gpt-oss-finetune -n workloads -f
```

## GPU Monitoring

The monitoring script automatically sets up:
- Azure Monitor Workspace (Managed Prometheus)
- Azure Managed Grafana with DCGM dashboard
- NVIDIA DCGM metrics scraping

Access the Grafana URL from the script output or the Azure Portal.

**Test queries in Grafana Explore:**
- `DCGM_FI_DEV_GPU_UTIL` - GPU utilization (%)
- `DCGM_FI_DEV_GPU_TEMP` - GPU temperature (°C)
- `DCGM_FI_DEV_FB_USED` - GPU framebuffer memory used (MiB)

Alternatively, visit the DCGM Dashboard for GPU metrics visualization.

## Access Inference Service

```bash
kubectl get svc gpt-oss-inference -n workloads
# Use the EXTERNAL-IP to access the Web UI
```
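
Once the service has an external IP, you can script the lookup instead of copying it by hand. This sketch assumes the service and namespace names used above, and degrades to a message when kubectl or the cluster is unavailable:

```shell
# Resolve the LoadBalancer IP via jsonpath; guarded so it only queries the
# cluster when kubectl is present and the service exists.
if command -v kubectl >/dev/null 2>&1 && kubectl get svc gpt-oss-inference -n workloads >/dev/null 2>&1; then
  ip=$(kubectl get svc gpt-oss-inference -n workloads \
         -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
  msg="Web UI should be reachable at http://${ip}/"
else
  msg="cluster not reachable from this shell; run against your AKS context"
fi
echo "$msg"
```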

> **⚠️ Security Note:** The inference service uses a public `LoadBalancer` without authentication—suitable for demos only. For production, use an internal load balancer, network policies, or an API gateway with authentication.

## Individual Steps

```bash
./scripts/01-setup-azure-resources.sh  # Azure resources
./scripts/02-create-aks-cluster.sh     # AKS cluster (GPU pool at 0 nodes)
./scripts/03-setup-gpu-monitoring.sh   # Prometheus + Grafana
./scripts/04-build-and-push-image.sh   # Docker build
./scripts/05-deploy-finetune.sh        # Scales up GPU pool, deploys job
./scripts/06-deploy-inference.sh       # Inference service
```

## Cost Optimization

**GPU nodes start at 0** - The GPU node pool is created with 0 nodes to avoid idle costs (~$20/hr).
Scripts 05 and 06 automatically scale the pool up when deploying workloads.

```bash
# Manual scale down after training (for an autoscaler-enabled pool)
az aks nodepool update \
  --resource-group <rg-name> --cluster-name <cluster-name> \
  --name gpupool --update-cluster-autoscaler --min-count 0 --max-count 0
```
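
To bring the pool back for another run, the inverse operation raises the autoscaler bounds again; if your pool does not use the cluster autoscaler, `az aks nodepool scale` sets an explicit node count instead. Both are standard `az aks nodepool` subcommands, and `<rg-name>`/`<cluster-name>` remain placeholders you must substitute, so this sketch prints the commands rather than running them:

```shell
# Print (rather than execute) the scale-up variants, since the resource group
# and cluster names are placeholders.
scale_up="az aks nodepool update --resource-group <rg-name> --cluster-name <cluster-name> --name gpupool --update-cluster-autoscaler --min-count 0 --max-count 1"
manual="az aks nodepool scale --resource-group <rg-name> --cluster-name <cluster-name> --name gpupool --node-count 1"
printf '%s\n' "$scale_up" "$manual"
```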

## kubectl Context

Creating a new cluster adds a new context to your kubeconfig:
```bash
kubectl config get-contexts        # List all contexts
kubectl config use-context <name>  # Switch to a different cluster
```

## Notes

- Fine-tuning typically takes ~20 minutes on an H100
- GPU nodes take 5-10 minutes to provision when scaling up
- GPU metrics take 3-5 minutes to appear in Grafana after setup
- Uses the NVIDIA PyTorch NGC container for optimized performance
- Requires an Azure subscription with GPU quota for NC80adis_H100_v5