Production-ready framework for orchestrating robotics and AI workloads on Azure using NVIDIA Isaac Lab, Isaac Sim, and OSMO.
| Capability | Description |
|---|---|
| Infrastructure as Code | Terraform modules for reproducible Azure deployments |
| Dual Orchestration | Submit jobs via AzureML or OSMO |
| Workload Identity | Key-less auth via Azure AD (setup guide) |
| Private Networking | Services on private VNet with VPN gateway for cluster access (client setup) |
| MLflow Integration | Experiment tracking with Azure ML (details) |
| GPU Scheduling | KAI Scheduler for efficient utilization |
| Auto-scaling | Pay-per-use GPU compute on AKS Spot nodes |
The infrastructure deploys an AKS cluster with GPU node pools running the NVIDIA GPU Operator and KAI Scheduler. Training workloads can be submitted through either OSMO workflows (control plane and backend operator) or AzureML jobs (ML extension). Both platforms share common infrastructure: Azure Storage for checkpoints and data, Key Vault for secrets, and Azure Container Registry for container images. OSMO additionally uses PostgreSQL for workflow state and Redis for caching.
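For example, once the deployment completes, the shared checkpoint storage can be inspected directly from the CLI. A minimal sketch, assuming the Terraform module exposes the account name as an output (the output name `storage_account_name` is illustrative):

```bash
# List the blob containers that hold training data and checkpoints.
# --auth-mode login uses your Azure AD identity; it requires a data-plane
# RBAC role such as Storage Blob Data Reader on the account.
STORAGE_ACCOUNT=$(terraform -chdir=deploy/001-iac output -raw storage_account_name)
az storage container list \
  --account-name "$STORAGE_ACCOUNT" \
  --auth-mode login \
  --output table
```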
Azure Infrastructure (deployed by Terraform):
| Component | Purpose |
|---|---|
| Virtual Network | Private networking with NAT Gateway and DNS Resolver |
| Private Endpoints | Secure access to Azure services (7 endpoints, 11+ DNS zones) |
| AKS Cluster | Kubernetes with GPU Spot node pools and Workload Identity |
| Key Vault | Secrets management with RBAC authorization |
| Azure ML Workspace | Experiment tracking, model registry |
| Storage Account | Training data, checkpoints, and workflow artifacts |
| Container Registry | Training and OSMO container images |
| Azure Monitor | Log Analytics, Prometheus metrics, Managed Grafana |
| PostgreSQL | OSMO workflow state persistence |
| Redis | OSMO job queue and caching |
| VPN Gateway | Point-to-Site and Site-to-Site connectivity (required for private cluster access) |
Kubernetes Components (deployed by setup scripts):
| Component | Purpose |
|---|---|
| NVIDIA GPU Operator | GPU drivers, device plugin, DCGM metrics exporter |
| KAI Scheduler | GPU-aware scheduling with bin-packing |
| AzureML Extension | ML training and inference job submission |
| OSMO Control Plane | Workflow API, router, and web interface |
| OSMO Backend Operator | Workflow execution on cluster |
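Pods opt into KAI by naming it as their scheduler in the pod spec. A minimal GPU smoke-test sketch: `schedulerName` is standard Kubernetes, while the scheduler name and queue label follow KAI's upstream examples and may differ in this deployment:

```bash
# One-shot pod scheduled by KAI rather than the default kube-scheduler.
kubectl apply -f - <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
  labels:
    kai.scheduler/queue: default   # assumed queue name; check your KAI config
spec:
  schedulerName: kai-scheduler     # upstream default name; may differ here
  restartPolicy: Never
  containers:
    - name: cuda
      image: nvidia/cuda:12.4.1-base-ubuntu22.04
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 1
EOF
```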
> [!NOTE]
> Running both AzureML and OSMO on the same cluster? Create separate GPU node pools for each platform. AzureML uses Volcano while OSMO uses KAI Scheduler, and the two schedulers don't share resource visibility; without dedicated pools, jobs from one platform may fail while the other holds the GPUs. Configure node selectors and taints to isolate the workloads, as sketched below.
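A sketch of what dedicated pools could look like with the standard Azure CLI; pool names, VM size, and autoscaler bounds are illustrative:

```bash
# GPU Spot pool reserved for OSMO; the taint keeps AzureML pods off it.
az aks nodepool add \
  --resource-group <rg> \
  --cluster-name <aks> \
  --name osmogpu \
  --node-vm-size Standard_NV36ads_A10_v5 \
  --priority Spot \
  --enable-cluster-autoscaler --min-count 0 --max-count 4 \
  --labels workload=osmo \
  --node-taints workload=osmo:NoSchedule

# Create a mirror pool with workload=azureml for AzureML jobs, then give
# each platform's pods a matching nodeSelector and toleration.
```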
OSMO orchestration on Azure enables production-scale robotics training across industries:
| Use Case | Training Scenario |
|---|---|
| Warehouse AMRs | Navigation policies with 1000+ parallel environments, checkpointing to Azure Storage |
| Manufacturing Arms | Manipulation strategies with physics-accurate simulation on pay-per-use GPU |
| Legged Robots | Locomotion optimization with MLflow tracking for sim-to-real transfer |
| Collaborative Robots | Safe interaction policies with Azure Monitor logging for compliance |
| Tool | Version | Installation |
|---|---|---|
| Azure CLI | 2.50+ | `brew install azure-cli` |
| Terraform | 1.9.8+ | `brew install terraform` |
| kubectl | 1.28+ | `brew install kubectl` |
| Helm | 3.x | `brew install helm` |
| jq | latest | `brew install jq` |
| OSMO CLI | latest | See NVIDIA docs |
- Azure subscription with the Contributor and Role Based Access Control Administrator roles
  - Scope: subscription (if creating a new resource group) or resource group (if using an existing one)
  - Terraform creates role assignments for managed identities
  - Alternative: Owner (grants more permissions than required)
- GPU VM quota for your target region (e.g., `Standard_NV36ads_A10_v5`); see the quota check below
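A quick way to confirm quota before deploying; the grep pattern assumes NV- or NC-series targets:

```bash
# Show current vs. allowed vCPU usage for GPU VM families in the region.
az vm list-usage --location <region> --output table | grep -Ei 'NV|NC'
```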
```bash
cd deploy/001-iac
source ../000-prerequisites/az-sub-init.sh
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars with your values
terraform init && terraform apply -var-file=terraform.tfvars
```

For automation and additional configuration, see deploy/001-iac/README.md.
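The setup steps below reuse names from this deployment; assuming the module exposes them as Terraform outputs (the output names here are illustrative), they can be captured like so:

```bash
# Capture deployment names for the cluster-setup steps below.
# Output names are assumptions; run `terraform output` to see the real ones.
RG=$(terraform output -raw resource_group_name)
AKS=$(terraform output -raw aks_cluster_name)
echo "resource group: $RG, cluster: $AKS"
```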
The default configuration creates a private AKS cluster. Deploy the VPN Gateway to access the cluster:
```bash
cd vpn
cp terraform.tfvars.example terraform.tfvars
# Edit terraform.tfvars - must match parent deployment values
terraform init && terraform apply -var-file=terraform.tfvars
```

See VPN client setup for connecting from your local machine.
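Once the gateway is provisioned, the point-to-site client bundle can be generated from the CLI; gateway and resource group names are placeholders:

```bash
# Emit a download URL for the P2S VPN client configuration package.
az network vnet-gateway vpn-client generate \
  --resource-group <rg> \
  --name <vpn-gateway-name> \
  --output tsv
```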
> [!NOTE]
> Skip this step if you set `should_enable_private_aks_cluster = false` for a public AKS control plane. See Network Configuration Modes for hybrid options that keep Azure services private while allowing public cluster access.
```bash
cd ../../002-setup

# Get cluster credentials (resource group and cluster name from terraform output)
az aks get-credentials --resource-group <rg> --name <aks>

# Verify connectivity (requires VPN for private clusters)
kubectl cluster-info

# Deploy GPU infrastructure
./01-deploy-robotics-charts.sh

# Deploy AzureML extension
./02-deploy-azureml-extension.sh

# Deploy OSMO
./03-deploy-osmo-control-plane.sh
./04-deploy-osmo-backend.sh
```
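Before submitting jobs, it's worth a quick sanity check that the GPU Operator has registered the GPUs; the `gpu-operator` namespace is the operator's upstream default and may differ in this setup:

```bash
# Operator pods should all be Running or Completed.
kubectl get pods -n gpu-operator

# Nodes should advertise allocatable GPUs to the scheduler.
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPUS:.status.allocatable.nvidia\.com/gpu'
```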
**OSMO Training** – Submits to the NVIDIA OSMO orchestrator:

```bash
# Quick training run (100 iterations for testing)
./scripts/submit-osmo-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0 --max-iterations 100
# Full training with custom environments
./scripts/submit-osmo-training.sh --task Isaac-Velocity-Rough-Anymal-D-v0 --num-envs 4096
# Resume from checkpoint
./scripts/submit-osmo-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0 \
--checkpoint-uri "runs:/<run-id>/checkpoints" --checkpoint-mode resumeAzureML Training β Submits to Azure Machine Learning:
# Quick training run
./scripts/submit-azureml-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0 --max-iterations 100
# Full training with log streaming
./scripts/submit-azureml-training.sh --task Isaac-Velocity-Rough-Anymal-D-v0 --num-envs 4096 --stream
# Resume training from registered model
./scripts/submit-azureml-training.sh --task Isaac-Velocity-Rough-Anymal-C-v0 \
--checkpoint-uri "azureml://models/isaac-velocity-rough-anymal-c-v0/versions/1" \
  --checkpoint-mode resume
```

**AzureML Validation** – Validates a trained model:

```bash
# Validate latest model version (model name derived from task)
./scripts/submit-azureml-validation.sh --task Isaac-Velocity-Rough-Anymal-C-v0
# Validate specific model version with custom episodes
./scripts/submit-azureml-validation.sh --model-name isaac-velocity-rough-anymal-c-v0 \
--model-version 2 --eval-episodes 200
# Validate with streaming logs
./scripts/submit-azureml-validation.sh --model-name my-policy --stream
```

> [!TIP]
> Run any script with `--help` for all available options.
| Scenario | Storage Auth | Registry | Use Case |
|---|---|---|---|
| Access Keys | Keys | nvcr.io | Development |
| Workload Identity | Federated | nvcr.io | Production |
| Workload Identity + ACR | Federated | Private ACR | Air-gapped |
See 002-setup/README.md for detailed instructions.
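For the Workload Identity scenarios, the federation between the training service account and its managed identity can be verified from both sides; identity, service account, and namespace names are placeholders:

```bash
# Azure side: list federated credentials attached to the managed identity.
az identity federated-credential list \
  --identity-name <managed-identity> \
  --resource-group <rg> \
  --output table

# Kubernetes side: the service account should carry the
# azure.workload.identity/client-id annotation.
kubectl get serviceaccount <training-sa> -n <namespace> -o yaml
```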
```
.
├── deploy/
│   ├── 000-prerequisites/   # Azure CLI and provider setup
│   ├── 001-iac/             # Terraform infrastructure
│   └── 002-setup/           # Cluster configuration scripts
├── scripts/
│   ├── submit-azureml-*.sh  # AzureML job submission
│   └── submit-osmo-*.sh     # OSMO workflow submission
├── workflows/
│   ├── azureml/             # AzureML job templates
│   └── osmo/                # OSMO workflow templates
├── src/training/            # Training code
└── docs/                    # Additional documentation
```
| Guide | Description |
|---|---|
| Deploy Overview | Deployment order and quick path |
| Infrastructure | Terraform configuration and modules |
| Cluster Setup | Scripts and deployment scenarios |
| Scripts | Training and validation submission |
| Workflows | Job and workflow templates |
| MLflow Integration | Experiment tracking setup |
Use the Azure Pricing Calculator to estimate costs. Add these services based on the architecture:
| Service | Configuration | Notes |
|---|---|---|
| Azure Kubernetes Service (AKS) | System pool: Standard_D4s_v3 (3 nodes) | Always-on control plane |
| Virtual Machines (Spot) | Standard_NV36ads_A10_v5 or NC-series | GPU nodes scale to zero when idle |
| Azure Database for PostgreSQL | Flexible Server, Burstable B1ms | OSMO workflow state |
| Azure Cache for Redis | Basic C0 or Standard C1 | OSMO job queue |
| Azure Machine Learning | Basic workspace | No additional compute costs (uses AKS) |
| Storage Account | Standard LRS, ~100GB | Checkpoints and datasets |
| Container Registry | Basic or Standard | Image storage |
| Log Analytics | ~5GB/day ingestion | Monitoring data |
| Azure Managed Grafana | Essential tier | Dashboards (optional) |
| VPN Gateway | VpnGw1 | Point-to-site access (optional) |
GPU Spot VMs provide significant savings (60-90%) compared to on-demand pricing. Actual costs depend on training frequency, job duration, and data volumes.
MIT License. See LICENSE.md.
- microsoft/edge-ai – Infrastructure components
- NVIDIA Isaac Lab – RL framework
- NVIDIA OSMO – Workflow orchestration