Agent-building capabilities are currently out of scope. Once the hub matures, they could be added to the portal itself, so that business users can create their own agents in the AI Hub with central governance, metadata, and guardrails in place.
High-level breakdown of the AKS compute and infrastructure analysis, a prerequisite for Foundry agent-building capabilities.
AKS Automatic Feasibility Assessment
Date: 2026-02-27
Last Updated: 2026-02-27 (revised with cited sources)
Scope: Evaluate whether AKS Automatic can be added to the AI Hub infrastructure given current network allocation across dev, test, and prod environments.
Assumption: BYO (Bring Your Own) VNet — AI Hub integrates with existing BC Gov Landing Zone networking, not AKS Automatic's default managed VNet.
Executive Summary
AKS Automatic is feasible within the current AI Hub network layout for dev-sized clusters (≤25 working nodes with surge headroom) in dev and test today — no changes required. Production requires a request for additional /24 address space from the BC Gov Landing Zone team, sized to actual cluster and node pool requirements (see Prod Sizing Math).
With Azure CNI Overlay (default for AKS Automatic [1]), only AKS nodes consume VNet subnet IPs — pods use an overlay CIDR (default 10.244.0.0/16). However, two hidden limiters apply: (a) each node claims a /24 from the pod CIDR, so a /16 pod CIDR caps you at 256 nodes [3], and (b) rolling upgrades require n + max_surge IPs, leaving effective capacity below raw usable IP count [6].
AKS Automatic uses the Standard pricing tier (not Premium) at ~$73/month for the control plane [2]. `az aks stop` is not supported with NAP-enabled clusters [8], so cost optimization relies on scaling user node pools to 0, Spot VMs, and right-sizing.
All traffic from overlay pods to destinations outside the cluster is SNAT'd to the node IP — pods are not directly reachable from the VNet or peered networks [5]. This is an architectural constraint that must be designed around using LoadBalancer services or Ingress.
Deployment Safeguards are on by default in AKS Automatic with Warn level, enforcing Kubernetes best practices via Azure Policy + Gatekeeper [9]. Baseline Pod Security Standards cannot be disabled [9].
Estimated sustainment effort: 0.5 FTE at AI Hub's current scale.
Table of Contents
- Network Feasibility
- VNet Model: Managed vs BYO
- IP Addressing Deep Dive
- Pod CIDR and the 256-Node Cap
- Overlay SNAT and Pod Reachability
- Capacity Planning
- Surge and Upgrade Headroom
- Cost Analysis
- Deployment Safeguards (Governance)
- Operational Sustainment
- Karpenter / NAP Caveats
- Recommendation
- Prod Sizing Math
- Sources
Network Feasibility
Current Address Space Allocation
| Env | Address Spaces | Type | Subnets In Use | Free in Infra /24 |
|---|---|---|---|---|
| Dev | 1 × /24 (10.46.15.0/24) | Single shared /24 | PE /27 + APIM /27 + ACA /27 = 96 IPs | .96 onward — 160 IPs |
| Test | 2 × /24s | Dedicated PE + infra /24s | APIM /27 + AppGW /27 + ACA /27 = 96 IPs | .96 onward — 160 IPs |
| Prod | Not yet defined | — | — | Greenfield |
AKS Automatic Network Requirements (BYO VNet)
AKS Automatic defaults to a Microsoft-managed VNet. For AI Hub, we need BYO VNet to integrate with BC Gov Landing Zone networking [1].
With BYO VNet, AKS Automatic uses Azure CNI Overlay powered by Cilium [1] — pod IPs come from an overlay CIDR (default 10.244.0.0/16), not from the VNet subnet. Only node IPs consume subnet addresses.
| Component | Subnet Needed | Delegation Required? |
|---|---|---|
| Node pool subnet | /27 minimum, /24 recommended for scaling | No — node subnets must NOT be delegated [4] |
| API Server VNet Integration subnet | /28 dedicated (minimum) | Yes — delegated to Microsoft.ContainerService/managedClusters [7] |
Important: The node subnet and the API server subnet are separate. Only the API server subnet requires delegation [7]. Using a delegated subnet for nodes will cause deployment to fail [4].
Verdict Per Environment
| Env | AKS /27 Fits? | AKS /24 Fits? | Action Required |
|---|---|---|---|
| Dev | Yes (25 working nodes + surge) | No (no room) | Add a /28 for API server VNet integration |
| Test | Yes (25 working nodes + surge) | No (no room in current /24s) | Request 1 additional /24 from Landing Zone if scaling beyond 25 nodes |
| Prod | Plan ahead | Plan ahead | See Prod Sizing Math |
VNet Model: Managed vs BYO
AKS Automatic supports two VNet models [1]:
| Feature | Managed VNet (Default) | BYO VNet (AI Hub) |
|---|---|---|
| VNet ownership | Microsoft-managed | Customer-managed |
| Egress | AKS managed NAT gateway | Customer choice: LB, NAT GW, or UDR |
| Subnet control | Automatic | Customer creates node + API server subnets |
| Node subnet delegation | Not applicable | Not required (node subnet must NOT be delegated) [4] |
| API server subnet delegation | Not applicable | Required: Microsoft.ContainerService/managedClusters [7] |
| RBAC on VNet | Not applicable | Cluster identity needs Network Contributor on subnet [7] |
| Private cluster | Not applicable | Supported with custom VNet [1] |
AI Hub decision: BYO VNet is required because all infrastructure exists within BC Gov Landing Zone-allocated address spaces. The managed VNet option is not compatible with existing PE, APIM, and AppGW subnets.
IP Addressing Deep Dive
Azure Reserved IPs Per Subnet
Every Azure subnet reserves 5 IPs regardless of size:
| IP | Reserved For |
|---|---|
| .0 | Network address |
| .1 | Default gateway |
| .2 | Azure DNS mapping |
| .3 | Azure DNS mapping |
| Last IP | Broadcast |
Usable IPs by Prefix Length
| Prefix | Total IPs | Usable (minus 5 reserved) | Effective with Surge (n−1) |
|---|---|---|---|
| /28 | 16 | 11 | 10 |
| /27 | 32 | 27 | 26 |
| /26 | 64 | 59 | 58 |
| /25 | 128 | 123 | 122 |
| /24 | 256 | 251 | 250 |
The "Effective with Surge" column accounts for 1 IP reserved for rolling upgrade operations (default max_surge = 1) [6].
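The table above can be sanity-checked with Python's standard `ipaddress` module — a quick sketch using the 5-IP Azure reservation and the default `max_surge = 1` from this section:

```python
import ipaddress

AZURE_RESERVED = 5  # .0 network, .1 gateway, .2/.3 Azure DNS, last IP broadcast
MAX_SURGE = 1       # default rolling-upgrade surge [6]

for prefix in (28, 27, 26, 25, 24):
    subnet = ipaddress.ip_network(f"10.46.15.0/{prefix}")
    total = subnet.num_addresses
    usable = total - AZURE_RESERVED
    print(f"/{prefix}: total={total}, usable={usable}, effective={usable - MAX_SURGE}")
# /27 -> total=32, usable=27, effective=26; /24 -> total=256, usable=251, effective=250
```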
Why 26 Working Nodes in a /27 (Not 27)
With Azure CNI Overlay, only nodes consume subnet IPs (1 IP per node). However, rolling upgrades require surge capacity [6]:
- /27 = 32 total IPs − 5 reserved = 27 usable IPs
- Minus 1 for max_surge during rolling upgrades = 26 working nodes
- If max_surge is set higher (e.g., 10%), subtract accordingly
The IP planning docs state: "Your node count is then n + number-of-additional-scaled-nodes-you-anticipate + max surge" [6].
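The quoted formula can be expressed directly. A small sketch (the `"10%"` handling is an assumption about how a percentage surge rounds up to whole nodes):

```python
import math

def ips_required(nodes: int, max_surge="1") -> int:
    """IPs a node pool needs: n working nodes plus rolling-upgrade surge [6].
    max_surge may be a node count ("1") or a percentage ("10%")."""
    if str(max_surge).endswith("%"):
        surge = math.ceil(nodes * int(max_surge[:-1]) / 100)
    else:
        surge = int(max_surge)
    return nodes + surge

USABLE_IN_27 = 27  # after Azure's 5 reserved IPs
assert ips_required(26) == USABLE_IN_27      # 26 working nodes exactly fill a /27
assert ips_required(20, "10%") == 22         # 10% surge on 20 nodes needs 2 spare IPs
```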
Pod CIDR and the 256-Node Cap
Each node is assigned a /24 address space carved out of the pod CIDR [3]:
"Each node is assigned a /24 address space carved out from the same CIDR. Extra nodes created when you scale out a cluster automatically receive /24 address spaces from the same CIDR." — [3]
With the default pod CIDR of 10.244.0.0/16:
| Pod CIDR Size | Available /24 Blocks | Max Nodes |
|---|---|---|
| /16 (default) | 256 | 256 |
| /15 | 512 | 512 |
| /14 | 1,024 | 1,024 |
This is a hidden limiter: even with a /24 node subnet (251 usable IPs), the default /16 pod CIDR only supports 256 nodes. For clusters approaching this limit, use --pod-cidr to specify a larger CIDR [3]. The pod CIDR can be expanded after creation [3].
For AI Hub's expected scale (<50 nodes), the default /16 is sufficient.
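Because each node consumes one /24 block from the pod CIDR [3], the node ceiling for a given pod CIDR is just a power of two:

```python
def max_nodes(pod_cidr_prefix: int) -> int:
    """Node ceiling = number of /24 blocks inside the pod CIDR [3]."""
    return 2 ** (24 - pod_cidr_prefix)

assert max_nodes(16) == 256   # default 10.244.0.0/16
assert max_nodes(15) == 512
assert max_nodes(14) == 1024
```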
Overlay SNAT and Pod Reachability
Azure CNI Overlay fundamentally changes how pod traffic flows compared to flat networks [5]:
"Communication with endpoints outside the cluster, such as on-premises and peered virtual networks, uses the node IP through network address translation (NAT). Azure CNI translates the source IP (overlay IP of the pod) of the traffic to the primary IP address of the VM." — [3]
"Endpoints outside the cluster can't connect to a pod directly. You have to publish the pod's application as a Kubernetes Load Balancer service to make it reachable on the virtual network." — [3]
Implications for AI Hub
| Traffic Direction | Behavior | Design Impact |
|---|---|---|
| Pod → VNet/on-prem | SNAT'd to node IP; pod IP hidden | No direct pod IP visibility in VNet flow logs |
| Pod → Internet | Via Load Balancer, NAT Gateway, or UDR [3] | Configure egress method in BYO VNet |
| VNet → Pod | Not possible directly | Must use LoadBalancer Service, Ingress, or Internal LB |
| Pod → Pod (same cluster) | Direct overlay routing, no SNAT | No performance penalty [3] |
Key takeaway: Any service that needs to be reached by other VNet resources (PE connections, on-prem, APIM backends) must be exposed via a Kubernetes Service (LoadBalancer/ClusterIP + Ingress). This is standard Kubernetes practice but worth calling out for teams accustomed to VM-based networking where every workload has a routable VNet IP.
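As a sketch of the pattern, an internal LoadBalancer Service keeps the endpoint on the VNet without public exposure. The `ai-hub-api` name and ports are hypothetical; the internal-LB annotation is the documented Azure annotation:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ai-hub-api          # hypothetical workload name
  annotations:
    # Keep the Azure Load Balancer internal so only VNet/peered networks can reach it
    service.beta.kubernetes.io/azure-load-balancer-internal: "true"
spec:
  type: LoadBalancer
  selector:
    app: ai-hub-api
  ports:
    - port: 443
      targetPort: 8443
```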
Capacity Planning
Pods and Namespaces
Namespaces are purely logical — zero IP cost. Thousands can be created regardless of subnet size.
Pods with CNI Overlay get IPs from the overlay CIDR (default 10.244.0.0/16). Each node gets a /24 from the pod CIDR, and for CNI Overlay the default (and maximum) max-pods setting is 250 per node [10].
/27 Subnet Capacity (26 working nodes after surge headroom)
| Max Pods/Node | Total Pods | Use Case |
|---|---|---|
| 30 (min) | 780 | Small dev cluster |
| 110 | 2,860 | Standard workloads |
| 250 (Overlay default/max [10]) | 6,500 | Dense packing |
/24 Subnet Capacity (250 working nodes after surge headroom)
| Max Pods/Node | Total Pods | Use Case |
|---|---|---|
| 30 (min) | 7,500 | Conservative |
| 110 | 27,500 | Production at scale |
| 250 (Overlay default/max [10]) | 62,500 | Max density |
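The capacity figures in both tables are just nodes × per-node max-pods; a one-line check:

```python
def total_pods(working_nodes: int, max_pods_per_node: int) -> int:
    """Cluster pod capacity = working nodes x per-node max-pods."""
    return working_nodes * max_pods_per_node

# /27 node subnet -> 26 working nodes after surge headroom
assert total_pods(26, 110) == 2_860
# /24 node subnet -> 250 working nodes
assert total_pods(250, 250) == 62_500
```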
What Limits What
| Resource | Bottleneck | Subnet-Dependent? |
|---|---|---|
| Namespaces | K8s soft limit ~10,000 | No |
| Pods | Overlay CIDR + (nodes × pods/node) | Indirectly (via node count) |
| Nodes | Min of: subnet usable IPs, pod CIDR /24 blocks | Yes (subnet) and Yes (pod CIDR) |
| Containers | Multiple per pod; no separate IP | No |
Surge and Upgrade Headroom
The Problem with a Full /27
The Azure IP planning docs explicitly warn [6]:
"When you upgrade your AKS cluster, a new node is deployed in the cluster. Services and workloads begin to run on the new node, and an older node is removed from the cluster. This rolling upgrade process requires a minimum of one additional block of IP addresses to be available."
For a /27 (27 usable IPs):
| Scenario | Surge Needed | Max Working Nodes | IPs Left for Ops |
|---|---|---|---|
| Default upgrade (max_surge=1) | 1 node | 26 | 0 — fully packed |
| 10% surge on 20 nodes | 2 nodes | 25 | 0 |
| Node replacement (1 failed) | 1 node | 26 | 0 |
Recommendation: In a /27, plan for no more than 25 working nodes to leave operational breathing room for simultaneous surge + node replacement. If the cluster needs >20 nodes, move to a /26 or /24.
Cost Analysis
Subnet size (/27 vs /24) has zero impact on cost — subnets are free. You only pay for nodes (VMs) that exist.
AKS Automatic Pricing Tier
AKS Automatic requires the Standard tier (automatically selected, cannot be changed) [2]:
"Automatic SKU clusters: Must use the Standard tier (automatically selected during cluster creation)." — [2]
| Tier | Hourly Cost | Monthly (~730 hrs) | SKU | Notes |
|---|---|---|---|---|
| Free | $0 | $0 | Base only | Up to 1,000 nodes, no SLA |
| Standard | $0.10/hr | ~$73/mo | Base or Automatic | Required for AKS Automatic. Uptime SLA included. |
| Premium | $0.10/hr + LTS | ~$73/mo + LTS cost | Base only | Includes Long Term Support |
Previous version of this doc incorrectly stated Premium tier at $0.16/hr. Standard tier at $0.10/hr is the correct and only option for AKS Automatic [2].
Idle Cluster Cost (No User Pods)
Even with zero user workloads, AKS Automatic runs system components (CoreDNS, metrics-server, etc.) on system nodes managed by NAP.
| Component | Monthly Estimate | Notes |
|---|---|---|
| AKS Standard tier (required) [2] | ~$73 | $0.10/hr × 730 hrs |
| System node pool (min 2 nodes) [11] | ~$140–280 | 2× Standard_D4s_v5 (NAP-managed size) |
| Load Balancer (Standard) | ~$18 | Base + rules |
| OS Disk (per node) | ~$15–30 | 2× 128 GiB Premium SSD |
| Egress/Logs | ~$5–20 | Minimal for idle |
| Total idle cluster | ~$250–420/mo | Sum of the estimates above |
Note on system nodes: AKS Automatic uses NAP (Karpenter) to manage all node pools, including system nodes. System nodes run on customer-billed VMs — the system pool requires at least 2 nodes with minimum 4 vCPUs each [11]. NAP decides the VM SKU but the compute cost is charged to the customer's subscription.
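The idle-cluster range is just the sum of the line items. A sketch using this table's own estimates (all figures are estimates from the table, not quoted Azure prices):

```python
HOURS_PER_MONTH = 730
control_plane = 0.10 * HOURS_PER_MONTH        # Standard tier ~$73/mo [2]

low  = control_plane + 140 + 18 + 15 + 5      # system nodes, LB, disks, egress (low end)
high = control_plane + 280 + 18 + 30 + 20     # same items, high end
print(round(low), round(high))                # 251 421 -> the ~$250-420/mo range
```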
/27 vs /24 Cost Comparison
| /27 | /24 | |
|---|---|---|
| Subnet cost | $0 | $0 |
| Idle cluster cost | ~$250–420/mo | ~$250–420/mo |
| Max working nodes (with surge) | 26 | 250 |
| Max user pods (default 250/node) | ~6,500 | ~62,500 |
Cost Optimization
⚠️ `az aks stop` is NOT supported for NAP-enabled clusters (including all AKS Automatic clusters) [8]:
"You can't stop clusters which use the Node Autoprovisioning (NAP) feature." — [8]
Available cost optimization strategies:
- Scale user node pools to 0 — NAP automatically removes user nodes when no pods need scheduling (system nodes remain) [8]
- Spot VMs for user node pools — 60–90% compute savings (supported with AKS Automatic) [1]
- Right-size workloads — Deployment Safeguards enforces resource requests/limits, preventing over-provisioning [9]
- Planned maintenance windows — set schedules for auto-upgrades to reduce disruption [1]
Deployment Safeguards (Governance)
Deployment Safeguards are on by default in AKS Automatic [9]:
"Deployment Safeguards is turned on by default in AKS Automatic." — [9]
Enforcement Levels
| Level | Behavior | Default in AKS Automatic? |
|---|---|---|
| Warn | Warning messages displayed; request proceeds | Yes (default) |
| Enforce | Non-compliant deployments denied/mutated | Optional — must be explicitly enabled |
What Gets Enforced
Deployment Safeguards includes these built-in policies [9]:
| Policy | Effect (Warn) | Effect (Enforce) |
|---|---|---|
| Resource requests/limits required | Warning | Mutates: sets defaults (500m CPU, 2Gi memory) and minimums (100m CPU, 100Mi memory) |
| Anti-affinity / topology spread | Warning | Mutates: adds pod anti-affinity and topology spread constraints |
| No `latest` image tag | Warning | Denied |
| Liveness/readiness probes required | Warning | Denied |
| CSI driver for storage classes | Warning | Denied if using in-tree provisioners |
| Reserved system pool taints | Warning | Mutates: removes CriticalAddonsOnly from user pools |
| Unique service selectors | Warning | Denied |
Baseline Pod Security Standards
"Baseline Pod Security Standards are now turned on by default in AKS Automatic. The baseline Pod Security Standards in AKS Automatic can't be turned off." — [9]
This enforces restrictions on: host namespaces, privileged containers, host ports, AppArmor profiles, SELinux, /proc mount, seccomp profiles, and sysctls [9].
Governance Model for AI Hub
| Decision | Recommendation |
|---|---|
| Initial level | Start with Warn (default) to audit without blocking |
| Production level | Move to Enforce after reviewing warnings for 2–4 weeks |
| PSS level | Baseline (on by default, cannot be turned off) |
| Custom policies | Add via Azure Policy assignments; no need for third-party engines |
| Namespace exclusions | Exclude infra namespaces (e.g., monitoring) if they need elevated privileges |
Operational Sustainment
What AKS Automatic Manages
| Area | Detail | Source |
|---|---|---|
| Node OS patching | Auto-patched, Azure Linux OS [1] | [1] |
| Node scaling | NAP (Karpenter-based) built-in — no tuning needed [1] | [1] |
| K8s version upgrades | Auto-upgraded; planned maintenance windows supported [1] | [1] |
| etcd | Managed control plane — no backup/restore | — |
| Network policy | Cilium built-in — no third-party CNI to manage [1] | [1] |
| Monitoring | Managed Prometheus + Container Insights auto-configured [1] | [1] |
| Policy enforcement | Deployment Safeguards + Baseline PSS on by default [9] | [9] |
| Node resource group | Fully managed — locked to prevent accidental changes [1] | [1] |
What You Still Own
Helm Releases (~40% of ongoing work)
| Task | Frequency | Effort |
|---|---|---|
| Helm chart upgrades (ingress, cert-manager, etc.) | Monthly via Renovate | ~2–4 hrs/mo |
| Helm values drift detection | Continuous (GitOps) | Automated with Flux/ArgoCD |
| Chart breaking changes | Quarterly | ~4–8 hrs per major bump |
| New service onboarding | As needed | ~2–4 hrs per service |
Note: AKS Automatic includes managed NGINX ingress via the application routing add-on [1], which may reduce the need for self-managed ingress Helm charts.
Azure Policy / Deployment Safeguards Management (~20% of ongoing work)
AKS Automatic includes Deployment Safeguards (Azure Policy + Gatekeeper) as the built-in governance engine [9]. This is Microsoft-managed and supported.
| Task | Frequency | Effort |
|---|---|---|
| Review Deployment Safeguards warnings | Weekly | ~1 hr/week |
| Recommend Enforce level after audit period | One-time (after first month) | ~4 hrs |
| Policy exemptions for new workloads | As needed | ~30 min each |
| Namespace exclusion management | As services change | ~1 hr each |
Key advantage: Deployment Safeguards (including upgrades, patches, and Gatekeeper compatibility) are managed by Microsoft as part of AKS Automatic [9]. All-or-nothing — you cannot selectively disable individual policies [9].
Application Concerns (~30% of ongoing work)
| Task | Frequency | Effort |
|---|---|---|
| Deployment troubleshooting | Ongoing | ~2–6 hrs/week |
| Resource quota/limit tuning | Monthly | ~2 hrs |
| Secret rotation (Key Vault CSI driver) | Mostly automated | ~1 hr/quarter |
| Ingress/TLS certificate management | Automated via cert-manager or app routing add-on | ~1–2 hrs/month |
| Custom dashboards and alerts | One-time + iteration | ~4–8 hrs initial, ~2 hrs/month |
Security & Compliance (~10%)
| Task | Frequency | Effort |
|---|---|---|
| Image vulnerability scanning (Defender for Containers) | Continuous | ~1 hr/week triaging |
| RBAC/namespace access reviews | Monthly | ~2 hrs |
| Network policy updates (Cilium) | As services change | ~1–2 hrs each |
Staffing Recommendation
| Scale | FTE Needed | Profile |
|---|---|---|
| AI Hub current (<10 services, 1–2 clusters) | 0.5 FTE | One platform engineer at 50% |
| Medium (20–50 services, 3 envs) | 1–2 FTE | Dedicated platform/SRE |
| Large (100+ services, multi-region) | 3–5 FTE | Platform team with on-call |
Time Saved vs AKS Standard
| Task | AKS Standard | AKS Automatic | Monthly Savings |
|---|---|---|---|
| Node pool management | Manual | Eliminated (NAP) | ~4–8 hrs |
| K8s upgrades | Plan + test + execute | Automatic | ~3–5 hrs (amortized) |
| OS patching | Schedule + drain + cordon | Automatic | ~4 hrs |
| Autoscaler tuning | Manual Cluster Autoscaler | Built-in NAP (Karpenter) | ~4 hrs |
| Monitoring setup | Manual Prometheus stack | Pre-configured | ~16–24 hrs (one-time) |
| Policy setup | Manual Azure Policy config | Deployment Safeguards on by default | ~8 hrs (one-time) |
| Total | — | — | ~30–50 hrs/month |
Karpenter / NAP Caveats
AKS Automatic uses Node Auto-Provisioning (NAP), which is Microsoft's managed deployment of Karpenter [8]:
"Node auto-provisioning (NAP) simplifies this process by automatically provisioning and managing the optimal VM configuration for your workloads... NAP automatically deploys, configures, and manages Karpenter on your AKS clusters" — [8]
Key Limitations
| Limitation | Impact | Workaround |
|---|---|---|
| `az aks stop` not supported [8] | Cannot stop cluster to save costs | Scale user node pools to 0; system nodes always run |
| Cannot use Cluster Autoscaler alongside NAP [8] | One scaling engine only | NAP replaces Cluster Autoscaler entirely |
| Windows node pools not supported [8] | Linux-only workloads | N/A for AI Hub (Linux-only) |
| Cannot change egress outbound type after creation [8] | Plan egress model at cluster creation | Choose LB/NAT GW/UDR upfront |
| IPv6 clusters not supported [8] | IPv4 only | N/A for AI Hub (IPv4) |
| Service principals not supported [8] | Must use managed identity | Already using managed identity in AI Hub |
NAP vs Self-Hosted Karpenter
| Aspect | NAP (AKS Automatic) | Self-hosted Karpenter |
|---|---|---|
| Installation | Managed by Microsoft | Manual Helm deployment |
| Upgrades | Automatic with cluster | Manual chart upgrades |
| VM selection | Optimized by AKS based on workload | Full control via NodePool CRDs |
| Disruption policies | Configurable via AKS API | Full Karpenter config |
| Support | Microsoft-supported | Community/self-supported |
Recommendation
| Decision | Recommendation | Source |
|---|---|---|
| VNet model | BYO VNet — required for BC Gov Landing Zone integration | [1] |
| Node subnet for dev/test | /27 — 26 working nodes with surge headroom | [6] |
| Node subnet for prod | /25 or /24 — see Prod Sizing Math | [6] |
| Node subnet delegation | None — node subnets must NOT be delegated | [4] |
| API server subnet | Dedicated /28, delegated to Microsoft.ContainerService/managedClusters | [7] |
| CNI mode | Azure CNI Overlay (AKS Automatic default) — maximizes node density | [1] |
| Pod CIDR | Default /16 sufficient for <256 nodes; expand if needed | [3] |
| Pricing tier | Standard (only option for Automatic; ~$73/mo) | [2] |
| Policy enforcement | Deployment Safeguards — start at Warn, promote to Enforce | [9] |
| GitOps | Flux (AKS extension) preferred — Microsoft-managed lifecycle | [1] |
| Staffing | 0.5 FTE at current scale | — |
| Cost optimization | Scale to 0 + Spot VMs (NOT `az aks stop`) | [8] |
| When to start | When a concrete workload needs container orchestration beyond what Container Apps provides | — |
Prod Sizing Math
Instead of an arbitrary "4+ /24s", here is the actual math for prod based on BYO VNet:
Single Cluster Scenario
| Component | CIDR Size | IPs Needed | Notes |
|---|---|---|---|
| Node subnet | /24 | 251 usable (250 with surge) | Supports up to 250 working nodes |
| API server VNet integration | /28 | 11 usable | Minimum size, shared across node pools [7] |
| Total new address space | /24 + /28 | 272 total (262 usable) | — |
Multi-Cluster Scenario (e.g., separate AKS per env)
| Component | Per Cluster | 3 Clusters (dev/test/prod) | Notes |
|---|---|---|---|
| Node subnets | 1 × /27 (dev), 1 × /27 (test), 1 × /24 (prod) | /27 + /27 + /24 | Size per expected scale |
| API server subnets | 1 × /28 each | 3 × /28 | Can share /28 if clusters are in same VNet [7] |
| Total new address space | — | 1 × /24 + 2 × /27 + 3 × /28 | 368 total IPs |
What to Request from Landing Zone
| Env | Existing Allocation | Additional Needed for AKS | Request |
|---|---|---|---|
| Dev | 1 × /24 | /27 (node) + /28 (API server) — fits in existing free space | None |
| Test | 2 × /24s | /27 (node) + /28 (API server) — fits in existing free space | None |
| Prod | None | /24 (node) + /28 (API server) | 1 × /24 (or /25 if <123 nodes expected) |
Note: The previous version of this document recommended "4+ /24s" for prod without justification. The actual requirement is 1 additional /24 for a single production AKS cluster, or scale accordingly per cluster count. Each additional AKS cluster needs its own node subnet but can share the API server subnet.
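The single-cluster totals work out as follows — a sketch computing exact values instead of the rounded ones above:

```python
import ipaddress

def azure_usable(prefix: int) -> int:
    """Usable IPs in an Azure subnet (5 reserved per subnet)."""
    return ipaddress.ip_network(f"10.0.0.0/{prefix}").num_addresses - 5

# Single prod cluster: one /24 node subnet + one /28 API-server subnet
node_usable = azure_usable(24)   # 251
api_usable = azure_usable(28)    # 11
print(node_usable + api_usable)  # 262 usable, across 272 total new IPs
```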
Sources