AKS Automatic | Foundry Portal | Agent Builder | Central Governance #119

@mishraomp

Description

Agent-building capabilities are currently out of scope. Once the hub matures, they could be added to the portal itself, so that business users can create their own agents in the AI Hub with all central governance, metadata, and guardrails in place.

Below is a high-level breakdown of the AKS compute and infrastructure analysis, which is a prerequisite for Foundry agent-building capabilities.

AKS Automatic Feasibility Assessment

Date: 2026-02-27
Last Updated: 2026-02-27 (revised with cited sources)
Scope: Evaluate whether AKS Automatic can be added to the AI Hub infrastructure given current network allocation across dev, test, and prod environments.
Assumption: BYO (Bring Your Own) VNet — AI Hub integrates with existing BC Gov Landing Zone networking, not AKS Automatic's default managed VNet.


Executive Summary

AKS Automatic is feasible within the current AI Hub network layout for dev-sized clusters (≤25 working nodes with surge headroom) in dev and test today — no changes required. Production requires a request for additional /24 address space from the BC Gov Landing Zone team, sized to actual cluster and node pool requirements (see Prod Sizing Math).

With Azure CNI Overlay (default for AKS Automatic [1]), only AKS nodes consume VNet subnet IPs — pods use an overlay CIDR (default 10.244.0.0/16). However, two hidden limiters apply: (a) each node claims a /24 from the pod CIDR, so a /16 pod CIDR caps you at 256 nodes [3], and (b) rolling upgrades require n + max_surge IPs, leaving effective capacity below raw usable IP count [6].

AKS Automatic uses the Standard pricing tier (not Premium) at ~$73/month for the control plane [2]. az aks stop is not supported with NAP-enabled clusters [8], so cost optimization relies on scaling user node pools to 0, Spot VMs, and right-sizing.

All traffic from overlay pods to destinations outside the cluster is SNAT'd to the node IP — pods are not directly reachable from the VNet or peered networks [5]. This is an architectural constraint that must be designed around using LoadBalancer services or Ingress.

Deployment Safeguards are on by default in AKS Automatic with Warn level, enforcing Kubernetes best practices via Azure Policy + Gatekeeper [9]. Baseline Pod Security Standards cannot be disabled [9].

Estimated sustainment effort: 0.5 FTE at AI Hub's current scale.



Network Feasibility

Current Address Space Allocation

| Env | Address Spaces | Type | Subnets In Use | Free in Infra /24 |
|---|---|---|---|---|
| Dev | 1 × /24 (10.46.15.0/24) | Single shared /24 | PE /27 + APIM /27 + ACA /27 = 96 IPs | .96 onward — 160 IPs |
| Test | 2 × /24s | Dedicated PE + infra /24s | APIM /27 + AppGW /27 + ACA /27 = 96 IPs | .96 onward — 160 IPs |
| Prod | Not yet defined | Greenfield | n/a | n/a |

AKS Automatic Network Requirements (BYO VNet)

AKS Automatic defaults to a managed VNet where Microsoft controls the VNet. For AI Hub, we need BYO VNet to integrate with BC Gov Landing Zone networking [1].

With BYO VNet, AKS Automatic uses Azure CNI Overlay powered by Cilium [1] — pod IPs come from an overlay CIDR (default 10.244.0.0/16), not from the VNet subnet. Only node IPs consume subnet addresses.

| Component | Subnet Needed | Delegation Required? |
|---|---|---|
| Node pool subnet | /27 minimum, /24 recommended for scaling | No — node subnets must NOT be delegated [4] |
| API Server VNet Integration subnet | /28 dedicated (minimum) | Yes — delegated to `Microsoft.ContainerService/managedClusters` [7] |

Important: The node subnet and the API server subnet are separate. Only the API server subnet requires delegation [7]. Using a delegated subnet for nodes will cause deployment to fail [4].
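As a sanity check, the subnet carve-up can be validated with Python's standard `ipaddress` module. The specific offsets below (node subnet at `.96`, API server subnet at `.128`) are hypothetical picks from the dev /24's free tail, not decided allocations:

```python
import ipaddress

# Hypothetical carve-up of the free tail of the dev /24 (10.46.15.96 onward):
# one /27 for nodes, one /28 for API server VNet integration.
vnet = ipaddress.ip_network("10.46.15.0/24")
node_subnet = ipaddress.ip_network("10.46.15.96/27")
api_subnet = ipaddress.ip_network("10.46.15.128/28")

# Both must fit inside the /24 and must not overlap each other.
assert node_subnet.subnet_of(vnet) and api_subnet.subnet_of(vnet)
assert not node_subnet.overlaps(api_subnet)

# Azure reserves 5 IPs per subnet (.0, .1, .2, .3, and the last address).
print(node_subnet.num_addresses - 5)  # usable IPs in the node /27
print(api_subnet.num_addresses - 5)   # usable IPs in the API server /28
```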

Verdict Per Environment

| Env | AKS /27 Fits? | AKS /24 Fits? | Action Required |
|---|---|---|---|
| Dev | Yes (25 working nodes + surge) | No (no room) | Add a /28 for API server VNet integration |
| Test | Yes (25 working nodes + surge) | No (no room in current /24s) | Request 1 additional /24 from Landing Zone if scaling beyond 25 nodes |
| Prod | Plan ahead | Plan ahead | See Prod Sizing Math |

VNet Model: Managed vs BYO

AKS Automatic supports two VNet models [1]:

| Feature | Managed VNet (Default) | BYO VNet (AI Hub) |
|---|---|---|
| VNet ownership | Microsoft-managed | Customer-managed |
| Egress | AKS managed NAT gateway | Customer choice: LB, NAT GW, or UDR |
| Subnet control | Automatic | Customer creates node + API server subnets |
| Node subnet delegation | Not applicable | Not required (node subnet must NOT be delegated) [4] |
| API server subnet delegation | Not applicable | Required: `Microsoft.ContainerService/managedClusters` [7] |
| RBAC on VNet | Not applicable | Cluster identity needs Network Contributor on subnet [7] |
| Private cluster | Not applicable | Supported with custom VNet [1] |

AI Hub decision: BYO VNet is required because all infrastructure exists within BC Gov Landing Zone-allocated address spaces. The managed VNet option is not compatible with existing PE, APIM, and AppGW subnets.


IP Addressing Deep Dive

Azure Reserved IPs Per Subnet

Every Azure subnet reserves 5 IPs regardless of size:

| IP | Reserved For |
|---|---|
| .0 | Network address |
| .1 | Default gateway |
| .2 | Azure DNS mapping |
| .3 | Azure DNS mapping |
| Last IP | Broadcast |

Usable IPs by Prefix Length

| Prefix | Total IPs | Usable (minus 5 reserved) | Effective with Surge (n−1) |
|---|---|---|---|
| /28 | 16 | 11 | 10 |
| /27 | 32 | 27 | 26 |
| /26 | 64 | 59 | 58 |
| /25 | 128 | 123 | 122 |
| /24 | 256 | 251 | 250 |

The "Effective with Surge" column accounts for 1 IP reserved for rolling upgrade operations (default max_surge = 1) [6].

Why 26 Working Nodes in a /27 (Not 27)

With Azure CNI Overlay, only nodes consume subnet IPs (1 IP per node). However, rolling upgrades require surge capacity [6]:

  • /27 = 32 total IPs − 5 reserved = 27 usable IPs
  • Minus 1 for max_surge during rolling upgrades = 26 working nodes
  • If max_surge is set higher (e.g., 10%), subtract accordingly

The IP planning docs state: "Your node count is then n + number-of-additional-scaled-nodes-you-anticipate + max surge" [6].
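The table values and the n + surge rule above can be reproduced with a small helper (a sketch; `max_working_nodes` is an illustrative name, not an Azure API):

```python
import ipaddress

def max_working_nodes(prefix_len: int, max_surge: int = 1) -> int:
    """Working nodes that fit in an Azure subnet of the given prefix length.

    Azure reserves 5 IPs per subnet; rolling upgrades need `max_surge`
    additional IPs on top of the steady-state node count.
    """
    total = ipaddress.ip_network(f"10.0.0.0/{prefix_len}").num_addresses
    return total - 5 - max_surge

for prefix in (28, 27, 26, 25, 24):
    print(f"/{prefix}: {max_working_nodes(prefix)} working nodes")
# Matches the table: /27 -> 26, /24 -> 250
```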


Pod CIDR and the 256-Node Cap

Each node is assigned a /24 address space carved out of the pod CIDR [3]:

"Each node is assigned a /24 address space carved out from the same CIDR. Extra nodes created when you scale out a cluster automatically receive /24 address spaces from the same CIDR." — [3]

With the default pod CIDR of 10.244.0.0/16:

| Pod CIDR Size | Available /24 Blocks | Max Nodes |
|---|---|---|
| /16 (default) | 256 | 256 |
| /15 | 512 | 512 |
| /14 | 1,024 | 1,024 |

This is a hidden limiter: even with a /24 node subnet (251 usable IPs), the default /16 pod CIDR only supports 256 nodes. For clusters approaching this limit, use --pod-cidr to specify a larger CIDR [3]. The pod CIDR can be expanded after creation [3].

For AI Hub's expected scale (<50 nodes), the default /16 is sufficient.
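The node cap follows directly from counting /24 blocks in the pod CIDR; a quick sketch:

```python
import ipaddress

def max_nodes_from_pod_cidr(pod_cidr: str) -> int:
    """Each node claims one /24 from the pod CIDR, so the node cap is the
    number of /24 blocks the CIDR contains."""
    net = ipaddress.ip_network(pod_cidr)
    return 2 ** (24 - net.prefixlen)

print(max_nodes_from_pod_cidr("10.244.0.0/16"))  # 256 (default cap)
print(max_nodes_from_pod_cidr("10.244.0.0/15"))  # 512
print(max_nodes_from_pod_cidr("10.244.0.0/14"))  # 1024
```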


Overlay SNAT and Pod Reachability

Azure CNI Overlay fundamentally changes how pod traffic flows compared to flat networks [5]:

"Communication with endpoints outside the cluster, such as on-premises and peered virtual networks, uses the node IP through network address translation (NAT). Azure CNI translates the source IP (overlay IP of the pod) of the traffic to the primary IP address of the VM." — [3]

"Endpoints outside the cluster can't connect to a pod directly. You have to publish the pod's application as a Kubernetes Load Balancer service to make it reachable on the virtual network." — [3]

Implications for AI Hub

| Traffic Direction | Behavior | Design Impact |
|---|---|---|
| Pod → VNet/on-prem | SNAT'd to node IP; pod IP hidden | No direct pod IP visibility in VNet flow logs |
| Pod → Internet | Via Load Balancer, NAT Gateway, or UDR [3] | Configure egress method in BYO VNet |
| VNet → Pod | Not possible directly | Must use LoadBalancer Service, Ingress, or Internal LB |
| Pod → Pod (same cluster) | Direct overlay routing, no SNAT | No performance penalty [3] |

Key takeaway: Any service that needs to be reached by other VNet resources (PE connections, on-prem, APIM backends) must be exposed via a Kubernetes Service (LoadBalancer/ClusterIP + Ingress). This is standard Kubernetes practice but worth calling out for teams accustomed to VM-based networking where every workload has a routable VNet IP.
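A minimal sketch of what exposing an overlay workload to the VNet looks like. The service name, namespace, and ports are hypothetical placeholders; the `azure-load-balancer-internal` annotation is the documented Azure way to request an internal (VNet-private) load balancer rather than a public one. The manifest is built as a Python dict and emitted as JSON, which `kubectl apply -f` accepts:

```python
import json

# Hypothetical Service exposing pods behind an internal Azure Load Balancer,
# so peered VNets / APIM backends can reach them despite overlay SNAT.
service = {
    "apiVersion": "v1",
    "kind": "Service",
    "metadata": {
        "name": "my-api",          # placeholder name
        "namespace": "apps",       # placeholder namespace
        "annotations": {
            # Keeps the LB private to the VNet instead of public.
            "service.beta.kubernetes.io/azure-load-balancer-internal": "true",
        },
    },
    "spec": {
        "type": "LoadBalancer",
        "selector": {"app": "my-api"},
        "ports": [{"port": 443, "targetPort": 8443}],
    },
}

# kubectl accepts JSON manifests: kubectl apply -f service.json
print(json.dumps(service, indent=2))
```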


Capacity Planning

Pods and Namespaces

Namespaces are purely logical — zero IP cost. Thousands can be created regardless of subnet size.

Pods with CNI Overlay get IPs from the overlay CIDR (default 10.244.0.0/16). Each node gets a /24 from the pod CIDR, supporting up to 250 pods/node. The default max is 250 for CNI Overlay [10].

/27 Subnet Capacity (26 working nodes after surge headroom)

| Max Pods/Node | Total Pods | Use Case |
|---|---|---|
| 30 (min) | 780 | Small dev cluster |
| 110 | 2,860 | Standard workloads |
| 250 (overlay default/max) [10] | 6,500 | Dense packing |

/24 Subnet Capacity (250 working nodes after surge headroom)

| Max Pods/Node | Total Pods | Use Case |
|---|---|---|
| 30 (min) | 7,500 | Conservative |
| 110 | 27,500 | Production at scale |
| 250 (overlay default/max) [10] | 62,500 | Max density |

What Limits What

| Resource | Bottleneck | Subnet-Dependent? |
|---|---|---|
| Namespaces | K8s soft limit ~10,000 | No |
| Pods | Overlay CIDR + (nodes × pods/node) | Indirectly (via node count) |
| Nodes | Min of: subnet usable IPs, pod CIDR /24 blocks | Yes (subnet) and Yes (pod CIDR) |
| Containers | Multiple per pod; no separate IP | No |
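The "min of" relationship between the two node-count limiters can be expressed directly (illustrative helper, not an Azure API):

```python
import ipaddress

def effective_max_nodes(node_subnet: str, pod_cidr: str, max_surge: int = 1) -> int:
    """Node count is capped by BOTH the node subnet (1 IP per node, 5 reserved,
    plus surge headroom) and the pod CIDR (one /24 block per node)."""
    subnet_cap = ipaddress.ip_network(node_subnet).num_addresses - 5 - max_surge
    pod_cidr_cap = 2 ** (24 - ipaddress.ip_network(pod_cidr).prefixlen)
    return min(subnet_cap, pod_cidr_cap)

# With a /24 node subnet and the default /16 pod CIDR, the subnet (250)
# binds before the pod CIDR (256). Subnet addresses are illustrative.
print(effective_max_nodes("10.46.16.0/24", "10.244.0.0/16"))  # 250
print(effective_max_nodes("10.46.15.96/27", "10.244.0.0/16"))  # 26
```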

Surge and Upgrade Headroom

The Problem with a Full /27

The Azure IP planning docs explicitly warn [6]:

"When you upgrade your AKS cluster, a new node is deployed in the cluster. Services and workloads begin to run on the new node, and an older node is removed from the cluster. This rolling upgrade process requires a minimum of one additional block of IP addresses to be available."

For a /27 (27 usable IPs):

| Scenario | Surge Needed | Max Working Nodes | IPs Left for Ops |
|---|---|---|---|
| Default upgrade (max_surge=1) | 1 node | 26 | 0 — fully packed |
| 10% surge on 20 nodes | 2 nodes | 25 | 0 |
| Node replacement (1 failed) | 1 node | 26 | 0 |

Recommendation: In a /27, plan for no more than 25 working nodes to leave operational breathing room for simultaneous surge + node replacement. If the cluster needs >20 nodes, move to a /26 or /24.


Cost Analysis

Subnet size (/27 vs /24) has zero impact on cost — subnets are free. You only pay for nodes (VMs) that exist.

AKS Automatic Pricing Tier

AKS Automatic requires the Standard tier (automatically selected, cannot be changed) [2]:

"Automatic SKU clusters: Must use the Standard tier (automatically selected during cluster creation)." — [2]

| Tier | Hourly Cost | Monthly (~730 hrs) | SKU | Notes |
|---|---|---|---|---|
| Free | $0 | $0 | Base only | Up to 1,000 nodes, no SLA |
| Standard | $0.10/hr | ~$73/mo | Base or Automatic | Required for AKS Automatic. Uptime SLA included. |
| Premium | $0.10/hr + LTS | ~$73/mo + LTS cost | Base only | Includes Long Term Support |

Previous version of this doc incorrectly stated Premium tier at $0.16/hr. Standard tier at $0.10/hr is the correct and only option for AKS Automatic [2].

Idle Cluster Cost (No User Pods)

Even with zero user workloads, AKS Automatic runs system components (CoreDNS, metrics-server, etc.) on system nodes managed by NAP.

| Component | Monthly Estimate | Notes |
|---|---|---|
| AKS Standard tier (required) [2] | ~$73 | $0.10/hr × 730 hrs |
| System node pool (min 2 nodes) [11] | ~$140–280 | 2× Standard_D4s_v5 (NAP-managed size) |
| Load Balancer (Standard) | ~$18 | Base + rules |
| OS Disk (per node) | ~$15–30 | 2× 128 GiB Premium SSD |
| Egress/Logs | ~$5–20 | Minimal for idle |
| **Total idle cluster** | **~$250–420/mo** | |

Note on system nodes: AKS Automatic uses NAP (Karpenter) to manage all node pools, including system nodes. System nodes run on customer-billed VMs — the system pool requires at least 2 nodes with minimum 4 vCPUs each [11]. NAP decides the VM SKU but the compute cost is charged to the customer's subscription.
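The idle-cost total is just the sum of the component ranges in the table above; spelled out (the low/high bounds are this document's estimates, not quoted prices):

```python
# Idle-cluster cost components (USD/month), (low, high) estimate pairs
# taken from the table above.
components = {
    "Standard tier control plane": (73, 73),   # $0.10/hr x 730 hrs
    "System node pool (2 nodes)": (140, 280),
    "Standard Load Balancer": (18, 18),
    "OS disks": (15, 30),
    "Egress / logs": (5, 20),
}

low = sum(lo for lo, _ in components.values())
high = sum(hi for _, hi in components.values())
print(f"~${low}-{high}/mo")  # rounds to the ~$250-420/mo figure above
```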

/27 vs /24 Cost Comparison

| | /27 | /24 |
|---|---|---|
| Subnet cost | $0 | $0 |
| Idle cluster cost | ~$250–420/mo | ~$250–420/mo |
| Max working nodes (with surge) | 26 | 250 |
| Max user pods (default 250/node) | ~6,500 | ~62,500 |

Cost Optimization

az aks stop is NOT supported for NAP-enabled clusters (including all AKS Automatic clusters) [8]:
"You can't stop clusters which use the Node Autoprovisioning (NAP) feature." — [8]

Available cost optimization strategies:

  • Scale user node pools to 0 — NAP automatically removes user nodes when no pods need scheduling (system nodes remain) [8]
  • Spot VMs for user node pools — 60–90% compute savings (supported with AKS Automatic) [1]
  • Right-size workloads — Deployment Safeguards enforces resource requests/limits, preventing over-provisioning [9]
  • Planned maintenance windows — set schedules for auto-upgrades to reduce disruption [1]

Deployment Safeguards (Governance)

Deployment Safeguards are on by default in AKS Automatic [9]:

"Deployment Safeguards is turned on by default in AKS Automatic." — [9]

Enforcement Levels

| Level | Behavior | Default in AKS Automatic? |
|---|---|---|
| Warn | Warning messages displayed; request proceeds | Yes (default) |
| Enforce | Non-compliant deployments denied/mutated | Optional — must be explicitly enabled |

What Gets Enforced

Deployment Safeguards includes these built-in policies [9]:

| Policy | Effect (Warn) | Effect (Enforce) |
|---|---|---|
| Resource requests/limits required | Warning | Mutates: sets defaults (500m CPU, 2Gi memory) and minimums (100m CPU, 100Mi memory) |
| Anti-affinity / topology spread | Warning | Mutates: adds pod anti-affinity and topology spread constraints |
| No `latest` image tag | Warning | Denied |
| Liveness/readiness probes required | Warning | Denied |
| CSI driver for storage classes | Warning | Denied if using in-tree provisioners |
| Reserved system pool taints | Warning | Mutates: removes `CriticalAddonsOnly` from user pools |
| Unique service selectors | Warning | Denied |

Baseline Pod Security Standards

"Baseline Pod Security Standards are now turned on by default in AKS Automatic. The baseline Pod Security Standards in AKS Automatic can't be turned off." — [9]

This enforces restrictions on: host namespaces, privileged containers, host ports, AppArmor profiles, SELinux, /proc mount, seccomp profiles, and sysctls [9].

Governance Model for AI Hub

| Decision | Recommendation |
|---|---|
| Initial level | Start with Warn (default) to audit without blocking |
| Production level | Move to Enforce after reviewing warnings for 2–4 weeks |
| PSS level | Baseline (on by default, cannot be turned off) |
| Custom policies | Add via Azure Policy assignments; no need for third-party engines |
| Namespace exclusions | Exclude infra namespaces (e.g., monitoring) if they need elevated privileges |

Operational Sustainment

What AKS Automatic Manages

| Area | Detail | Source |
|---|---|---|
| Node OS patching | Auto-patched, Azure Linux OS | [1] |
| Node scaling | NAP (Karpenter-based) built-in — no tuning needed | [1] |
| K8s version upgrades | Auto-upgraded; planned maintenance windows supported | [1] |
| etcd | Managed control plane — no backup/restore | |
| Network policy | Cilium built-in — no third-party CNI to manage | [1] |
| Monitoring | Managed Prometheus + Container Insights auto-configured | [1] |
| Policy enforcement | Deployment Safeguards + Baseline PSS on by default | [9] |
| Node resource group | Fully managed — locked to prevent accidental changes | [1] |

What You Still Own

Helm Releases (~40% of ongoing work)

| Task | Frequency | Effort |
|---|---|---|
| Helm chart upgrades (ingress, cert-manager, etc.) | Monthly via Renovate | ~2–4 hrs/mo |
| Helm values drift detection | Continuous (GitOps) | Automated with Flux/ArgoCD |
| Chart breaking changes | Quarterly | ~4–8 hrs per major bump |
| New service onboarding | As needed | ~2–4 hrs per service |

Note: AKS Automatic includes managed NGINX ingress via the application routing add-on [1], which may reduce the need for self-managed ingress Helm charts.

Azure Policy / Deployment Safeguards Management (~20% of ongoing work)

AKS Automatic includes Deployment Safeguards (Azure Policy + Gatekeeper) as the built-in governance engine [9]. This is Microsoft-managed and supported.

| Task | Frequency | Effort |
|---|---|---|
| Review Deployment Safeguards warnings | Weekly | ~1 hr/week |
| Recommend Enforce level after audit period | One-time (after first month) | ~4 hrs |
| Policy exemptions for new workloads | As needed | ~30 min each |
| Namespace exclusion management | As services change | ~1 hr each |

Key advantage: Deployment Safeguards (including upgrades, patches, and Gatekeeper compatibility) are managed by Microsoft as part of AKS Automatic [9]. All-or-nothing — you cannot selectively disable individual policies [9].

Application Concerns (~30% of ongoing work)

| Task | Frequency | Effort |
|---|---|---|
| Deployment troubleshooting | Ongoing | ~2–6 hrs/week |
| Resource quota/limit tuning | Monthly | ~2 hrs |
| Secret rotation (Key Vault CSI driver) | Mostly automated | ~1 hr/quarter |
| Ingress/TLS certificate management | Automated via cert-manager or app routing add-on | ~1–2 hrs/month |
| Custom dashboards and alerts | One-time + iteration | ~4–8 hrs initial, ~2 hrs/month |

Security & Compliance (~10%)

| Task | Frequency | Effort |
|---|---|---|
| Image vulnerability scanning (Defender for Containers) | Continuous | ~1 hr/week triaging |
| RBAC/namespace access reviews | Monthly | ~2 hrs |
| Network policy updates (Cilium) | As services change | ~1–2 hrs each |

Staffing Recommendation

| Scale | FTE Needed | Profile |
|---|---|---|
| AI Hub current (<10 services, 1–2 clusters) | 0.5 FTE | One platform engineer at 50% |
| Medium (20–50 services, 3 envs) | 1–2 FTE | Dedicated platform/SRE |
| Large (100+ services, multi-region) | 3–5 FTE | Platform team with on-call |

Time Saved vs AKS Standard

| Task | AKS Standard | AKS Automatic | Monthly Savings |
|---|---|---|---|
| Node pool management | Manual | Eliminated (NAP) | ~4–8 hrs |
| K8s upgrades | Plan + test + execute | Automatic | ~3–5 hrs (amortized) |
| OS patching | Schedule + drain + cordon | Automatic | ~4 hrs |
| Autoscaler tuning | Manual Cluster Autoscaler | Built-in NAP (Karpenter) | ~4 hrs |
| Monitoring setup | Manual Prometheus stack | Pre-configured | ~16–24 hrs (one-time) |
| Policy setup | Manual Azure Policy config | Deployment Safeguards on by default | ~8 hrs (one-time) |
| **Total** | | | **~30–50 hrs/month** |

Karpenter / NAP Caveats

AKS Automatic uses Node Auto-Provisioning (NAP), which is Microsoft's managed deployment of Karpenter [8]:

"Node auto-provisioning (NAP) simplifies this process by automatically provisioning and managing the optimal VM configuration for your workloads... NAP automatically deploys, configures, and manages Karpenter on your AKS clusters" — [8]

Key Limitations

| Limitation | Impact | Workaround |
|---|---|---|
| `az aks stop` not supported [8] | Cannot stop cluster to save costs | Scale user node pools to 0; system nodes always run |
| Cannot use Cluster Autoscaler alongside NAP [8] | One scaling engine only | NAP replaces Cluster Autoscaler entirely |
| Windows node pools not supported [8] | Linux-only workloads | N/A for AI Hub (Linux-only) |
| Cannot change egress outbound type after creation [8] | Plan egress model at cluster creation | Choose LB/NAT GW/UDR upfront |
| IPv6 clusters not supported [8] | IPv4 only | N/A for AI Hub (IPv4) |
| Service principals not supported [8] | Must use managed identity | Already using managed identity in AI Hub |

NAP vs Self-Hosted Karpenter

| Aspect | NAP (AKS Automatic) | Self-hosted Karpenter |
|---|---|---|
| Installation | Managed by Microsoft | Manual Helm deployment |
| Upgrades | Automatic with cluster | Manual chart upgrades |
| VM selection | Optimized by AKS based on workload | Full control via NodePool CRDs |
| Disruption policies | Configurable via AKS API | Full Karpenter config |
| Support | Microsoft-supported | Community/self-supported |

Recommendation

| Decision | Recommendation | Source |
|---|---|---|
| VNet model | BYO VNet — required for BC Gov Landing Zone integration | [1] |
| Node subnet for dev/test | /27 — 26 working nodes with surge headroom | [6] |
| Node subnet for prod | /25 or /24 — see Prod Sizing Math | [6] |
| Node subnet delegation | None — node subnets must NOT be delegated | [4] |
| API server subnet | Dedicated /28, delegated to `Microsoft.ContainerService/managedClusters` | [7] |
| CNI mode | Azure CNI Overlay (AKS Automatic default) — maximizes node density | [1] |
| Pod CIDR | Default /16 sufficient for <256 nodes; expand if needed | [3] |
| Pricing tier | Standard (only option for Automatic; ~$73/mo) | [2] |
| Policy enforcement | Deployment Safeguards — start at Warn, promote to Enforce | [9] |
| GitOps | Flux (AKS extension) preferred — Microsoft-managed lifecycle | [1] |
| Staffing | 0.5 FTE at current scale | |
| When to start | When a concrete workload needs container orchestration beyond what Container Apps provides | |

Prod Sizing Math

Instead of an arbitrary "4+ /24s", here is the actual math for prod based on BYO VNet:

Single Cluster Scenario

| Component | CIDR Size | IPs Needed | Notes |
|---|---|---|---|
| Node subnet | /24 | 251 usable (250 with surge) | Supports up to 250 working nodes |
| API server VNet integration | /28 | 11 usable | Minimum size, shared across node pools [7] |
| **Total new address space** | /24 + /28 | ~267 IPs | |

Multi-Cluster Scenario (e.g., separate AKS per env)

| Component | Per Cluster | 3 Clusters (dev/test/prod) | Notes |
|---|---|---|---|
| Node subnets | 1 × /27 (dev), 1 × /27 (test), 1 × /24 (prod) | /27 + /27 + /24 | Size per expected scale |
| API server subnets | 1 × /28 each | 3 × /28 | Can share /28 if clusters are in same VNet [7] |
| **Total new address space** | | ~1 /24 + 2 /27s + 3 /28s | ~370 IPs |

What to Request from Landing Zone

| Env | Existing Allocation | Additional Needed for AKS | Request |
|---|---|---|---|
| Dev | 1 × /24 | /27 (node) + /28 (API server) — fits in existing free space | None |
| Test | 2 × /24s | /27 (node) + /28 (API server) — fits in existing free space | None |
| Prod | None | /24 (node) + /28 (API server) | 1 × /24 (or /25 if <123 nodes expected) |

Note: The previous version of this document recommended "4+ /24s" for prod without justification. The actual requirement is 1 additional /24 for a single production AKS cluster, or scale accordingly per cluster count. Each additional AKS cluster needs its own node subnet but can share the API server subnet.
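The sizing figures above reduce to simple subnet arithmetic; a sketch that reproduces them (`usable` is an illustrative helper):

```python
import ipaddress

def usable(prefix_len: int) -> int:
    """Usable IPs in an Azure subnet (5 reserved per subnet)."""
    return ipaddress.ip_network(f"10.0.0.0/{prefix_len}").num_addresses - 5

# Single prod cluster: /24 node subnet + /28 API server subnet.
node_usable = usable(24)        # 251 usable node IPs
api_total = 2 ** (32 - 28)      # 16 total addresses in the /28 block
print(node_usable + api_total)  # 267 -> the "~267 IPs" figure

# Multi-cluster: 1 x /24 + 2 x /27s + 3 x /28s of total address space.
total_addresses = 2**8 + 2 * 2**5 + 3 * 2**4
print(total_addresses)          # 368 -> the "~370 IPs" figure
```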


Sources

| # | Source | URL |
|---|---|---|
| [1] | AKS Automatic Overview | https://learn.microsoft.com/en-us/azure/aks/intro-aks-automatic |
| [2] | AKS Pricing Tiers (Free, Standard, Premium) | https://learn.microsoft.com/en-us/azure/aks/free-standard-pricing-tiers |
| [3] | Azure CNI Overlay Concepts | https://learn.microsoft.com/en-us/azure/aks/concepts-network-azure-cni-overlay |
| [4] | AKS CNI Networking Overview (subnet delegation prohibition) | https://learn.microsoft.com/en-us/azure/aks/concepts-network-cni-overview |
| [5] | Azure CNI Overlay Configuration | https://learn.microsoft.com/en-us/azure/aks/azure-cni-overlay |
| [6] | IP Address Planning for AKS Clusters | https://learn.microsoft.com/en-us/azure/aks/concepts-network-ip-address-planning |
| [7] | NAP Custom VNet (API server subnet delegation) | https://learn.microsoft.com/en-us/azure/aks/node-auto-provisioning-custom-vnet |
| [8] | Node Auto-Provisioning (NAP) Overview + Start/Stop Cluster | https://learn.microsoft.com/en-us/azure/aks/node-auto-provisioning / https://learn.microsoft.com/en-us/azure/aks/start-stop-cluster |
| [9] | Deployment Safeguards | https://learn.microsoft.com/en-us/azure/aks/deployment-safeguards |
| [10] | IP Address Planning — Max Pods per Node | https://learn.microsoft.com/en-us/azure/aks/concepts-network-ip-address-planning#maximum-pods-per-node |
| [11] | Manage System Node Pools | https://learn.microsoft.com/en-us/azure/aks/use-system-pools |
