.github/skills/iac-coder/SKILL.md (1 addition, 0 deletions)
@@ -87,6 +87,7 @@ If a gate cannot be run locally, state exactly what was not run and why.
 - Use Private Endpoints for all PaaS services
 - Set subnets as Private Subnets (Zero Trust)
 - Use existing VNet provided by platform team
+- After purging/recreating the AI Foundry resource, allow ~5 min for PE DNS propagation before running integration tests or the next apply. The destroy script confirms API-level deletion but not DNS propagation. See the [Failure Playbook](references/REFERENCE.md#️⚠️-critical-ai-foundry-private-endpoint-broken-after-purgeapply-deploymentnotfound-404) for full diagnosis steps.
+**Root cause:** The AI Foundry private endpoint (PE) is in a broken/inconsistent state. APIM can resolve the hub hostname but the PE is not correctly routing traffic. Azure returns `DeploymentNotFound` instead of a connectivity error, making it easy to misdiagnose as a deployment naming or RBAC issue.
+
+**Most common trigger:** Manually purging the AI Foundry resource (or a full `terraform destroy` of the shared stack) then immediately re-applying. Even though Azure confirms PE deletion and PE recreation via the API, the PE's NIC/DNS binding can be stale for several minutes after Terraform reports success.
+
+**Diagnosis:**
+1. Rule out APIM/policy by checking if `document-intelligence.bats` passes (per-tenant DocInt resources, different backend). If DocInt passes and OpenAI fails, the issue is the hub PE, not APIM.
+2. Check APIM MSI role: `az role assignment list --scope <hub_id> --assignee <apim_msi_principal_id> --query "[].roleDefinitionName"`
+4. If both are fine, **delete the Foundry private endpoint from the Azure portal** (or via `az network private-endpoint delete`) and re-apply to force a clean PE recreation.
+
+**Fix:** Delete the private endpoint resource and re-apply the `shared` stack. Terraform will recreate it cleanly.
+
+**Teardown script behaviour vs DNS propagation gap:**
+- `deploy-scaled.sh destroy` **does** block until full completion: each phase uses `wait "${pids[$i]}"` and Terraform confirms every resource deleted via the Azure API before exiting.
+- **Gap:** There is no post-destroy sleep for Azure's private DNS propagation after PE deletion. A rapid `destroy` + `apply` of the `shared` stack can recreate the Foundry PE with a stale NIC/DNS binding. Add a manual wait of ~5 minutes between destroy and apply when working with Foundry PE recreation, or delete only the PE (not the whole hub) when possible.
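
A settle delay can also be expressed inside Terraform with the `time_sleep` resource from the `hashicorp/time` provider. The sketch below is an assumption and not part of this repository: `terraform_data.foundry_pe` is a hypothetical stand-in for the real Foundry private endpoint resource, and `foundry_pe_settled` is an illustrative name. It only delays resources that depend on it within a single apply; it does not insert a pause between a separate `destroy` and `apply`, so the manual wait described above may still be needed.

```hcl
terraform {
  required_providers {
    time = {
      source = "hashicorp/time"
    }
  }
}

# Hypothetical stand-in for the Foundry private endpoint resource.
resource "terraform_data" "foundry_pe" {}

# Pause after the PE is created so private DNS has time to propagate.
# Anything that must see a working PE can declare
# depends_on = [time_sleep.foundry_pe_settled].
resource "time_sleep" "foundry_pe_settled" {
  depends_on      = [terraform_data.foundry_pe]
  create_duration = "5m"
}
```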
+
+---
+
 ### Terraform drift or noisy plans
 - Re-check lifecycle blocks and `ignore_changes` intent before adding new ignores.
 - Verify module input defaults and conditional counts are stable.
-description: Guidance for the network module's subnet allocation, CIDR calculations, NSG rules, and delegation requirements in ai-hub-tracking. Use when adding subnets, modifying address space allocation, changing NSG rules, or debugging subnet delegation issues.
+description: Guidance for the network module's subnet allocation, CIDR mapping, NSG rules, PE pool outputs, and delegation requirements in ai-hub-tracking. Use when adding subnets, modifying address allocation, changing NSG rules, updating PE pool logic, or debugging subnet delegation issues.
 ---

 # Network Module Skills

-Use this skill profile when creating or modifying subnet allocation, CIDR calculations, NSG rules, or delegation configuration in the network module.
+Use this skill profile when creating or modifying subnet allocation, CIDR mapping, NSG rules, PE pool outputs, or delegation configuration in the network module.

 ## Use When
 - Adding a new subnet type to the network module
-- Modifying CIDR allocation logic in `locals.tf`
+- Modifying subnet allocation in `params/{env}/shared.tfvars`
 - Changing or debugging NSG security rules for any subnet
 | Shared stack outputs | `infra-ai-hub/stacks/shared/outputs.tf` | PE pool pass-through + backward-compat outputs |
+| Per-env config | `infra-ai-hub/params/{env}/shared.tfvars` → `subnet_allocation` | Full CIDRs per subnet per address space |
+| Tenant PE selection | `infra-ai-hub/stacks/tenant/locals.tf` | PE subnet resolution with 3-tier precedence |
+| APIM PE selection | `infra-ai-hub/stacks/apim/locals.tf` | Pinned PE subnet resolution with fallback |

 ## Architecture

 VNets are pre-provisioned by the BC Gov Landing Zone; the module only creates **subnets within existing VNets**. Subnets are created in the **shared stack** and consumed by downstream stacks via `data.terraform_remote_state.shared`.

 All subnets use `azapi_resource` (not `azurerm_subnet`) because Landing Zone policy requires an NSG at creation time, and `azapi_resource` attaches it atomically.
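
As a rough illustration of the atomic subnet-plus-NSG creation described above, a subnet can be declared through `azapi_resource` with the NSG reference in the same request body. This is a sketch under assumptions, not the module's actual code: it assumes the azapi provider 2.x (where `body` is an HCL object), and the API version, the variable names `vnet_id` and `nsg_id`, and the CIDR are illustrative.

```hcl
terraform {
  required_providers {
    azapi = {
      source  = "Azure/azapi"
      version = ">= 2.0"
    }
  }
}

# Hypothetical inputs: the existing Landing Zone VNet and a pre-created NSG.
variable "vnet_id" {
  type = string
}

variable "nsg_id" {
  type = string
}

# The subnet and its NSG association are sent in one request,
# so the subnet never exists without an NSG.
resource "azapi_resource" "pe_subnet" {
  type      = "Microsoft.Network/virtualNetworks/subnets@2023-04-01"
  name      = "privateendpoints-subnet"
  parent_id = var.vnet_id

  body = {
    properties = {
      addressPrefix                  = "10.0.0.0/27" # made-up CIDR
      privateEndpointNetworkPolicies = "Disabled"
      networkSecurityGroup = {
        id = var.nsg_id
      }
    }
  }
}
```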

-## Current Subnet Allocation Map
+## Subnet Allocation Model (`subnet_allocation`)

-| Subnet | Delegation | Allocation Order |
+The network module uses a single `subnet_allocation` variable of type `map(map(string))`:
+- **Outer key** = address space CIDR (e.g., `"10.x.x.0/24"`)
+- **Inner key** = subnet name (e.g., `"privateendpoints-subnet"`)
+- **Inner value** = full subnet CIDR (e.g., `"10.x.x.0/27"`)
+
+There is **no offset computation**: all CIDRs are explicit in tfvars. The module reads them directly via `merge()`.
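
A minimal sketch of this shape, assuming the module flattens the nested map with `merge()` roughly as shown. The local name `subnet_cidrs`, the output, and all CIDR values are made up for illustration; real values live in `params/{env}/shared.tfvars`.

```hcl
variable "subnet_allocation" {
  description = "Address space CIDR => (subnet name => full subnet CIDR)."
  type        = map(map(string))

  # Illustrative default only.
  default = {
    "10.0.0.0/24" = {
      "privateendpoints-subnet" = "10.0.0.0/27"
      "apim-subnet"             = "10.0.0.32/27"
    }
    "10.0.1.0/24" = {
      "privateendpoints-subnet-1" = "10.0.1.0/27"
    }
  }
}

locals {
  # Collapse the per-address-space maps into one flat map:
  # subnet name => full subnet CIDR.
  subnet_cidrs = merge(values(var.subnet_allocation)...)
}

output "subnet_cidrs" {
  value = local.subnet_cidrs
}
```

Because `merge()` lets later maps overwrite earlier keys, this approach only works if each subnet name appears in exactly one address space.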
+
+### Known Subnet Names
+
+| Subnet Name | Delegation | Purpose |
 |---|---|---|
-| PE | None (`privateEndpointNetworkPolicies = "Disabled"`) | Always first |
-| APIM | `Microsoft.Web/serverFarms` | After PE |
-| AppGW | None (dedicated, no delegation) | After APIM |
-| ACA | `Microsoft.App/environments` | After AppGW |
-| Func | `Microsoft.Web/serverFarms` | After ACA |
+| `privateendpoints-subnet` | None (`privateEndpointNetworkPolicies = "Disabled"`) | Primary PE subnet |
+| `privateendpoints-subnet-<n>` | None | Additional PE pool subnets (`<n>` starts at 1: `-1`, `-2`, ...) |
Optional map of external project names to their peered VNet config. When populated, the network module creates dynamic inbound NSG rules on the APIM subnet allowing direct HTTPS (443) traffic from these peered VNets — bypassing App Gateway. NSGs are stateful, so no outbound mirror rule is needed.
+**Prod**: 4 address spaces (placeholder CIDRs, not yet deployed):
+| Space | Subnet | CIDR | Notes |
+|---|---|---|---|
+| Space 1 | `privateendpoints-subnet` | TBD /24 | PE pool space 1 |
+| Space 2 | `privateendpoints-subnet-1` | TBD /24 | PE pool space 2 |
+| Space 3 | `privateendpoints-subnet-2` | TBD /24 | PE pool space 3 |
+| Space 4 | `apim-subnet`, `appgw-subnet`, `aca-subnet` | TBD /27s | Workload space |
+
+## PE Subnet Pool
+
+The network module automatically derives a PE pool from all subnets whose name starts with `privateendpoints-subnet`:
+- Pool keys use the actual subnet names: `privateendpoints-subnet`, `privateendpoints-subnet-1`, `privateendpoints-subnet-2`, ...
+- Primary PE subnet key is always `privateendpoints-subnet` (the original, suffix-free name)
+- Pool outputs: `private_endpoint_subnet_ids_by_key`, `private_endpoint_subnet_cidrs_by_key`, `private_endpoint_subnet_keys_ordered`
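
The prefix-based derivation could look something like the following. This is a sketch under assumptions rather than the module's code: it needs Terraform 1.3 or later for `startswith`, and the local names and CIDRs are hypothetical.

```hcl
locals {
  # Made-up flat map of subnet name => CIDR.
  example_subnet_cidrs = {
    "privateendpoints-subnet"   = "10.0.0.0/27"
    "privateendpoints-subnet-1" = "10.0.1.0/27"
    "apim-subnet"               = "10.0.0.32/27"
  }

  # Keep only subnets whose name starts with the PE prefix;
  # "apim-subnet" is filtered out.
  pe_pool_cidrs_by_key = {
    for name, cidr in local.example_subnet_cidrs : name => cidr
    if startswith(name, "privateendpoints-subnet")
  }

  # keys() returns map keys in lexical order, which puts the suffix-free
  # primary key first. Lexical order would place "-10" before "-2", so a
  # pool with ten or more subnets would need a different ordering rule.
  pe_pool_keys_ordered = keys(local.pe_pool_cidrs_by_key)
}

output "pe_pool_keys_ordered" {
  value = local.pe_pool_keys_ordered
}
```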
+
+### Downstream PE Consumption
+
+**Tenant stack**: `pe_subnet_key` is **mandatory** for every enabled tenant:
+- **Explicit `pe_subnet_key` in tenant config** (`var.tenants[key].pe_subnet_key`): **ALWAYS set**, validated at plan time
+- Resolution is strict: invalid/missing key in the shared PE pool fails at plan time (no silent fallback)
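
One way to get this strict, plan-time behaviour in Terraform is to index the pool map directly instead of using `lookup()` with a default. The sketch below is illustrative only: `pe_pool_subnet_ids` is a hypothetical stand-in for the shared stack's `private_endpoint_subnet_ids_by_key` output, and the single-tenant `tenant` variable and ID strings are made up.

```hcl
# Hypothetical stand-in for the shared stack's PE pool output.
variable "pe_pool_subnet_ids" {
  type = map(string)
  default = {
    "privateendpoints-subnet"   = "example-subnet-id-0"
    "privateendpoints-subnet-1" = "example-subnet-id-1"
  }
}

# pe_subnet_key is a required attribute, so omitting it is a type error.
variable "tenant" {
  type = object({
    pe_subnet_key = string
  })
  default = {
    pe_subnet_key = "privateendpoints-subnet-1"
  }
}

locals {
  # Direct indexing raises an "Invalid index" error during plan when the
  # key is not in the pool, so there is no silent fallback.
  tenant_pe_subnet_id = var.pe_pool_subnet_ids[var.tenant.pe_subnet_key]
}

output "tenant_pe_subnet_id" {
  value = local.tenant_pe_subnet_id
}
```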
+
+Each tenant creates up to 5 PEs (Key Vault, AI Search, Cosmos DB, Document Intelligence, Speech Services). All PEs for a tenant land on the **same** subnet ("tenant affinity"). Storage Account has no PE (public access in Landing Zone).
+
+Shared stack PEs (AI Foundry Hub, Language Service, Hub Key Vault) always use the primary `privateendpoints-subnet` (~4-5 PEs).
+
+### PE Subnet Assignment Strategy
+
+**Principle: assign-on-first-deploy, sticky forever.** Changing `pe_subnet_key` after deployment destroys and recreates **all 5 tenant PEs** (service disruption + DNS re-propagation).
+
+**Capacity math:**
+- Each `/24` PE subnet holds 251 usable IPs (256 addresses, Azure reserves 5)
+- Each tenant consumes up to 5 PE IPs → ~50 tenants per `/24` subnet
+- Shared stack consumes ~5 PEs on primary subnet (reducing tenant capacity to ~49 on primary)
+- Prod has 3 PE subnets → theoretical max ~149 tenants (49 + 50 + 50)
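
The capacity figures above can be reproduced with a few locals. This is only a worked version of the arithmetic, using the approximate per-tenant and shared-stack PE counts quoted in the list; the local names are hypothetical.

```hcl
locals {
  usable_ips_per_subnet = 256 - 5 # /24 minus Azure's 5 reserved addresses = 251
  pes_per_tenant        = 5       # upper bound per tenant
  shared_stack_pes      = 5       # approximate, primary subnet only
  pe_subnet_count       = 3       # prod

  # floor(251 / 5) = 50
  tenants_per_secondary = floor(local.usable_ips_per_subnet / local.pes_per_tenant)

  # floor((251 - 5) / 5) = 49
  tenants_on_primary = floor(
    (local.usable_ips_per_subnet - local.shared_stack_pes) / local.pes_per_tenant
  )

  # 49 + 2 * 50 = 149
  max_tenants = (
    local.tenants_on_primary +
    (local.pe_subnet_count - 1) * local.tenants_per_secondary
  )
}

output "pe_capacity" {
  value = {
    tenants_on_primary    = local.tenants_on_primary
    tenants_per_secondary = local.tenants_per_secondary
    max_tenants           = local.max_tenants
  }
}
```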
+
+**Assignment rules for new tenants:**
+1. Check current PE count per subnet (Azure Portal → subnet → Connected devices, or `az network vnet subnet show`)
+2. Assign the subnet with the most remaining capacity
+3. Record the key in the tenant's `pe_subnet_key` field; treat it as immutable after first apply
+4. Dev/test environments have only 1 PE subnet → always `"privateendpoints-subnet"`
+
+**Tenant onboarding prerequisite:**
+Every new tenant tfvars **must** include `pe_subnet_key` inside the `tenant = { ... }` block. Terraform plan will fail validation if it is missing. Example:
+```hcl
+pe_subnet_key = "privateendpoints-subnet" # or "privateendpoints-subnet-1", etc.
+```
+
+**APIM stack**: Pinned PE subnet:
+1. Explicit `var.apim_pe_subnet_key` (if set, looks up from shared PE pool)
+2. Fallback to primary `private_endpoint_subnet_id`
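
The two-step APIM resolution can be sketched with a conditional expression. The names `pe_pool_subnet_ids` and `primary_pe_subnet_id` and the ID strings are assumptions for illustration; the real lookup in `infra-ai-hub/stacks/apim/locals.tf` may differ.

```hcl
variable "apim_pe_subnet_key" {
  type    = string
  default = null # unset means "use the primary PE subnet"
}

locals {
  # Hypothetical stand-ins for values read from the shared stack's state.
  pe_pool_subnet_ids = {
    "privateendpoints-subnet"   = "example-subnet-id-0"
    "privateendpoints-subnet-1" = "example-subnet-id-1"
  }
  primary_pe_subnet_id = local.pe_pool_subnet_ids["privateendpoints-subnet"]

  # 1. Explicit key: look it up in the pool (an unknown key fails at plan).
  # 2. No key: fall back to the primary PE subnet.
  apim_pe_subnet_id = (
    var.apim_pe_subnet_key != null
    ? local.pe_pool_subnet_ids[var.apim_pe_subnet_key]
    : local.primary_pe_subnet_id
  )
}

output "apim_pe_subnet_id" {
  value = local.apim_pe_subnet_id
}
```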

-Each infra subnet's offset shifts by 32 for each enabled preceding subnet. See [references/REFERENCE.md](references/REFERENCE.md) for full CIDR calculation algorithm and visual diagrams.
+**Key-rotation / Foundry**: Out of PE pool scope (no PE subnet references).