Commit d03cc5a

overview
1 parent 08c2c28 commit d03cc5a

1 file changed

+331
-0
lines changed

3-Azure_Capacity_Challenges.md

Lines changed: 331 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -1 +1,332 @@
1+
# Azure Capacity Challenges — Causes, Signals, and Proactive Strategies
12

3+
[![GitHub](https://badgen.net/badge/icon/github?icon=github&label)](https://github.com)
4+
[![GitHub](https://img.shields.io/badge/--181717?logo=github&logoColor=ffffff)](https://github.com/)
5+
[brown9804](https://github.com/brown9804)
6+
7+
Last updated: 2025-08-20
8+
9+
-----------------------------
10+
11+
> This community demo is for learning only and uses public documentation. It blends theory and practical examples (no cloud sign-in required). For production guidance, cost/security/compliance, and Azure-specific deployment patterns, contact Microsoft directly: [Microsoft Sales and Support](https://support.microsoft.com/contactus?ContactUsExperienceEntryPointAssetId=S.HP.SMC-HOME)
12+
13+
<details>
14+
<summary><b>List of References</b> (Click to expand)</summary>
15+
16+
- Azure status and service health
17+
- https://status.azure.com
18+
- https://learn.microsoft.com/azure/service-health/overview
19+
- Azure regional services and availability
20+
- https://azure.microsoft.com/global-infrastructure/services/
21+
- https://learn.microsoft.com/azure/availability-zones/az-overview
22+
- VM sizes, SKUs, and quotas
23+
- https://learn.microsoft.com/azure/virtual-machines/sizes
24+
- https://learn.microsoft.com/azure/quotas/quotas-overview
25+
- https://learn.microsoft.com/azure/quotas/per-vm-family-quota-requests
26+
- Capacity error patterns and mitigations
27+
- https://learn.microsoft.com/azure/azure-resource-manager/troubleshooting/error-codes
28+
- https://learn.microsoft.com/azure/virtual-machines/troubleshooting/allocation-failure
29+
- Reservations, savings plans, and scale sets
30+
- https://learn.microsoft.com/azure/cost-management-billing/reservations/save-compute-costs-reservations
31+
- https://learn.microsoft.com/azure/virtual-machine-scale-sets/overview
32+
- AKS scaling and schedulability
33+
- https://learn.microsoft.com/azure/aks/cluster-autoscaler
34+
- https://learn.microsoft.com/azure/aks/start-stop-cluster
35+
- Storage and networking capacity
36+
- https://learn.microsoft.com/azure/storage/common/scalability-targets-standard-account
37+
- https://learn.microsoft.com/azure/azure-resource-manager/management/azure-subscription-service-limits
38+
- Azure Advisor and capacity planning
39+
- https://learn.microsoft.com/azure/advisor/advisor-overview
40+
- Cross-region replication and regional expansion
41+
- https://learn.microsoft.com/azure/reliability/cross-region-replication-azure
42+
43+
</details>
44+
45+
<details>
46+
<summary><b>Table of Contents</b> (Click to expand)</summary>
47+
48+
- [What are Azure Capacity Challenges?](#what-are-azure-capacity-challenges)
49+
- [Why capacity constraints happen](#why-capacity-constraints-happen)
50+
- [Common signals and error codes](#common-signals-and-error-codes)
51+
- [Proactive planning and design](#proactive-planning-and-design)
52+
- [Operational playbooks (runbooks)](#operational-playbooks-runbooks)
53+
- [Automation examples (CLI/PowerShell/Bicep/KQL)](#automation-examples-clipowershellbicepkql)
54+
- [AKS- and PaaS-specific guidance](#aks--and-paas-specific-guidance)
55+
- [Testing, drill, and validation](#testing-drill-and-validation)
56+
- [Cost, reservations, and risk trade-offs](#cost-reservations-and-risk-trade-offs)
57+
- [Checklist](#checklist)
58+
59+
</details>
60+
61+
> Capacity issues in Azure surface in two broad buckets: quota (soft) limits and physical capacity (hard) constraints. Effective designs anticipate both, offer SKU/region flexibility, and automate detection, fallback, and escalation.
62+
63+
## What are Azure Capacity Challenges?
64+
65+
- Soft constraints: subscription/resource quotas (per-VM family cores, public IPs, NICs, vCPU per region, AKS node pools, etc.)
66+
- Hard constraints: regional/AZ scarcity of specific SKUs, transient shortages during incidents, or burst demand for specialized hardware (e.g., GPUs)
67+
- Scope: region-level, zone-level, cluster/rack-level, or specific hardware features (e.g., Ultra Disk, GPUs, NVMe)
68+
69+
<details>
70+
<summary><strong>Capacity risk scenarios</strong></summary>
71+
72+
- New region or AZ not yet enabled for a service/SKU
73+
- Hot SKU (e.g., GPUs, Premium SSD v2, Ultra Disk) in short supply
74+
- Highly constrained shapes (large RAM/CPU, confidential computing)
75+
- Scale-out during an incident or global event
76+
- Zonal pinning creating skew (all demand in a single AZ)
77+
- Strict placement policies (PPG/availability sets) limiting allocatable hosts
78+
79+
</details>
80+
81+
## Why capacity constraints happen
82+
83+
- Demand spikes: seasonal events, marketing launches, or incident-induced migrations
84+
- Hardware specialization: GPUs/NPUs or Ultra Disk clusters are finite per region/AZ
85+
- Zonal affinity: all workloads targeting one zone
86+
- Fixed regional envelopes: datacenter lead times vs. sudden growth
87+
- SKU feature mismatch: the workload requires features that are not offered in the selected region/zone
88+
- Quota not aligned: per-VM-family vCPU quota not raised ahead of scale-out
89+
90+
<details>
91+
<summary><strong>Preventable causes and anti-patterns</strong></summary>
92+
93+
- Single-region dependency without failover
94+
- Tightly constrained SKU choices (one exact size) with no fallbacks
95+
- Overuse of proximity placement groups beyond strict latency needs
96+
- Ignoring per-family quotas during IaC rollouts
97+
- Fixed zonal mappings without elasticity
98+
- Manual-only escalation for quota increases
99+
100+
</details>
101+
102+
## Common signals and error codes
103+
104+
- AllocationFailed: the requested VM size cannot currently be allocated in the target cluster, zone, or region (commonly referred to as an allocation failure)
105+
- OverconstrainedAllocationRequest / ZonalAllocationFailed: constraints prevent placement
106+
- QuotaExceeded: Subscription or per-VM-family quota insufficient
107+
- OperationNotAllowed: Service limit reached (e.g., IPs, NICs, disks)
108+
- SkuNotAvailable: size not available in the selected region/zone, or restricted for the subscription
109+
- InsufficientMemory/InsufficientCores (service-specific messages)
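
The two buckets described earlier (quota vs. physical capacity) lead to different responses, so it can help to classify an error code before deciding what to do. The following is a minimal Python sketch, assuming the codes listed above; the `ConstraintKind` enum and the `classify` helper are hypothetical names for illustration, and treating `SkuNotAvailable` as a capacity signal is a simplification (it can also reflect a subscription restriction).

```python
from enum import Enum


class ConstraintKind(Enum):
    QUOTA = "quota (soft)"
    CAPACITY = "capacity (hard)"
    UNKNOWN = "unknown"


# Codes taken from the list above, lower-cased for case-insensitive matching.
# Both "AllocationFailed" and "AllocationFailure" spellings are accepted.
QUOTA_CODES = {"quotaexceeded", "operationnotallowed"}
CAPACITY_CODES = {
    "allocationfailed",
    "allocationfailure",
    "zonalallocationfailed",
    "overconstrainedallocationrequest",
    "skunotavailable",
}


def classify(error_code: str) -> ConstraintKind:
    """Map a deployment error code to the kind of constraint it suggests."""
    code = error_code.strip().lower()
    if code in QUOTA_CODES:
        return ConstraintKind.QUOTA
    if code in CAPACITY_CODES:
        return ConstraintKind.CAPACITY
    return ConstraintKind.UNKNOWN
```

A quota result points toward a quota increase request, a capacity result toward SKU/zone/region fallback, and anything unknown toward manual triage.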
110+
111+
<details>
112+
<summary><strong>How to confirm and triage</strong></summary>
113+
114+
- Check Service Health and Resource Health for regional advisories
115+
- Query Activity Logs for failed deployments and error codes
116+
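- Run What-If and pre-deployment validation before large template rollouts to surface quota gaps early; treat a clean result as a hint rather than a guarantee, because neither reserves capacity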
117+
- Attempt allocation in alternate zone or region to isolate scope
118+
- Validate SKU availability programmatically
119+
120+
</details>
121+
122+
## Proactive planning and design
123+
124+
- Multi-AZ and multi-region ready: design for N+1 regions with active/active or active/passive
125+
- SKU flexibility: define a prioritized list of sizes per workload class
126+
- Region flexibility: primary/secondary/tertiary region matrix, aligned to data residency
127+
- Zonal elasticity: allow any-of AZs unless strict locality is required
128+
- Quota-as-code: pre-raise quotas in pipelines; track as configuration
129+
- Use scale sets with Flexible orchestration and, where supported, a mix of VM sizes
130+
- Reservations/Savings Plans for steady base; burst on-demand
131+
- For GPUs or Ultra Disk: pre-provision warm capacity with health checks
132+
133+
<details>
134+
<summary><strong>Architecture patterns</strong></summary>
135+
136+
- Active/Active with Front Door or Traffic Manager across 2+ paired regions
137+
- VMSS Flexible Orchestration with multiple SKUs in priority order
138+
- AKS multiple node pools with alternative VM sizes and zones
139+
- Stateless app tier and stateful data layer with geo-replication (ZRS/GRS, AG listener, Cosmos DB multi-region)
140+
- Deployment rings: canary → regional → multi-region
141+
- Feature flags to toggle region or SKU at runtime
142+
143+
</details>
144+
145+
## Operational playbooks (runbooks)
146+
147+
- Detect: Monitor allocation failures and quota nearing thresholds
148+
- Decide: Auto-select alternative AZ/SKU/region according to policy
149+
- Do: Retry with relaxed constraints; escalate quota requests automatically
150+
- Document: Log incidents, annotate cost/latency impacts, and update the allowlists
151+
152+
<details>
153+
<summary><strong>Runbook examples</strong></summary>
154+
155+
- VMSS scale-out fails with AllocationFailed → retry with the next SKU; if it fails again, try the next AZ; if that also fails, shift to the secondary region
156+
- AKS pods stuck in Pending because no node can schedule them → enable or verify the Cluster Autoscaler; add a node pool with an alternative VM size; if specific nodes are the problem, temporarily cordon and drain them
157+
- QuotaExceeded detected during pre-deployment validation → submit a per-family quota increase request programmatically and block the merge until it is approved
158+
- GPU scarcity → shift batch/training to Batch with low-priority or spot VMs in alternate region; queue jobs
159+
160+
</details>
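
The first runbook (next SKU, then next AZ, then secondary region) amounts to walking an ordered list of placements until one succeeds. The following Python sketch shows that ordering only; `Placement`, `candidates`, and `first_allocatable` are hypothetical names, and `try_allocate` stands in for whatever deployment call you actually use.

```python
from itertools import product
from typing import Callable, Iterable, Iterator, NamedTuple, Optional, Sequence


class Placement(NamedTuple):
    region: str
    zone: str
    sku: str


def candidates(regions: Sequence[str], zones: Sequence[str],
               skus: Sequence[str]) -> Iterator[Placement]:
    """Yield placements so that SKU changes first, then zone, then region."""
    # product() varies its last argument fastest, which gives the runbook order.
    for region, zone, sku in product(regions, zones, skus):
        yield Placement(region, zone, sku)


def first_allocatable(plan: Iterable[Placement],
                      try_allocate: Callable[[Placement], bool]) -> Optional[Placement]:
    """Return the first placement the allocator accepts, or None if all fail."""
    for placement in plan:
        if try_allocate(placement):
            return placement
    return None


# Simulated allocator for a drill: pretend zone "1" is out of capacity everywhere.
attempts = []

def fake_allocate(p: Placement) -> bool:
    attempts.append(p)
    return p.zone != "1"

chosen = first_allocatable(
    candidates(["eastus", "westus3"], ["1", "2"],
               ["Standard_D4s_v5", "Standard_D2s_v5"]),
    fake_allocate,
)
```

Keeping the ordering in one generator makes the policy easy to review and to swap, for example to prefer a zone change before a SKU change for latency-sensitive tiers.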
161+
162+
## Automation examples (CLI/PowerShell/Bicep/KQL)
163+
164+
- List available VM sizes/SKUs by region
165+
166+
```powershell
# PowerShell
Get-AzVMSize -Location eastus | Sort-Object Name | Select-Object -First 10
```
170+
171+
- Bicep (snippet): VMSS Flexible with multiple SKUs. Illustrative only; adapt to your module style
175+
176+
```bicep
param location string = resourceGroup().location
param skuPrimary string = 'Standard_D4s_v5'
param skuAlt string = 'Standard_D2s_v5'

// Assumed inputs (not in the original snippet): an existing subnet and an SSH public key
param subnetId string
param adminSshPublicKey string

// Shared Linux settings: key-based login only
var linuxConfig = {
  disablePasswordAuthentication: true
  ssh: {
    publicKeys: [
      {
        path: '/home/azureuser/.ssh/authorized_keys'
        keyData: adminSshPublicKey
      }
    ]
  }
}

resource vmss 'Microsoft.Compute/virtualMachineScaleSets@2024-03-01' = {
  name: 'web-flex'
  location: location
  sku: {
    name: skuPrimary
    capacity: 2
  }
  properties: {
    orchestrationMode: 'Flexible'
    platformFaultDomainCount: 1
    // Rolling upgrades need a health signal (load balancer probe or Application Health extension)
    upgradePolicy: { mode: 'Rolling' }
    // priorityMixPolicy sits beside virtualMachineProfile and applies to Spot-priority profiles:
    // the first 2 instances stay regular priority, instances above that base default to Spot
    priorityMixPolicy: {
      baseRegularPriorityCount: 2
    }
    virtualMachineProfile: {
      priority: 'Spot'
      evictionPolicy: 'Delete'
      osProfile: {
        computerNamePrefix: 'web'
        adminUsername: 'azureuser'
        linuxConfiguration: linuxConfig
      }
      storageProfile: {
        imageReference: {
          publisher: 'Canonical'
          offer: '0001-com-ubuntu-server-jammy'
          sku: '22_04-lts'
          version: 'latest'
        }
      }
      networkProfile: {
        // Flexible orchestration expects an explicit network API version
        networkApiVersion: '2020-11-01'
        networkInterfaceConfigurations: [
          {
            name: 'nic'
            properties: {
              primary: true
              ipConfigurations: [
                {
                  name: 'ipconfig'
                  properties: {
                    subnet: { id: subnetId }
                  }
                }
              ]
            }
          }
        ]
      }
    }
  }
}

// Alternate SKU VM resource to join VMSS Flex as instance
resource vmAlt 'Microsoft.Compute/virtualMachines@2024-03-01' = {
  name: 'web-alt-001'
  location: location
  properties: {
    virtualMachineScaleSet: {
      id: vmss.id
    }
    hardwareProfile: {
      vmSize: skuAlt
    }
    storageProfile: {
      imageReference: {
        publisher: 'Canonical'
        offer: '0001-com-ubuntu-server-jammy'
        sku: '22_04-lts'
        version: 'latest'
      }
    }
    osProfile: {
      computerName: 'web-alt-001'
      adminUsername: 'azureuser'
      linuxConfiguration: linuxConfig
    }
    networkProfile: {
      networkInterfaces: [
        {
          // Assumes a NIC named 'nic-web-alt-001' already exists in this resource group
          id: resourceId('Microsoft.Network/networkInterfaces', 'nic-web-alt-001')
          properties: { primary: true }
        }
      ]
    }
  }
}
```
257+
258+
- Query allocation failures and quotas in Activity Logs and Azure Monitor
259+
260+
```kql
// Activity Logs: VM allocation and quota failures, last 24h
AzureActivity
| where TimeGenerated > ago(24h)
// ActivityStatusValue is usually 'Failure'; 'Failed' is kept for older records
| where OperationNameValue has 'write' and ActivityStatusValue in ('Failure', 'Failed')
// 'has' matches whole terms, so list the full error codes
| where tostring(Properties) has_any ('AllocationFailed', 'AllocationFailure', 'ZonalAllocationFailed', 'OverconstrainedAllocationRequest', 'SkuNotAvailable', 'QuotaExceeded')
| project TimeGenerated, ResourceGroup, Resource, OperationNameValue, ActivityStatusValue, Properties
```
268+
269+
- Programmatically request quota increases
270+
271+
```powershell
# Example: Increase vCPU per-VM family quota
# Note: Use Az.Quota cmdlets when available in your environment
# Fallback to Azure Portal or REST API for specific providers if needed
```
276+
277+
- Validate SKU availability via CLI
278+
279+
```powershell
# Azure CLI, run from a PowerShell session; Select-String keeps only rows mentioning the size
az vm list-skus --location eastus --output table | Select-String D4s_v5
```
283+
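
For use in CI, the same check is easier to automate against JSON output (`az vm list-skus --location eastus --output json`) than against a table. The Python sketch below works on an inline sample instead of a live call. The sample is hand-written to resemble the CLI's JSON and is an assumption, not captured output; `usable_zones` is a hypothetical helper, and it only answers the zonal question, so a SKU offered without zones yields an empty set.

```python
import json
from typing import Set

# Hand-written sample that mimics the assumed shape of the CLI output.
SAMPLE = """
[
  {"name": "Standard_D4s_v5", "resourceType": "virtualMachines",
   "locationInfo": [{"location": "eastus", "zones": ["1", "2", "3"]}],
   "restrictions": [{"type": "Zone", "reasonCode": "NotAvailableForSubscription",
                     "restrictionInfo": {"locations": ["eastus"], "zones": ["3"]}}]},
  {"name": "Standard_D2s_v5", "resourceType": "virtualMachines",
   "locationInfo": [{"location": "eastus", "zones": ["1", "2"]}],
   "restrictions": []}
]
"""


def usable_zones(sku: dict, location: str) -> Set[str]:
    """Zones where the SKU is offered in `location` and not restricted."""
    wanted = location.lower()
    offered: Set[str] = set()
    for info in sku.get("locationInfo", []):
        if info.get("location", "").lower() == wanted:
            offered.update(info.get("zones", []))
    for restriction in sku.get("restrictions", []):
        info = restriction.get("restrictionInfo", {})
        if wanted not in [loc.lower() for loc in info.get("locations", [])]:
            continue
        if restriction.get("type") == "Location":
            return set()  # the whole region is restricted for this SKU
        if restriction.get("type") == "Zone":
            offered -= set(info.get("zones", []))
    return offered


skus = {s["name"]: s for s in json.loads(SAMPLE)}
d4_zones = usable_zones(skus["Standard_D4s_v5"], "eastus")
d2_zones = usable_zones(skus["Standard_D2s_v5"], "eastus")
```

A pipeline could fail early when the set is empty for every SKU in the fallback list, before any deployment is attempted.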
284+
## AKS- and PaaS-specific guidance
285+
286+
- AKS
287+
- Multiple node pools with different VM sizes and zones
288+
- Use Cluster Autoscaler and Pod PriorityClasses for critical workloads
289+
- Consider virtual nodes (ACI) for burst
290+
- Pre-pull container images to reduce cold-start contention
291+
- For GPUs, pre-create tainted GPU pools and schedule with tolerations
292+
293+
- App Service / Functions
294+
- Use multiple worker tiers and regional deployments with Traffic Manager/Front Door
295+
- For Premium plans, pre-warm instances; use scale-out rules with headroom
296+
- Consumption plans: plan for throttling and cold starts; consider Premium for predictability
297+
298+
- Databases
299+
- For SQL MI or Hyperscale, plan capacity with HA/DR replicas in paired regions
300+
- Use ZRS/GRS storage where applicable; monitor IO caps
301+
302+
## Testing, drill, and validation
303+
304+
- Pre-deployment What-If on all IaC changes
305+
- Chaos/scale drills that simulate regional or AZ scarcity
306+
- Blue/green or canary across regions to validate fallback
307+
- Regularly rehearse quota raise workflows and SLAs
308+
- Maintain a sandbox subscription for destructive allocation tests
309+
310+
## Cost, reservations, and risk trade-offs
311+
312+
- Reservations for base capacity (commitment vs. flexibility)
313+
- Savings Plans for broader compute coverage
314+
- Premium SKUs vs. Standard with caching and scale-out
315+
- Cross-region data egress vs. availability objectives
316+
- Spot/low-priority for non-critical batch
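
The commitment-versus-flexibility trade-off for reservations can be reduced to a break-even utilization: a reservation is billed for every hour, while on-demand is billed only for hours actually used. The Python sketch below shows the arithmetic with hypothetical hourly rates (0.20 and 0.12 are made-up numbers, not Azure prices); `break_even_utilization` and `cheaper_option` are illustrative names.

```python
def break_even_utilization(on_demand_hourly: float, reserved_hourly: float) -> float:
    """Fraction of hours a VM must run for a reservation to cost no more than on-demand."""
    if on_demand_hourly <= 0:
        raise ValueError("on_demand_hourly must be positive")
    return reserved_hourly / on_demand_hourly


def cheaper_option(on_demand_hourly: float, reserved_hourly: float,
                   expected_utilization: float) -> str:
    """Pick the cheaper billing model for an expected utilization between 0 and 1."""
    threshold = break_even_utilization(on_demand_hourly, reserved_hourly)
    return "reservation" if expected_utilization >= threshold else "on-demand"


# Hypothetical rates: the reservation pays off once the VM runs about 60% of the time.
steady_base = cheaper_option(0.20, 0.12, expected_utilization=0.9)
bursty_tier = cheaper_option(0.20, 0.12, expected_utilization=0.3)
```

This only compares cost. A reservation is a billing commitment, so it does not replace the SKU, zone, and region fallbacks described above.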
317+
318+
## Checklist
319+
320+
- Defined primary and secondary regions with tested failover
321+
- SKU fallback list encoded in IaC
322+
- Quota thresholds monitored and auto-escalated
323+
- What-If and SKU availability checks in CI
324+
- VMSS/AKS configured for flexible placement
325+
- Incident runbooks documented and exercised
326+
327+
<!-- START BADGE -->
328+
<div align="center">
329+
<img src="https://img.shields.io/badge/Total%20views-1-limegreen" alt="Total views">
330+
<p>Refresh Date: 2025-08-20</p>
331+
</div>
332+
<!-- END BADGE -->
