Commit 180a0ae

1 parent 828786d commit 180a0ae

1 file changed: +299 -2 lines changed
3-Azure_Capacity_Challenges.md

Lines changed: 299 additions & 2 deletions
@@ -51,6 +51,13 @@ Last updated: 2025-08-20
- [Proactive planning and design](#proactive-planning-and-design)
- [Operational playbooks (runbooks)](#operational-playbooks-runbooks)
- [Automation examples (CLI/PowerShell/Bicep/KQL)](#automation-examples-clipowershellbicepkql)
- [Error-to-Action mapping](#error-to-action-mapping)
- [Alerting and auto-remediation](#alerting-and-auto-remediation)
- [CI/CD gates and policy guardrails](#cicd-gates-and-policy-guardrails)
- [Quota-as-Code automation](#quota-as-code-automation)
- [Policy config schema (region/SKU/quotas)](#policy-config-schema-regionskuquotas)
- [IaC quickstart: Action Group + Alerts + Logic App](#iac-quickstart-action-group--alerts--logic-app)
- [SKU/Region fallback playbook](#skuregion-fallback-playbook)
- [AKS- and PaaS-specific guidance](#aks--and-paas-specific-guidance)
- [Testing, drill, and validation](#testing-drill-and-validation)
- [Cost, reservations, and risk trade-offs](#cost-reservations-and-risk-trade-offs)
@@ -281,6 +288,296 @@ AzureActivity
az vm list-skus --location eastus --output table | Select-String D4s_v5
```

## Error-to-Action mapping

- AllocationFailure
  - Immediate: Retry with next allowed SKU (same region, any AZ) via VMSS Flex or parameterized IaC
  - Next: Try alternative AZ; if still failing, try paired/secondary region
  - Follow-up: Open capacity ticket only if pattern persists across AZs/regions; enrich with Activity Log evidence

- SKUNotAvailable
  - Immediate: Switch to nearest-performance SKU or adjacent family (e.g., Dv5 ↔ Ev5) from an approved allowlist
  - Next: Check region availability list; move only burst capacity when possible
  - Follow-up: Update allowlist; revisit reservations/savings plans to align with observed availability

- QuotaExceeded
  - Immediate: Re-balance to other families/regions or temporarily cap scale-out
  - Next: Auto-raise per-VM-family vCPU quota with approval workflow; block rollout until raised
  - Follow-up: Increase proactive thresholds; embed What-If gates in CI

- OverconstrainedAllocationRequest / ZonalAllocationFailed
  - Immediate: Relax constraints (allow any-of zones; remove non-critical PPG)
  - Next: Add alternate SKU or region; retry with wider placement
  - Follow-up: Document minimal viable constraints in design

313+
## Alerting and auto-remediation
314+
315+
- Alerts to create
316+
- Activity log alert: AllocationFailure / SKUNotAvailable / QuotaExceeded events
317+
- Metric alerts: VMSS pending instances, AKS Pending pods > N for M minutes
318+
- Service Health: Regional capacity advisories for target regions
319+
320+
- KQL alert (activity failures)
321+
```kql
322+
AzureActivity
323+
| where TimeGenerated > ago(15m)
324+
| where ActivityStatusValue == 'Failed'
325+
| where Properties has_any ('AllocationFailure','SKUNotAvailable','QuotaExceeded','Overconstrained')
326+
| summarize failures = count() by bin(TimeGenerated, 5m)
327+
| where failures > 0
328+
```
329+
330+
- Auto-remediation patterns
331+
- Logic App/Function: on alert, re-deploy with next SKU/AZ, or create quota request; attach incident context
332+
- Pipeline gate: block infra rollout if capacity alerts fired in last 30 minutes
333+
- Ticketing integration: create/route incident with runbook decision tree
334+
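The pipeline gate in the last list reduces to a simple time-window check once the alert fire times are available. A minimal sketch follows, assuming the pipeline has already fetched those timestamps (how they are fetched is out of scope here); `rollout_allowed` and its parameters are hypothetical names.

```python
from datetime import datetime, timedelta, timezone


def rollout_allowed(alert_times, now=None, quiet_period=timedelta(minutes=30)):
    """Return True only if no capacity alert fired within the quiet period.

    alert_times: iterable of timezone-aware datetimes at which alerts fired.
    """
    now = now or datetime.now(timezone.utc)
    # An alert that is exactly `quiet_period` old still blocks the rollout.
    return all(now - fired > quiet_period for fired in alert_times)
```

A pipeline step could exit non-zero when `rollout_allowed(...)` is false, which blocks the infra rollout until the window has passed.
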
## CI/CD gates and policy guardrails

- Pipeline gates
  - Deployment What-If on all infra PRs; fail when QuotaExceeded is predicted
  - SKU availability probe per target region before rollout
  - Require populated fallback parameters (alt SKUs, secondary region)

- Policy guardrails (examples)
  - Deny disallowed SKUs; Audit PPG usage unless tag reason is present
  - Require minRegions >= 2 for tier-X services
  - Enforce tags: region-priority, sku-allowlist-version

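## Quota-as-Code automation

- Desired state approach
  - Track per-VM-family vCPU quotas by region in config (YAML/JSON)
  - Pipeline reconciles desired vs actual and raises requests ahead of scale events

- Example outline (PowerShell sketch; assumes the Az.Compute module and a signed-in session)
  ```powershell
  # Assumption: usage names follow the pattern 'standard<family>Family' (e.g. standardDSv5Family)
  $desired = @(
    @{ region = 'eastus';  family = 'Dsv5'; vcpus = 200 },
    @{ region = 'eastus2'; family = 'Dsv5'; vcpus = 200 }
  )
  foreach ($q in $desired) {
    # Get current quota (Limit) for $q.family in $q.region
    $usage = Get-AzVMUsage -Location $q.region | Where-Object { $_.Name.Value -eq "standard$($q.family)Family" }
    if ($usage -and $usage.Limit -lt $q.vcpus) {
      # Placeholder: submit the quota increase request and notify approvers here
      Write-Warning "$($q.family) in $($q.region): limit $($usage.Limit) is below desired $($q.vcpus)"
    }
  }
  ```
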
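The reconcile step (desired vs actual) is independent of how quotas are read or raised, so it can be written and tested as a pure function. A minimal sketch under that assumption; `quota_gaps` and the `(region, family)` key shape are illustrative choices, not something this commit defines.

```python
def quota_gaps(desired, actual):
    """Return (region, family, current, wanted) for every quota below its desired value.

    desired / actual: dicts keyed by (region, family) with vCPU limits as values.
    A quota missing from `actual` is treated as 0.
    """
    gaps = []
    for (region, family), wanted in sorted(desired.items()):
        current = actual.get((region, family), 0)
        if current < wanted:
            gaps.append((region, family, current, wanted))
    return gaps


desired = {("eastus", "Dsv5"): 200, ("eastus2", "Dsv5"): 200}
actual = {("eastus", "Dsv5"): 200, ("eastus2", "Dsv5"): 100}
gaps = quota_gaps(desired, actual)  # only eastus2 is below its desired limit
```

Each returned tuple is one quota request to raise ahead of the scale event; an empty list means the pipeline has nothing to do.
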
## Policy config schema (region/SKU/quotas)

- Minimal, repo-friendly schema to drive fallback, placement, and quota reconciliation.

```yaml
# policy.yaml
version: 1
policy:
  regionPriority:
    - eastus
    - eastus2
    - centralus
  skuAllowlist:
    - Standard_D4s_v5
    - Standard_D2s_v5
    - Standard_E4s_v5
  constraints:
    zones: any
    requireAcceleratedNetworking: true
  quotas:
    compute:
      Dsv5:
        eastus: 200
        eastus2: 200
    publicIps:
      eastus: 100
alerts:
  allocationFailure:
    window: PT15M
    threshold: 1
```

```json
{
  "version": 1,
  "policy": {
    "regionPriority": ["eastus", "eastus2", "centralus"],
    "skuAllowlist": ["Standard_D4s_v5", "Standard_D2s_v5", "Standard_E4s_v5"],
    "constraints": {
      "zones": "any",
      "requireAcceleratedNetworking": true
    },
    "quotas": {
      "compute": { "Dsv5": { "eastus": 200, "eastus2": 200 } },
      "publicIps": { "eastus": 100 }
    }
  },
  "alerts": { "allocationFailure": { "window": "PT15M", "threshold": 1 } }
}
```

Guidance
- Source-control the schema; bump version when policy changes. Validate in CI before deploys.
- Feed this into your quota reconciler and your fallback selector to keep behaviors consistent.

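The "validate in CI" step could look roughly like the sketch below, which checks the JSON form of the policy with the standard library only. The specific rules (at least two regions, non-empty allowlist, quota regions listed in `regionPriority`) are assumptions chosen for illustration; `validate_policy` is a hypothetical helper, not something this commit ships.

```python
import json

REQUIRED_POLICY_KEYS = ("regionPriority", "skuAllowlist", "constraints", "quotas")


def validate_policy(doc: dict) -> list:
    """Return a list of human-readable problems; an empty list means the policy passes."""
    errors = []
    if not isinstance(doc.get("version"), int):
        errors.append("version must be an integer")
    policy = doc.get("policy")
    if not isinstance(policy, dict):
        return errors + ["policy must be an object"]
    for key in REQUIRED_POLICY_KEYS:
        if key not in policy:
            errors.append(f"policy.{key} is missing")
    regions = policy.get("regionPriority", [])
    if len(regions) < 2:
        errors.append("policy.regionPriority needs at least two regions for fallback")
    if not policy.get("skuAllowlist"):
        errors.append("policy.skuAllowlist must not be empty")
    # Every compute quota should target a known region and be a positive integer.
    for family, per_region in policy.get("quotas", {}).get("compute", {}).items():
        for region, vcpus in per_region.items():
            if region not in regions:
                errors.append(f"quota for {family} targets unknown region {region}")
            if not isinstance(vcpus, int) or vcpus <= 0:
                errors.append(f"quota for {family}/{region} must be a positive integer")
    return errors


SAMPLE = json.loads("""
{
  "version": 1,
  "policy": {
    "regionPriority": ["eastus", "eastus2", "centralus"],
    "skuAllowlist": ["Standard_D4s_v5", "Standard_D2s_v5", "Standard_E4s_v5"],
    "constraints": {"zones": "any", "requireAcceleratedNetworking": true},
    "quotas": {
      "compute": {"Dsv5": {"eastus": 200, "eastus2": 200}},
      "publicIps": {"eastus": 100}
    }
  },
  "alerts": {"allocationFailure": {"window": "PT15M", "threshold": 1}}
}
""")
problems = validate_policy(SAMPLE)  # the sample policy above passes these checks
```

A CI job would fail the build when the returned list is non-empty and print each entry.
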
## IaC quickstart: Action Group + Alerts + Logic App

- Deploys: Action Group (common schema enabled), Activity Log Alert for capacity errors, KQL Log Alert, and a Logic App (Consumption) that receives the alert via webhook to trigger an automated fallback/runbook.
- API versions aligned with current schemas: actionGroups@2023-01-01, activityLogAlerts@2020-10-01, scheduledQueryRules@2023-12-01, logic/workflows@2019-05-01.

```bicep
// params
param location string = resourceGroup().location
param actionGroupName string = 'cap-alerts-ag'
param actionGroupShort string = 'capag'
param lawResourceId string // Log Analytics workspace resourceId for KQL alert scopes

// Logic App (Consumption) with an HTTP trigger named 'manual'
resource wf 'Microsoft.Logic/workflows@2019-05-01' = {
  name: 'cap-fallback-la'
  location: location
  properties: {
    state: 'Enabled'
    definition: {
      '$schema': 'https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#'
      'contentVersion': '1.0.0.0'
      'parameters': {}
      'triggers': {
        'manual': {
          'type': 'Request'
          'kind': 'Http'
          'inputs': {
            'schema': {}
          }
        }
      }
      'actions': {
        'DecideAndInvoke': {
          'type': 'Http'
          'inputs': {
            'method': 'POST'
            // TODO: replace with your pipeline/runbook endpoint
            'uri': 'https://example.com/fallback-run'
            'headers': {
              'Content-Type': 'application/json'
            }
            'body': {
              // Bicep strings use single quotes; the Logic Apps expression is passed through as-is
              'alert': '@{triggerBody()}'
            }
          }
        }
      }
      'outputs': {}
    }
  }
}

// Build Logic App trigger callback URL for Action Group receiver
var wfTriggerCallback = listCallbackUrl(resourceId('Microsoft.Logic/workflows/triggers', wf.name, 'manual'), '2019-05-01').value

// Action Group with Logic App receiver (Common Alert Schema recommended)
resource ag 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: actionGroupName
  location: 'global'
  properties: {
    enabled: true
    groupShortName: actionGroupShort
    logicAppReceivers: [
      {
        name: 'cap-fallback-la'
        resourceId: wf.id
        callbackUrl: wfTriggerCallback
        useCommonAlertSchema: true
      }
    ]
  }
}

// Activity Log Alert for capacity-related failures
resource ala 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
  name: 'cap-activity-alert'
  location: 'global'
  properties: {
    enabled: true
    scopes: [ subscription().id ]
    condition: {
      allOf: [
        { field: 'status', equals: 'Failed' }
        { field: 'category', equals: 'Administrative' }
        // Match common capacity errors embedded in properties
        { field: 'properties', containsAny: [ 'AllocationFailure', 'SKUNotAvailable', 'QuotaExceeded', 'Overconstrained' ] }
      ]
    }
    actions: {
      actionGroups: [ { actionGroupId: ag.id } ]
    }
    description: 'Route capacity allocation/quota failures to Logic App for auto-remediation.'
  }
}

// Scheduled Query (KQL) Alert over Activity Logs (or over AzureActivity in LAW)
resource kql 'Microsoft.Insights/scheduledQueryRules@2023-12-01' = {
  name: 'cap-kql-alert'
  location: location
  properties: {
    enabled: true
    displayName: 'Capacity allocation failures (KQL)'
    description: 'Detect allocation/quota failures via KQL and invoke action group.'
    severity: 2
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
          query: '''
AzureActivity
| where TimeGenerated > ago(15m)
| where ActivityStatusValue == "Failed"
| where Properties has_any ("AllocationFailure","SKUNotAvailable","QuotaExceeded","Overconstrained")
'''
          // 'Count' counts result rows, so the query returns the raw matching events;
          // a trailing `summarize count()` always yields one row and would fire on every evaluation.
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
        }
      ]
    }
    scopes: [ lawResourceId ]
    actions: {
      actionGroups: [ ag.id ]
      customProperties: {
        scenario: 'capacity-fallback'
      }
    }
    autoMitigate: true
  }
}
```

Notes
- If you prefer, use Azure Verified Modules instead of raw resources: action group (avm/res/insights/action-group), activity log alert (avm/res/insights/activity-log-alert), scheduled query rule (avm/res/insights/scheduled-query-rule), logic app (avm/res/logic/workflow).
- For private Logic App ingress, swap to a function receiver in the Action Group and authorize with an AAD app or MSI.

## SKU/Region fallback playbook

- Decision tree
  ```
  Start → Try Primary SKU in Primary Region (any AZ)
    ├─ Success → Done
    └─ Fail (AllocationFailure/SKUNotAvailable)
         → Try Alt SKU in Primary Region (any AZ)
             ├─ Success → Done
             └─ Fail → Try Primary SKU in Secondary Region (any AZ)
                  ├─ Success → Done
                  └─ Fail → Queue/Defer, or escalate (quota/capacity ticket)
  ```

- Inputs
  - sku_allowlist: [D4s_v5, D2s_v5, E4s_v5]
  - region_priority: [eastus, eastus2, centralus]
  - constraints: requireAcceleratedNetworking=true, zones=any

- Outputs
  - Selected deployment tuple: (region, zone, sku)
  - Incident created if no viable path found

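The decision tree above can be expressed as an ordered list of candidates that is walked until one deploys. The sketch below is one possible reading of it: every allowlisted SKU is tried in the primary region first, then the primary SKU in each fallback region. `candidates`, `select_deployment`, and the `try_deploy` callback are hypothetical names, and the zone is left out of the tuple because the inputs specify `zones=any`.

```python
from typing import Callable, Iterator, Optional, Sequence, Tuple


def candidates(region_priority: Sequence[str],
               sku_allowlist: Sequence[str]) -> Iterator[Tuple[str, str]]:
    """Yield (region, sku) pairs in decision-tree order."""
    primary_region, *fallback_regions = region_priority
    # Primary SKU first, then each alternative SKU, all in the primary region.
    for sku in sku_allowlist:
        yield primary_region, sku
    # Then the primary SKU in each secondary region.
    for region in fallback_regions:
        yield region, sku_allowlist[0]


def select_deployment(region_priority: Sequence[str],
                      sku_allowlist: Sequence[str],
                      try_deploy: Callable[[str, str], bool]) -> Optional[Tuple[str, str]]:
    """Return the first (region, sku) that deploys, or None if no path is viable."""
    for region, sku in candidates(region_priority, sku_allowlist):
        if try_deploy(region, sku):
            return region, sku
    return None  # caller queues/defers or opens an incident


order = list(candidates(["eastus", "eastus2", "centralus"],
                        ["D4s_v5", "D2s_v5", "E4s_v5"]))
```

`try_deploy` stands in for whatever actually attempts the deployment (a pipeline run, an IaC what-if plus apply, and so on); returning `None` corresponds to the "Queue/Defer, or escalate" leaf.
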
## AKS- and PaaS-specific guidance

- AKS
@@ -326,7 +623,7 @@ az vm list-skus --location eastus --output table | Select-String D4s_v5

<!-- START BADGE -->
<div align="center">
-  <img src="https://img.shields.io/badge/Total%20views-1332-limegreen" alt="Total views">
-  <p>Refresh Date: 2025-08-20</p>
+  <img src="https://img.shields.io/badge/Total%20views-1-limegreen" alt="Total views">
+  <p>Refresh Date: 2025-08-20</p>
</div>
<!-- END BADGE -->
