@@ -51,6 +51,13 @@ Last updated: 2025-08-20
- [Proactive planning and design](#proactive-planning-and-design)
- [Operational playbooks (runbooks)](#operational-playbooks-runbooks)
- [Automation examples (CLI/PowerShell/Bicep/KQL)](#automation-examples-clipowershellbicepkql)
- [Error-to-Action mapping](#error-to-action-mapping)
- [Alerting and auto-remediation](#alerting-and-auto-remediation)
- [CI/CD gates and policy guardrails](#cicd-gates-and-policy-guardrails)
- [Quota-as-Code automation](#quota-as-code-automation)
- [Policy config schema (region/SKU/quotas)](#policy-config-schema-regionskuquotas)
- [IaC quickstart: Action Group + Alerts + Logic App](#iac-quickstart-action-group--alerts--logic-app)
- [SKU/Region fallback playbook](#skuregion-fallback-playbook)
- [AKS- and PaaS-specific guidance](#aks--and-paas-specific-guidance)
- [Testing, drill, and validation](#testing-drill-and-validation)
- [Cost, reservations, and risk trade-offs](#cost-reservations-and-risk-trade-offs)
@@ -281,6 +288,296 @@ AzureActivity
az vm list-skus --location eastus --output table | Select-String D4s_v5
```

## Error-to-Action mapping

- AllocationFailure
  - Immediate: Retry with next allowed SKU (same region, any AZ) via VMSS Flex or parameterized IaC
  - Next: Try alternative AZ; if still failing, try paired/secondary region
  - Follow-up: Open capacity ticket only if pattern persists across AZs/regions; enrich with Activity Log evidence

- SKUNotAvailable
  - Immediate: Switch to nearest-performance SKU or adjacent family (e.g., Dv5 ↔ Ev5) from an approved allowlist
  - Next: Check region availability list; move only burst capacity when possible
  - Follow-up: Update allowlist; revisit reservations/savings plans to align with observed availability

- QuotaExceeded
  - Immediate: Re-balance to other families/regions or temporarily cap scale-out
  - Next: Auto-raise per-VM-family vCPU quota with approval workflow; block rollout until raised
  - Follow-up: Increase proactive thresholds; embed What-If gates in CI

- OverconstrainedAllocationRequest / ZonalAllocationFailed
  - Immediate: Relax constraints (allow any-of zones; remove non-critical PPG)
  - Next: Add alternate SKU or region; retry with wider placement
  - Follow-up: Document minimal viable constraints in design

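
The mapping above can be kept as data so that a runbook or function picks the next step consistently. The following is a minimal Python sketch, not part of any Azure SDK: the step labels and the `next_action` helper are hypothetical names chosen for illustration.

```python
from typing import Dict, List

# Ordered steps per error code (Immediate, Next, Follow-up), transcribed from
# the mapping above. The step labels are hypothetical names for this sketch.
ERROR_ACTIONS: Dict[str, List[str]] = {
    "AllocationFailure": ["retry_next_sku", "try_other_zone_or_region", "open_capacity_ticket"],
    "SKUNotAvailable": ["switch_to_allowlisted_sku", "check_region_availability", "update_allowlist"],
    "QuotaExceeded": ["rebalance_or_cap_scale_out", "request_quota_increase", "raise_proactive_thresholds"],
    "OverconstrainedAllocationRequest": ["relax_constraints", "widen_placement", "document_constraints"],
    "ZonalAllocationFailed": ["relax_constraints", "widen_placement", "document_constraints"],
}

# Case-insensitive index, since the casing of error codes can differ between sources.
_BY_LOWER = {code.lower(): steps for code, steps in ERROR_ACTIONS.items()}


def next_action(error_code: str, attempt: int) -> str:
    """Return the step for a zero-based attempt; repeat the last step once exhausted."""
    steps = _BY_LOWER.get(error_code.lower())
    if steps is None:
        return "escalate_unknown_error"
    return steps[min(max(attempt, 0), len(steps) - 1)]
```

For example, `next_action("QuotaExceeded", 1)` selects the quota-increase step, while an unrecognized code falls through to escalation rather than guessing.
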
## Alerting and auto-remediation

- Alerts to create
  - Activity log alert: AllocationFailure / SKUNotAvailable / QuotaExceeded events
  - Metric alerts: VMSS pending instances, AKS Pending pods > N for M minutes
  - Service Health: Regional capacity advisories for target regions

- KQL alert (activity failures)
  ```kql
  AzureActivity
  | where TimeGenerated > ago(15m)
  // Depending on the schema version the status is 'Failure' or 'Failed'; match both.
  | where ActivityStatusValue in ('Failure', 'Failed')
  // has_any matches whole terms, so the error codes are spelled out in full.
  | where Properties has_any ('AllocationFailure','SKUNotAvailable','QuotaExceeded','OverconstrainedAllocationRequest','ZonalAllocationFailed')
  | summarize failures = count() by bin(TimeGenerated, 5m)
  | where failures > 0
  ```

- Auto-remediation patterns
  - Logic App/Function: on alert, re-deploy with next SKU/AZ, or create quota request; attach incident context
  - Pipeline gate: block infra rollout if capacity alerts fired in last 30 minutes
  - Ticketing integration: create/route incident with runbook decision tree

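
The pipeline gate above ("block infra rollout if capacity alerts fired in last 30 minutes") reduces to a time-window check. A minimal Python sketch, assuming the pipeline has already fetched the fired-alert timestamps from its alert source; `rollout_allowed` is a hypothetical helper, not an Azure API.

```python
from datetime import datetime, timedelta, timezone
from typing import Iterable, Optional


def rollout_allowed(
    alert_times: Iterable[datetime],
    now: Optional[datetime] = None,
    quiet_period: timedelta = timedelta(minutes=30),
) -> bool:
    """Return True when no capacity alert fired within the quiet period."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - quiet_period
    # Any alert inside [cutoff, now] blocks the rollout.
    return not any(cutoff <= fired <= now for fired in alert_times)
```

Passing `now` explicitly keeps the check deterministic in tests; a pipeline step would normally omit it and exit non-zero when the function returns `False`.
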
## CI/CD gates and policy guardrails

- Pipeline gates
  - Deployment What-If on all infra PRs; fail when QuotaExceeded is predicted
  - SKU availability probe per target region before rollout
  - Require populated fallback parameters (alt SKUs, secondary region)

- Policy guardrails (examples)
  - Deny disallowed SKUs; Audit PPG usage unless tag reason is present
  - Require minRegions >= 2 for tier-X services
  - Enforce tags: region-priority, sku-allowlist-version

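
The "require populated fallback parameters" gate can be expressed as a small pre-deployment check. This is a sketch under assumptions: the parameter names `primarySku`, `altSkus`, `primaryRegion`, and `secondaryRegion` are hypothetical and should be replaced with whatever your IaC parameter file actually uses.

```python
from typing import Dict, List


def missing_fallbacks(params: Dict[str, object]) -> List[str]:
    """Return a list of problems; an empty list means the gate passes."""
    problems: List[str] = []

    alt_skus = params.get("altSkus") or []
    if not isinstance(alt_skus, list) or not alt_skus:
        problems.append("altSkus must list at least one alternative SKU")
    elif params.get("primarySku") in alt_skus:
        problems.append("altSkus must not repeat primarySku")

    secondary = params.get("secondaryRegion")
    if not secondary:
        problems.append("secondaryRegion is required")
    elif secondary == params.get("primaryRegion"):
        problems.append("secondaryRegion must differ from primaryRegion")

    return problems
```

A CI step could print the returned messages and fail the build whenever the list is non-empty.
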
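## Quota-as-Code automation

- Desired state approach
  - Track per-VM-family vCPU quotas by region in config (YAML/JSON)
  - Pipeline reconciles desired vs actual and raises requests ahead of scale events

- Example outline (PowerShell sketch; the quota request step is left as a placeholder)
  ```powershell
  # Requires the Az.Compute module and an authenticated session (Connect-AzAccount).
  $desired = @(
    @{ region = 'eastus';  family = 'Dsv5'; vcpus = 200 },
    @{ region = 'eastus2'; family = 'Dsv5'; vcpus = 200 }
  )
  foreach ($q in $desired) {
    # Usage names look like 'standardDSv5Family'; -eq compares case-insensitively.
    $usage = Get-AzVMUsage -Location $q.region |
      Where-Object { $_.Name.Value -eq "standard$($q.family)Family" }
    if (-not $usage) { Write-Warning "No usage entry for $($q.family) in $($q.region)"; continue }
    if ($usage.Limit -lt $q.vcpus) {
      # Placeholder: submit a quota increase request and notify approvers here.
      Write-Output "$($q.region)/$($q.family): limit $($usage.Limit) < desired $($q.vcpus)"
    }
  }
  ```
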
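
The reconcile step (desired vs actual) is independent of how the current limits are fetched, so it can be sketched and tested as a pure function. A minimal Python sketch, assuming `desired` follows the family → region → vCPU layout and `current` has been collected beforehand; `QuotaGap` and `quota_gaps` are hypothetical names.

```python
from typing import Dict, List, NamedTuple, Tuple


class QuotaGap(NamedTuple):
    region: str
    family: str
    current: int
    desired: int


def quota_gaps(
    desired: Dict[str, Dict[str, int]],
    current: Dict[Tuple[str, str], int],
) -> List[QuotaGap]:
    """List every (region, family) whose current limit is below the desired one."""
    gaps: List[QuotaGap] = []
    for family, by_region in sorted(desired.items()):
        for region, target in sorted(by_region.items()):
            # A missing entry is treated as a limit of 0 (assumption of this sketch).
            limit = current.get((region, family), 0)
            if limit < target:
                gaps.append(QuotaGap(region, family, limit, target))
    return gaps
```

Each returned gap would then feed the approval workflow; sorting the keys keeps the output stable between pipeline runs.
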
## Policy config schema (region/SKU/quotas)

- Minimal, repo-friendly schema to drive fallback, placement, and quota reconciliation.

```yaml
# policy.yaml
version: 1
policy:
  regionPriority:
    - eastus
    - eastus2
    - centralus
  skuAllowlist:
    - Standard_D4s_v5
    - Standard_D2s_v5
    - Standard_E4s_v5
  constraints:
    zones: any
    requireAcceleratedNetworking: true
  quotas:
    compute:
      Dsv5:
        eastus: 200
        eastus2: 200
    publicIps:
      eastus: 100
alerts:
  allocationFailure:
    window: PT15M
    threshold: 1
```

```json
{
  "version": 1,
  "policy": {
    "regionPriority": ["eastus", "eastus2", "centralus"],
    "skuAllowlist": ["Standard_D4s_v5", "Standard_D2s_v5", "Standard_E4s_v5"],
    "constraints": {
      "zones": "any",
      "requireAcceleratedNetworking": true
    },
    "quotas": {
      "compute": { "Dsv5": { "eastus": 200, "eastus2": 200 } },
      "publicIps": { "eastus": 100 }
    }
  },
  "alerts": { "allocationFailure": { "window": "PT15M", "threshold": 1 } }
}
```

Guidance
- Source-control the schema; bump version when policy changes. Validate in CI before deploys.
- Feed this into your quota reconciler and your fallback selector to keep behaviors consistent.

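
The "validate in CI" guidance can start with a few structural checks on the parsed document. The sketch below is one possible validator, not a formal schema: the specific rules (unique regions, quota regions drawn from `regionPriority`, `PT…` durations) are assumptions layered on the example above.

```python
import re
from typing import Any, Dict, List

# Accepts time-only ISO 8601 durations such as PT15M or PT1H30M.
_DURATION = re.compile(r"^PT(?=\d)(\d+H)?(\d+M)?(\d+S)?$")


def validate_policy(doc: Dict[str, Any]) -> List[str]:
    """Return human-readable errors; an empty list means the policy looks sane."""
    errors: List[str] = []
    if not isinstance(doc.get("version"), int):
        errors.append("version must be an integer")

    policy = doc.get("policy") or {}
    regions = policy.get("regionPriority") or []
    if not regions or len(set(regions)) != len(regions):
        errors.append("policy.regionPriority must be a non-empty list without duplicates")
    if not policy.get("skuAllowlist"):
        errors.append("policy.skuAllowlist must not be empty")

    # Collect every region -> limit map: one per compute family, plus publicIps.
    quotas = policy.get("quotas") or {}
    region_maps = list((quotas.get("compute") or {}).values())
    region_maps.append(quotas.get("publicIps") or {})
    for by_region in region_maps:
        for region, limit in by_region.items():
            if region not in regions:
                errors.append(f"quota region '{region}' is not in regionPriority")
            if not isinstance(limit, int) or limit < 0:
                errors.append(f"quota for '{region}' must be a non-negative integer")

    for name, rule in (doc.get("alerts") or {}).items():
        if not _DURATION.match(str(rule.get("window", ""))):
            errors.append(f"alerts.{name}.window must be a duration such as PT15M")
    return errors
```

The JSON form of the policy can be loaded with `json.load` and passed in directly; the YAML form would need a YAML parser of your choice first.
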
## IaC quickstart: Action Group + Alerts + Logic App

- Deploys: Action Group (common schema enabled), Activity Log Alert for capacity errors, KQL Log Alert, and a Logic App (Consumption) that receives the alert via webhook to trigger an automated fallback/runbook.
- API versions used below: actionGroups@2023-01-01, activityLogAlerts@2020-10-01, scheduledQueryRules@2023-12-01, logic/workflows@2019-05-01.

```bicep
// params
param location string = resourceGroup().location
param actionGroupName string = 'cap-alerts-ag'
param actionGroupShort string = 'capag'
param lawResourceId string // Log Analytics workspace resourceId for KQL alert scopes

// Logic App (Consumption) with an HTTP trigger named 'manual'
resource wf 'Microsoft.Logic/workflows@2019-05-01' = {
  name: 'cap-fallback-la'
  location: location
  properties: {
    state: 'Enabled'
    definition: {
      '$schema': 'https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#'
      'contentVersion': '1.0.0.0'
      'parameters': {}
      'triggers': {
        'manual': {
          'type': 'Request'
          'kind': 'Http'
          'inputs': {
            'schema': {}
          }
        }
      }
      'actions': {
        'DecideAndInvoke': {
          'type': 'Http'
          'inputs': {
            'method': 'POST'
            // TODO: replace with your pipeline/runbook endpoint
            'uri': 'https://example.com/fallback-run'
            'headers': {
              'Content-Type': 'application/json'
            }
            'body': {
              // Bicep strings use single quotes; the Logic Apps expression is passed through as-is
              'alert': '@{triggerBody()}'
            }
          }
        }
      }
      'outputs': {}
    }
  }
}

// Build Logic App trigger callback URL for Action Group receiver
var wfTriggerCallback = listCallbackUrl(resourceId('Microsoft.Logic/workflows/triggers', wf.name, 'manual'), '2019-05-01').value

// Action Group with Logic App receiver (Common Alert Schema recommended)
resource ag 'Microsoft.Insights/actionGroups@2023-01-01' = {
  name: actionGroupName
  location: 'global'
  properties: {
    enabled: true
    groupShortName: actionGroupShort
    logicAppReceivers: [
      {
        name: 'cap-fallback-la'
        resourceId: wf.id
        callbackUrl: wfTriggerCallback
        useCommonAlertSchema: true
      }
    ]
  }
}

// Activity Log Alert for capacity-related failures
resource ala 'Microsoft.Insights/activityLogAlerts@2020-10-01' = {
  name: 'cap-activity-alert'
  location: 'global'
  properties: {
    enabled: true
    scopes: [ subscription().id ]
    condition: {
      allOf: [
        { field: 'status', equals: 'Failed' }
        { field: 'category', equals: 'Administrative' }
        // Match common capacity errors embedded in properties.
        // Verify this condition against real failed events before relying on it.
        { field: 'properties', containsAny: [ 'AllocationFailure', 'SKUNotAvailable', 'QuotaExceeded', 'Overconstrained' ] }
      ]
    }
    actions: {
      actionGroups: [ { actionGroupId: ag.id } ]
    }
    description: 'Route capacity allocation/quota failures to Logic App for auto-remediation.'
  }
}

// Scheduled Query (KQL) Alert over AzureActivity in the Log Analytics workspace
resource kql 'Microsoft.Insights/scheduledQueryRules@2023-12-01' = {
  name: 'cap-kql-alert'
  location: location
  properties: {
    enabled: true
    displayName: 'Capacity allocation failures (KQL)'
    description: 'Detect allocation/quota failures via KQL and invoke action group.'
    severity: 2
    evaluationFrequency: 'PT5M'
    windowSize: 'PT15M'
    criteria: {
      allOf: [
        {
          // No trailing summarize: 'Count' measures the matching rows, so the rule
          // fires only when at least one failure is present in the window.
          query: '''
AzureActivity
| where TimeGenerated > ago(15m)
| where ActivityStatusValue in ("Failure", "Failed")
| where Properties has_any ("AllocationFailure","SKUNotAvailable","QuotaExceeded","OverconstrainedAllocationRequest","ZonalAllocationFailed")
'''
          timeAggregation: 'Count'
          operator: 'GreaterThan'
          threshold: 0
        }
      ]
    }
    scopes: [ lawResourceId ]
    actions: {
      actionGroups: [ ag.id ]
      customProperties: {
        scenario: 'capacity-fallback'
      }
    }
    autoMitigate: true
  }
}
```

Notes
- If you prefer, use Azure Verified Modules instead of raw resources: action group (avm/res/insights/action-group), activity log alert (avm/res/insights/activity-log-alert), scheduled query rule (avm/res/insights/scheduled-query-rule), logic app (avm/res/logic/workflow).
- For private Logic App ingress, swap to a function receiver in the Action Group and authorize with a Microsoft Entra ID (formerly AAD) app or a managed identity (MSI).

## SKU/Region fallback playbook

- Decision tree
  ```
  Start → Try Primary SKU in Primary Region (any AZ)
    ├─ Success → Done
    └─ Fail (AllocationFailure/SKUNotAvailable)
         → Try Alt SKU in Primary Region (any AZ)
           ├─ Success → Done
           └─ Fail → Try Primary SKU in Secondary Region (any AZ)
                ├─ Success → Done
                └─ Fail → Queue/Defer, or escalate (quota/capacity ticket)
  ```

- Inputs
  - sku_allowlist: [D4s_v5, D2s_v5, E4s_v5]
  - region_priority: [eastus, eastus2, centralus]
  - constraints: requireAcceleratedNetworking=true, zones=any

- Outputs
  - Selected deployment tuple: (region, zone, sku)
  - Incident created if no viable path found

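
The decision tree can be sketched as an ordered walk over candidate tuples. This Python sketch generalizes the tree slightly, which is an assumption: it tries every allowlisted SKU in a region before moving to the next region. `try_deploy` is a hypothetical callable supplied by the caller (for example, a wrapper around your deployment pipeline) that returns `True` on success.

```python
from typing import Callable, List, Optional, Sequence, Tuple

Candidate = Tuple[str, str, str]  # (region, zone, sku)


def candidates(regions: Sequence[str], skus: Sequence[str], zone: str = "any") -> List[Candidate]:
    """Region-major order: all SKUs in the primary region, then the next region."""
    return [(region, zone, sku) for region in regions for sku in skus]


def select_deployment(
    regions: Sequence[str],
    skus: Sequence[str],
    try_deploy: Callable[[str, str, str], bool],
    zone: str = "any",
) -> Optional[Candidate]:
    """Return the first tuple that deploys, or None when no viable path exists."""
    for candidate in candidates(regions, skus, zone):
        if try_deploy(*candidate):
            return candidate
    return None  # Caller queues/defers or opens an incident.
```

A `None` result corresponds to the last branch of the tree: queue or defer the request, or escalate with a quota/capacity ticket.
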
## AKS- and PaaS-specific guidance

- AKS
@@ -326,7 +623,7 @@ az vm list-skus --location eastus --output table | Select-String D4s_v5
326623
327624<!-- START BADGE -->
328625<div align =" center " >
329- <img src =" https://img.shields.io/badge/Total%20views-1332 -limegreen " alt =" Total views " >
330- <p >Refresh Date: 2025-08-20</p >
626+ <img src="https://img.shields.io/badge/Total%20views-1 -limegreen" alt="Total views">
627+ <p>Refresh Date: 2025-08-20</p>
331628</div >
332629<!-- END BADGE -->