This implementation provides balanced multi-zone pod distribution for StatefulSets in AKS, EKS, and GKE clusters to prevent quorum loss during zone failures while ensuring reliable scheduling.
- V4_CFG_MULTI_ZONE_ENABLED: Master switch for multi-zone distribution (default: true)
- V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED: RabbitMQ distribution control (default: true)
- V4_CFG_MULTI_ZONE_POSTGRES_ENABLED: PostgreSQL distribution control (default: true)
- V4_CFG_MULTI_ZONE_CONSUL_ENABLED: Consul distribution control (default: true)
- V4_CFG_MULTI_ZONE_REDIS_ENABLED: Redis distribution control (default: true)
- V4_CFG_MULTI_ZONE_OPENDISTRO_ENABLED: OpenDistro/OpenSearch distribution control (default: true)
- V4_CFG_MULTI_ZONE_WORKLOAD_ORCHESTRATOR_ENABLED: Workload Orchestrator distribution control (default: true)
- V4_CFG_MULTI_ZONE_DATA_AGENT_ENABLED: Data Agent Server distribution control (default: true)
- V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: Restrict to stateful nodepools (default: true)
- V4_CFG_STATEFUL_NODEPOOL_LABEL: Label for stateful nodepool identification (default: "workload.sas.com/class")
- V4_CFG_MULTI_ZONE_AUTO_DETECT: Automatically detect multi-zone clusters (default: true)
- V4_CFG_SINGLE_ZONE_FALLBACK: Apply relaxed constraints for single-zone clusters (default: true)
V4_CFG_MULTI_ZONE_ENABLED: true
V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED: true
V4_CFG_MULTI_ZONE_POSTGRES_ENABLED: true
V4_CFG_MULTI_ZONE_CONSUL_ENABLED: true
V4_CFG_MULTI_ZONE_REDIS_ENABLED: true
V4_CFG_MULTI_ZONE_OPENDISTRO_ENABLED: true
V4_CFG_MULTI_ZONE_WORKLOAD_ORCHESTRATOR_ENABLED: true
V4_CFG_MULTI_ZONE_DATA_AGENT_ENABLED: true
V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: true
V4_CFG_STATEFUL_NODEPOOL_LABEL: "workload.sas.com/class"
V4_CFG_MULTI_ZONE_AUTO_DETECT: true
V4_CFG_SINGLE_ZONE_FALLBACK: true
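As an illustration, a deployment that keeps multi-zone distribution enabled but opts one workload out might look like the fragment below. This is only a sketch: it assumes each per-workload flag can be set to false independently while the master switch stays true, which is inferred from the variable descriptions rather than confirmed here. The choice of Redis is arbitrary.

```yaml
# Sketch only: assumes per-workload flags can be disabled independently
# of the master switch (inferred from the variable descriptions).
V4_CFG_MULTI_ZONE_ENABLED: true
V4_CFG_MULTI_ZONE_REDIS_ENABLED: false   # hypothetical choice: exclude Redis
```

Under that assumption, variables left unset would keep their documented defaults.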
- Zone Distribution: maxSkew: 1 on topology.kubernetes.io/zone with DoNotSchedule
  - Strict enforcement at zone level to prevent concentration
  - Ensures StatefulSet replicas are distributed across availability zones
  - Primary protection against zone failures (PSCLOUD-64 resolution)
- Node Distribution: maxSkew: 1 on kubernetes.io/hostname with ScheduleAnyway
  - Best-effort spreading at node level without blocking scheduling
  - Kubernetes attempts to spread pods across different nodes when possible
  - Will not prevent pod scheduling if perfect node balance cannot be achieved
  - Prevents scheduling deadlock when combined with zone-level constraints
- Required Node Affinity: Configurable nodepool label restriction (default: workload.sas.com/class=stateful)
  - Ensures StatefulSets only schedule on nodes with the specified stateful nodepool label
  - Prevents cross-nodepool scheduling that could compromise zone isolation
  - Supports both modern (workload.sas.com/class) and legacy (agentpool) label formats
- Host Distribution: Preferred anti-affinity for kubernetes.io/hostname
  - Attempts to spread pods across different nodes when possible
  - Uses weight: 100 preference (not required)
- Zone Failure Protection: Distributes StatefulSet replicas across availability zones
- Nodepool Isolation: Prevents StatefulSets from mixing with stateless workloads
- Quorum Safety: Single zone failure won't compromise StatefulSet availability
- Reliable Scheduling: Balanced constraints allow successful deployment
- Multi-Cloud Support: Works with AKS, EKS, and GKE
- Comprehensive Coverage: Supports 7 critical StatefulSet workloads
- Automatic Detection: Auto-detects multi-zone clusters and applies appropriate constraints
- Single-Zone Fallback: Gracefully handles single-zone deployments with relaxed constraints
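For orientation, the three scheduling mechanisms described earlier (zone and node topology spread, required node affinity, preferred host anti-affinity) could appear in a StatefulSet pod template roughly as follows. This is a minimal sketch, not the manifest this implementation generates: the `app.kubernetes.io/name` selector label and its value are assumptions chosen for illustration, and the node affinity uses the default label `workload.sas.com/class=stateful`.

```yaml
# Illustrative StatefulSet fragment; selector labels are assumed, not taken
# from the actual generated manifests.
spec:
  template:
    spec:
      topologySpreadConstraints:
        # Strict zone-level spreading: a pod stays Pending rather than
        # letting zone skew exceed 1.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: sas-rabbitmq-server   # assumed label
        # Best-effort node-level spreading: never blocks scheduling.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: sas-rabbitmq-server   # assumed label
      affinity:
        # Required: only nodes carrying the stateful nodepool label.
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload.sas.com/class
                    operator: In
                    values:
                      - stateful
        # Preferred: avoid co-locating replicas on the same host.
        podAntiAffinity:
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: sas-rabbitmq-server   # assumed label
```

Only the zone-level constraint and the node affinity are hard requirements in this sketch; the hostname spread and the anti-affinity merely influence scoring.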
This implementation provides multi-zone distribution for the following StatefulSet workloads:
- sas-rabbitmq-server - Message queue service
- sas-crunchy-platform-postgres - PostgreSQL database (Crunchy operator)
- sas-consul-server - Service discovery and configuration
- sas-redis-server - Caching and session store
- sas-opendistro - Search and logging infrastructure (OpenSearch/OpenDistro)
- sas-workload-orchestrator - Job scheduling and orchestration
- sas-data-agent-server-colocated - Data agent services
Enable in your ansible-vars.yaml:
V4_CFG_MULTI_ZONE_ENABLED: true
V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED: true
V4_CFG_MULTI_ZONE_POSTGRES_ENABLED: true
V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: true
Ensure your stateful nodepool is labeled correctly. The default label is:
kubectl label nodes <stateful-node> workload.sas.com/class=stateful
You can customize the nodepool label using:
V4_CFG_STATEFUL_NODEPOOL_LABEL: "workload.sas.com/class"
For legacy deployments using the agentpool label:
kubectl label nodes <stateful-node> agentpool=stateful
Chaos testing was performed to validate multi-zone resilience by cordoning all nodes in a zone and deleting StatefulSet pods to simulate complete zone failure.
Test Scenario:
- Cordoned all stateful nodes in a single zone
- Deleted the pods (RabbitMQ, Consul, Redis) that were running in the cordoned zone
- Monitored rescheduling behavior and constraint enforcement
Observed Behavior:
- Deleted pods entered Pending state and could not reschedule to remaining zones
- Topology constraints prevented scheduling that would violate maxSkew: 1
- With current distribution 0-1-1 (after zone-1 failure), scheduling to either remaining zone would create 0-2-1 or 0-1-2 distribution (skew = 2), which violates the constraint
- Pods remained Pending until the failed zone was recovered (nodes uncordoned)
- Once the zone became available, pods automatically rescheduled and restored balanced distribution
Validation Result: Topology constraints working as designed
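The skew arithmetic behind this result can be written out explicitly. The sketch below assumes three zones with one replica each before the failure, and that the cordoned zone still counts as a topology domain holding zero matching pods, which is consistent with the observed behavior. The symbols $n_z$ (matching pods in zone $z$) and $z^*$ (candidate zone) are introduced here for illustration only.

```latex
\[
\begin{aligned}
\mathrm{skew}(z^*) &= (n_{z^*} + 1) - \min_{z} n_z \\
(n_1, n_2, n_3) &= (0, 1, 1), \qquad \min_{z} n_z = 0 \\
\mathrm{skew}(2) = \mathrm{skew}(3) &= (1 + 1) - 0 = 2 > \mathtt{maxSkew} = 1 \\
\mathrm{skew}(1) &= (0 + 1) - 0 = 1 \le \mathtt{maxSkew}
\end{aligned}
\]
```

Only zone 1 satisfies the constraint, and its nodes are cordoned, so the pod has no schedulable placement and stays Pending.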
Production Deployment Note:
The hostname-level constraint uses ScheduleAnyway (best-effort) to ensure StatefulSets
can schedule successfully even when perfect node-level balance is not achievable. This
prevents scheduling deadlock while maintaining strict zone-level protection. Zone-level
distribution remains strictly enforced with DoNotSchedule to prevent concentration.
Complete Zone Failure Behavior:
- When an entire availability zone becomes unavailable (all nodes cordoned/failed), affected StatefulSet pods cannot reschedule to remaining zones
- Pods remain in Pending state until the failed zone recovers
- This is the intended behavior with the strict zone-level constraint: maxSkew: 1 + whenUnsatisfiable: DoNotSchedule
Why This is Acceptable:
- Primary Goal Achieved: Prevents cross-nodepool pods from concentrating in a single zone during normal operations
- Rare Scenario: Complete zone failures are uncommon (Azure, AWS, and GCP multi-zone SLAs are typically 99.99%)
- Planned Maintenance: Production zone maintenance is typically planned, allowing for graceful pod draining
- Trade-off Decision: Temporary unavailability during zone outage vs. chronic concentration risk in normal operations
- Production Safety: Hostname-level constraint uses ScheduleAnyway to prevent scheduling issues during normal operations while zone-level remains strict
Recovery: Once the zone becomes available again, pods automatically reschedule and rebalance:
kubectl uncordon <zone-nodes>
# Pods reschedule automatically to restore balanced distribution
If different scheduling behavior is required, consider:
Option A: Strict Hostname Enforcement
whenUnsatisfiable: DoNotSchedule # For both zone AND hostname
- Warning: May cause scheduling deadlock in constrained environments
- Only recommended for clusters with abundant stateful node capacity
Option B: Relax Zone Constraint
# Zone-level
whenUnsatisfiable: ScheduleAnyway # Allows zone concentration
# Hostname-level
whenUnsatisfiable: ScheduleAnyway # Current: best-effort spreading
- Warning: Weakens primary PSCLOUD-64 protection
- Not recommended for production multi-zone deployments
Option C: Increase Zone maxSkew
maxSkew: 2 # Allows more imbalanced zone distribution
- Warning: Permits concentration (e.g., 0-2-1 or 1-3-2 distribution)
- Reduces protection against zone failures
Current Implementation (Recommended): Uses strict zone enforcement (DoNotSchedule, maxSkew: 1) with best-effort hostname spreading (ScheduleAnyway, maxSkew: 1) to balance zone protection with reliable scheduling.