Multi-Zone StatefulSet Distribution - Implementation Guide

Overview

This implementation provides balanced multi-zone pod distribution for StatefulSets in AKS, EKS, and GKE clusters to prevent quorum loss during zone failures while ensuring reliable scheduling.

Configuration Variables

Core Settings (roles/vdm/defaults/main.yaml)

  • V4_CFG_MULTI_ZONE_ENABLED: Master switch for multi-zone distribution (default: true)
  • V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED: RabbitMQ distribution control (default: true)
  • V4_CFG_MULTI_ZONE_POSTGRES_ENABLED: PostgreSQL distribution control (default: true)
  • V4_CFG_MULTI_ZONE_CONSUL_ENABLED: Consul distribution control (default: true)
  • V4_CFG_MULTI_ZONE_REDIS_ENABLED: Redis distribution control (default: true)
  • V4_CFG_MULTI_ZONE_OPENDISTRO_ENABLED: OpenDistro/OpenSearch distribution control (default: true)
  • V4_CFG_MULTI_ZONE_WORKLOAD_ORCHESTRATOR_ENABLED: Workload Orchestrator distribution control (default: true)
  • V4_CFG_MULTI_ZONE_DATA_AGENT_ENABLED: Data Agent Server distribution control (default: true)
  • V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: Restrict StatefulSets to stateful nodepools (default: true)
  • V4_CFG_STATEFUL_NODEPOOL_LABEL: Label key used to identify stateful nodepools (default: "workload.sas.com/class")
  • V4_CFG_MULTI_ZONE_AUTO_DETECT: Automatically detect multi-zone clusters (default: true)
  • V4_CFG_SINGLE_ZONE_FALLBACK: Apply relaxed constraints for single-zone clusters (default: true)

Usage in ansible-vars.yaml

V4_CFG_MULTI_ZONE_ENABLED: true
V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED: true
V4_CFG_MULTI_ZONE_POSTGRES_ENABLED: true
V4_CFG_MULTI_ZONE_CONSUL_ENABLED: true
V4_CFG_MULTI_ZONE_REDIS_ENABLED: true
V4_CFG_MULTI_ZONE_OPENDISTRO_ENABLED: true
V4_CFG_MULTI_ZONE_WORKLOAD_ORCHESTRATOR_ENABLED: true
V4_CFG_MULTI_ZONE_DATA_AGENT_ENABLED: true
V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: true
V4_CFG_STATEFUL_NODEPOOL_LABEL: "workload.sas.com/class"
V4_CFG_MULTI_ZONE_AUTO_DETECT: true
V4_CFG_SINGLE_ZONE_FALLBACK: true
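
Because every switch defaults to true, a vars file only needs to list the settings that differ from the defaults. The sketch below assumes that setting a per-component switch to false opts only that component out while the master switch stays on. That interaction is inferred from the variable descriptions above and is not confirmed by this guide.

```yaml
# Hypothetical ansible-vars.yaml excerpt: keep multi-zone distribution on,
# but opt Redis out (assumes per-component switches act independently).
V4_CFG_MULTI_ZONE_ENABLED: true
V4_CFG_MULTI_ZONE_REDIS_ENABLED: false
# All other V4_CFG_MULTI_ZONE_* switches are left at their default (true).
```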

Implementation Details

Topology Spread Constraints (Balanced Approach)

  • Zone Distribution: maxSkew: 1 on topology.kubernetes.io/zone with DoNotSchedule
    • Strict enforcement at zone level to prevent concentration
    • Ensures StatefulSet replicas are distributed across availability zones
    • Primary protection against zone failures (PSCLOUD-64 resolution)
  • Node Distribution: maxSkew: 1 on kubernetes.io/hostname with ScheduleAnyway
    • Best-effort spreading at node level without blocking scheduling
    • Kubernetes attempts to spread pods across different nodes when possible
    • Will not prevent pod scheduling if perfect node balance cannot be achieved
    • Prevents scheduling deadlock when combined with zone-level constraints
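
As a rough illustration, the two constraints above could appear in a StatefulSet pod template as shown below. This is a minimal sketch, not the manifest generated by this implementation: the labelSelector and its app.kubernetes.io/name value are assumptions chosen for the example.

```yaml
# Illustrative pod-template fragment. The labelSelector is a hypothetical
# example and is not taken from the actual deployment assets.
spec:
  template:
    spec:
      topologySpreadConstraints:
        # Strict zone-level spreading: a pod stays Pending rather than
        # exceed maxSkew across zones.
        - maxSkew: 1
          topologyKey: topology.kubernetes.io/zone
          whenUnsatisfiable: DoNotSchedule
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: sas-rabbitmq-server  # assumed label
        # Best-effort node-level spreading: influences scoring only and
        # never blocks scheduling.
        - maxSkew: 1
          topologyKey: kubernetes.io/hostname
          whenUnsatisfiable: ScheduleAnyway
          labelSelector:
            matchLabels:
              app.kubernetes.io/name: sas-rabbitmq-server  # assumed label
```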

Node Affinity (Nodepool Restriction)

  • Required Node Affinity: Configurable nodepool label restriction (default: workload.sas.com/class=stateful)
    • Ensures StatefulSets only schedule on nodes with the specified stateful nodepool label
    • Prevents cross-nodepool scheduling that could compromise zone isolation
    • Supports both modern (workload.sas.com/class) and legacy (agentpool) label formats
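
A required node affinity rule for the default label might look like the following. It is a sketch under the assumption that the label key comes from V4_CFG_STATEFUL_NODEPOOL_LABEL and that the value is stateful. How the implementation handles the legacy agentpool key is not shown here.

```yaml
# Illustrative pod-template fragment using the default label key and value.
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          # "required" makes this a hard rule: nodes without the label
          # are never candidates for these pods.
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: workload.sas.com/class
                    operator: In
                    values:
                      - stateful
```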

Preferred Pod Anti-Affinity

  • Host Distribution: Preferred anti-affinity for kubernetes.io/hostname
    • Attempts to spread pods across different nodes when possible
    • Uses a preferred (not required) rule with weight: 100
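
The preference described above could be expressed roughly as follows. The labelSelector is an assumption for illustration. The weight and topology key come from the description above.

```yaml
# Illustrative pod-template fragment. The labelSelector is hypothetical.
spec:
  template:
    spec:
      affinity:
        podAntiAffinity:
          # "preferred" is a soft rule: the scheduler favors nodes that do
          # not already run a matching pod, but will co-locate if it must.
          preferredDuringSchedulingIgnoredDuringExecution:
            - weight: 100
              podAffinityTerm:
                topologyKey: kubernetes.io/hostname
                labelSelector:
                  matchLabels:
                    app.kubernetes.io/name: sas-consul-server  # assumed label
```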

Key Benefits

  • Zone Failure Protection: Distributes StatefulSet replicas across availability zones
  • Nodepool Isolation: Prevents StatefulSets from mixing with stateless workloads
  • Quorum Safety: Single zone failure won't compromise StatefulSet availability
  • Reliable Scheduling: Balanced constraints allow successful deployment
  • Multi-Cloud Support: Works with AKS, EKS, and GKE
  • Comprehensive Coverage: Supports 7 critical StatefulSet workloads
  • Automatic Detection: Auto-detects multi-zone clusters and applies appropriate constraints
  • Single-Zone Fallback: Gracefully handles single-zone deployments with relaxed constraints

Supported StatefulSets

This implementation provides multi-zone distribution for the following StatefulSet workloads:

  1. sas-rabbitmq-server - Message queue service
  2. sas-crunchy-platform-postgres - PostgreSQL database (Crunchy operator)
  3. sas-consul-server - Service discovery and configuration
  4. sas-redis-server - Caching and session store
  5. sas-opendistro - Search and logging infrastructure (OpenSearch/OpenDistro)
  6. sas-workload-orchestrator - Job scheduling and orchestration
  7. sas-data-agent-server-colocated - Data agent services

Usage

Enable in your ansible-vars.yaml:

V4_CFG_MULTI_ZONE_ENABLED: true
V4_CFG_MULTI_ZONE_RABBITMQ_ENABLED: true
V4_CFG_MULTI_ZONE_POSTGRES_ENABLED: true
V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: true

Nodepool Requirements

Ensure your stateful nodepool is labeled correctly. The default label is:

kubectl label nodes <stateful-node> workload.sas.com/class=stateful

You can customize the nodepool label using:

V4_CFG_STATEFUL_NODEPOOL_LABEL: "workload.sas.com/class"

For legacy deployments using the agentpool label:

kubectl label nodes <stateful-node> agentpool=stateful
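
For a legacy cluster, the matching vars setting would presumably point the label key at agentpool. This is an assumption based on the variable description, so confirm it against roles/vdm/defaults/main.yaml before relying on it.

```yaml
# Hypothetical ansible-vars.yaml excerpt for a cluster whose stateful
# nodes carry agentpool=stateful (assumed mapping, not confirmed here).
V4_CFG_STATEFUL_NODEPOOL_RESTRICTION: true
V4_CFG_STATEFUL_NODEPOOL_LABEL: "agentpool"
```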

Chaos Testing & Validation

Zone Failure Simulation Results

Chaos testing was performed to validate multi-zone resilience by cordoning all nodes in a zone and deleting StatefulSet pods to simulate complete zone failure.

Test Scenario:

  • Cordoned all stateful nodes in a single zone
  • Deleted the pods (RabbitMQ, Consul, Redis) that were running in the cordoned zone
  • Monitored rescheduling behavior and constraint enforcement

Observed Behavior:

  • Deleted pods entered Pending state and could not reschedule to remaining zones
  • Topology constraints prevented scheduling that would violate maxSkew: 1
  • With the distribution at 0-1-1 after the zone-1 failure (the cordoned zone still counts as a topology domain with zero pods), placing a pod in either remaining zone would produce 0-2-1 or 0-1-2 (skew = 2), which violates maxSkew: 1
  • Pods remained Pending until the failed zone was recovered (nodes uncordoned)
  • Once the zone became available, pods automatically rescheduled and restored balanced distribution

Validation Result: Topology constraints working as designed

Production Deployment Note: The hostname-level constraint uses ScheduleAnyway (best-effort) to ensure StatefulSets can schedule successfully even when perfect node-level balance is not achievable. This prevents scheduling deadlock while maintaining strict zone-level protection. Zone-level distribution remains strictly enforced with DoNotSchedule to prevent concentration.

Known Limitation (By Design)

Complete Zone Failure Behavior:

  • When an entire availability zone becomes unavailable (all nodes cordoned/failed), affected StatefulSet pods cannot reschedule to remaining zones
  • Pods remain in Pending state until the failed zone recovers
  • This is the intended behavior of the strict zone-level constraint: maxSkew: 1 + whenUnsatisfiable: DoNotSchedule

Why This Is Acceptable:

  1. Primary Goal Achieved: Prevents StatefulSet pods from concentrating in a single zone during normal operations
  2. Rare Scenario: Complete zone failures are uncommon (Azure, AWS, and GCP publish multi-zone availability SLAs on the order of 99.99%)
  3. Planned Maintenance: Production zone maintenance is typically planned, allowing for graceful pod draining
  4. Trade-off Decision: Temporary unavailability during zone outage vs. chronic concentration risk in normal operations
  5. Production Safety: Hostname-level constraint uses ScheduleAnyway to prevent scheduling issues during normal operations while zone-level remains strict

Recovery: Once the zone becomes available again, pods automatically reschedule and rebalance:

kubectl uncordon <zone-nodes>
# Pods reschedule automatically to restore balanced distribution

Alternative Constraint Options

If different scheduling behavior is required, consider:

Option A: Strict Hostname Enforcement

whenUnsatisfiable: DoNotSchedule  # For both zone AND hostname

  • Warning: May cause scheduling deadlock in constrained environments
  • Only recommended for clusters with abundant stateful node capacity

Option B: Relax Zone Constraint

# Zone-level
whenUnsatisfiable: ScheduleAnyway  # Allows zone concentration

# Hostname-level
whenUnsatisfiable: ScheduleAnyway  # Current: best-effort spreading

  • Warning: Weakens primary PSCLOUD-64 protection
  • Not recommended for production multi-zone deployments

Option C: Increase Zone maxSkew

maxSkew: 2  # Allows more imbalanced zone distribution

  • Warning: Permits concentration (e.g., 0-2-1 or 1-3-2 distribution)
  • Reduces protection against zone failures
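
In context, Option C changes only the zone-level entry. The sketch below shows where maxSkew sits and how it relates to the chaos-test scenario: with a 0-1-1 distribution, moving to 0-2-1 has a skew of 2, which maxSkew: 2 would permit. The labelSelector is a hypothetical example.

```yaml
# Zone-level constraint with the relaxed skew from Option C.
# The labelSelector is a hypothetical example.
topologySpreadConstraints:
  - maxSkew: 2  # 0-1-1 -> 0-2-1 is now permitted (skew = 2)
    topologyKey: topology.kubernetes.io/zone
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app.kubernetes.io/name: sas-redis-server  # assumed label
```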

Current Implementation (Recommended): Uses strict zone enforcement (DoNotSchedule, maxSkew: 1) with best-effort hostname spreading (ScheduleAnyway, maxSkew: 1) to balance zone protection with reliable scheduling.