@@ -125,18 +130,25 @@ Chaos testing was performed to validate multi-zone resilience by cordoning all n
 
 **Validation Result**: Topology constraints working as designed

+**Production Deployment Note**:
+The hostname-level constraint uses `ScheduleAnyway` (best-effort) to ensure StatefulSets
+can schedule successfully even when perfect node-level balance is not achievable. This
+prevents scheduling deadlock while maintaining strict zone-level protection. Zone-level
+distribution remains strictly enforced with `DoNotSchedule` to prevent concentration.
+
 ### Known Limitation (By Design)

 **Complete Zone Failure Behavior**:
 - When an entire availability zone becomes unavailable (all nodes cordoned/failed), affected StatefulSet pods **cannot reschedule** to remaining zones
 - Pods remain in `Pending` state until the failed zone recovers
-- This is the intended behavior with `maxSkew: 1` + `whenUnsatisfiable: DoNotSchedule`
+- This is the intended behavior with the strict zone-level constraint: `maxSkew: 1` + `whenUnsatisfiable: DoNotSchedule`

 **Why This is Acceptable**:
 1. **Primary Goal Achieved**: Prevents cross-nodepool pods from concentrating in a single zone during normal operations
 2. **Rare Scenario**: Complete zone failures are uncommon (Azure/AWS/GCP multi-zone compute SLAs are 99.99%)
 3. **Planned Maintenance**: Production zone maintenance is typically planned, allowing for graceful pod draining
 4. **Trade-off Decision**: Temporary unavailability during a zone outage vs. chronic concentration risk in normal operations
+5. **Production Safety**: The hostname-level constraint uses `ScheduleAnyway` to prevent scheduling issues during normal operations while the zone-level constraint remains strict

 **Recovery**:
 Once the zone becomes available again, pods automatically reschedule and rebalance:
- Warning: Weakens constraint enforcement during normal operations
+- Warning: Weakens primary PSCLOUD-64 protection
+- Not recommended for production multi-zone deployments

-**Option B: Increase `maxSkew`**
+**Option C: Increase Zone maxSkew**
 ```yaml
-maxSkew: 2  # Allows 0-2-1 distribution during zone failure
+maxSkew: 2  # Allows more imbalanced zone distribution
 ```
-- Warning: Permits less balanced distribution in normal conditions
+- Warning: Permits concentration (e.g., 0-2-1 or 1-3-2 distribution)
+- Reduces protection against zone failures

-**Current Implementation**: Uses strict enforcement (`DoNotSchedule`, `maxSkew: 1`) to prioritize prevention of zone concentration during normal operations.
+**Current Implementation (Recommended)**: Uses strict zone enforcement (`DoNotSchedule`, `maxSkew: 1`) with best-effort hostname spreading (`ScheduleAnyway`, `maxSkew: 1`) to balance zone protection with reliable scheduling.
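
For reference, the recommended configuration described in this change could be expressed as a pair of `topologySpreadConstraints` entries in the StatefulSet pod template. This is a minimal sketch, not the manifest from the commit. The topology keys are the standard Kubernetes well-known node labels, assumed here because the text only says "zone-level" and "hostname-level". The `app: example-statefulset` selector is a hypothetical placeholder.

```yaml
# Sketch only: goes under spec.template.spec of the StatefulSet.
# The label selector below is a placeholder, not taken from the commit.
topologySpreadConstraints:
  # Zone level (strict): a pod stays Pending rather than push the
  # difference in pod count between zones above 1.
  - maxSkew: 1
    topologyKey: topology.kubernetes.io/zone   # assumed zone label
    whenUnsatisfiable: DoNotSchedule
    labelSelector:
      matchLabels:
        app: example-statefulset               # hypothetical label
  # Hostname level (best-effort): the scheduler prefers balanced nodes
  # but still places the pod when perfect balance is not achievable.
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname        # assumed node label
    whenUnsatisfiable: ScheduleAnyway
    labelSelector:
      matchLabels:
        app: example-statefulset               # hypothetical label
```

With the zone entry set to `DoNotSchedule`, a three-zone 1-1-1 layout that loses one zone cannot place the displaced pod elsewhere, since 0-2-1 would have a skew of 2. This matches the known limitation noted in the first hunk.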