Refactor replica handling to prevent scale-to-zero scenarios #518
Merged
otrosien merged 5 commits into zalando-incubator:master on Jan 19, 2026
Signed-off-by: Andreas Drobisch <dro@unkonstant.de>
Implements multiple layers of protection to ensure ElasticsearchDataSets never scale to zero replicas, which would cause complete cluster unavailability:
- Layer 1: Add validation requiring minReplicas >= 1 when autoscaling enabled
- Layer 2: Enforce absolute minimum of 1 in autoscaler bounds enforcement
- Layer 3: Update scaleEDS fallback logic to never allow 0
- Layer 4: Ensure edsReplicas enforces minimum of 1 in calculations
- Layer 5: Add safety check in reconcileStatefulset before StatefulSet ops
Also adds structured logging at key decision points and comprehensive test coverage to verify scale-to-zero prevention works across all code paths.
This strengthens the existing scale-to-zero regression fix by adding multiple safety nets that catch edge cases including:
- Configuration errors (minReplicas=0)
- Manual kubectl patches (spec.replicas=0)
- Autoscaler edge cases (excludeSystemIndices filtering all indices)
Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
One-line summary
Refactors how EDS replica values are handled, including the related fallbacks and interfaces, and implements defense-in-depth protection against scale-to-zero scenarios.
Description
Problem
After #516, we identified a regression where the operator would scale the StatefulSet of an existing EDS to zero replicas under specific conditions:
Regression scenario:
- `spec.scaling.enabled: true`
- `spec.excludeSystemIndices: true`
- `Autoscaler.calculateScalingOperation()` returns early with `noop` because `len(managedIndices) == 0`
- `edsReplicas() = Replicas()` returns `0` because scaling is enabled and `spec.replicas` is nil
- `operatePods` scales to `0`, causing complete cluster unavailability
Root Cause
The issue stemmed from ambiguous handling of nil vs. zero replica values:
- When `spec.replicas` was nil, different code paths made conflicting assumptions
- Nothing prevented `minReplicas=0` when autoscaling was enabled
Solution
Part 1: Pointer-based Replicas API (commits c8daefd, 89e1a58, f1d176d)
Key changes:
- `Replicas()` and `edsReplicas()` now return `*int32` (pointer) instead of `int32`
- `nil` explicitly means "no safe/defined value yet": don't touch replicas
- `scaleEDS()` initializes `spec.replicas` with defensive fallback logic
- `rescaleStatefulSet()` and `operatePods()` bail out early if desired replicas are `nil`
- Split `rescaleStatefulSet()` into smaller functions to reduce cyclomatic complexity
Benefits:
- `nil` = uninitialized, non-nil = explicit value
Part 2: Defense-in-Depth Scale-to-Zero Prevention (commit 008d647)
To ensure scale-to-zero never happens under any circumstances, this PR implements five layers of protection:
Layer 1: API Validation
File: `operator/elasticsearch.go:1046-1095`
Added validation to require `minReplicas >= 1` when autoscaling is enabled.
Protects against: Configuration errors at resource creation/update time
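A minimal sketch of what such a validation rule could look like. The `scalingSpec` type and the `validateMinReplicas` function are hypothetical, simplified stand-ins and not the operator's actual types or code:

```go
package main

import "fmt"

// scalingSpec is a hypothetical, simplified stand-in for the EDS scaling settings.
type scalingSpec struct {
	Enabled     bool
	MinReplicas int32
}

// validateMinReplicas returns an error when autoscaling is enabled and
// minReplicas would allow the data set to shrink to zero nodes.
func validateMinReplicas(s *scalingSpec) error {
	if s == nil || !s.Enabled {
		return nil // autoscaling is off, so this rule does not apply
	}
	if s.MinReplicas < 1 {
		return fmt.Errorf("minReplicas must be >= 1 when autoscaling is enabled, got %d", s.MinReplicas)
	}
	return nil
}

func main() {
	fmt.Println(validateMinReplicas(&scalingSpec{Enabled: true, MinReplicas: 0}))
	fmt.Println(validateMinReplicas(&scalingSpec{Enabled: true, MinReplicas: 1}))
}
```

Rejecting the value at validation time means a misconfigured resource fails early instead of relying on the later layers to correct it.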
Layer 2: Autoscaler Bounds Enforcement
File: `operator/autoscaler.go:248-258`
Enhanced `ensureBoundsNodeReplicas()` to enforce an absolute minimum of 1.
Protects against: Autoscaler calculations that result in zero (edge cases, bugs)
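The idea of an absolute floor on top of the configured bounds can be sketched as follows. `clampNodeReplicas` and its parameters are hypothetical; the real `ensureBoundsNodeReplicas()` may have a different signature and take its bounds from the EDS spec:

```go
package main

import "fmt"

// absoluteMinReplicas is the floor applied regardless of configuration.
const absoluteMinReplicas int32 = 1

// clampNodeReplicas keeps desired within [minReplicas, maxReplicas], but never
// lets the lower bound drop below absoluteMinReplicas.
func clampNodeReplicas(desired, minReplicas, maxReplicas int32) int32 {
	lower := minReplicas
	if lower < absoluteMinReplicas {
		lower = absoluteMinReplicas
	}
	upper := maxReplicas
	if upper < lower {
		upper = lower // assumption: an inconsistent max is raised to the lower bound
	}
	if desired < lower {
		return lower
	}
	if desired > upper {
		return upper
	}
	return desired
}

func main() {
	// A calculation that yields 0 with a misconfigured minReplicas of 0.
	fmt.Println(clampNodeReplicas(0, 0, 5))
}
```

Even if validation is bypassed and `minReplicas` is 0, the clamped result stays at 1 or above.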
Layer 3: Fallback Logic Safety
File: `operator/elasticsearch.go:938-952`
Updated `scaleEDS()` fallback to never allow 0.
Protects against: Initialization with zero when status and minReplicas are both zero
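One way to picture this fallback, assuming the order "current status, then `minReplicas`, then 1". The helper `initialReplicas` and that ordering are assumptions for illustration; the PR text only states that the fallback must never produce 0:

```go
package main

import "fmt"

// initialReplicas picks a starting value for spec.replicas when it is unset.
// Assumed order: observed status replicas, then minReplicas, then a hard floor of 1.
func initialReplicas(statusReplicas, minReplicas int32) int32 {
	if statusReplicas > 0 {
		return statusReplicas // keep what is already running
	}
	if minReplicas > 0 {
		return minReplicas // fall back to the configured minimum
	}
	return 1 // both are zero: never initialize with 0
}

func main() {
	fmt.Println(initialReplicas(3, 1))
	fmt.Println(initialReplicas(0, 2))
	fmt.Println(initialReplicas(0, 0))
}
```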
Layer 4: Replica Calculation Safety
File: `operator/elasticsearch.go:826-845`
Updated `edsReplicas()` to enforce minimum of 1.
Protects against: Calculation logic returning zero in any scenario
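Combined with the pointer-based API from Part 1, such a floor might look like the sketch below. `safeReplicas` is a hypothetical helper; whether the real `edsReplicas()` applies the floor in every mode is not stated in this description:

```go
package main

import "fmt"

// safeReplicas preserves nil ("no defined value yet") and raises any defined
// value below 1 up to 1. It returns a copy so the caller's value is not mutated.
func safeReplicas(specReplicas *int32) *int32 {
	if specReplicas == nil {
		return nil
	}
	r := *specReplicas
	if r < 1 {
		r = 1
	}
	return &r
}

func main() {
	zero := int32(0)
	fmt.Println(*safeReplicas(&zero))
	fmt.Println(safeReplicas(nil) == nil)
}
```

Returning a copy rather than writing through the input pointer keeps the spec value intact for later comparison or logging.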
Layer 5: Operator Reconciliation Safety
File: `operator/operator.go:187-198`
Added validation before creating/updating StatefulSet.
Protects against: Any code path that attempts to write zero to StatefulSet
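A last-line check of this kind could be sketched as below, assuming it refuses to proceed rather than silently correcting the value. The function name, the error, and the choice to return an error instead of clamping are all assumptions:

```go
package main

import (
	"errors"
	"fmt"
)

// errUnsafeReplicas is a hypothetical sentinel error for this sketch.
var errUnsafeReplicas = errors.New("refusing to apply fewer than 1 replica to the StatefulSet")

// checkStatefulSetReplicas is meant to run right before the StatefulSet is
// created or updated. nil means "leave replicas untouched" and is allowed.
func checkStatefulSetReplicas(desired *int32) error {
	if desired == nil {
		return nil
	}
	if *desired < 1 {
		return fmt.Errorf("%w (got %d)", errUnsafeReplicas, *desired)
	}
	return nil
}

func main() {
	zero := int32(0)
	fmt.Println(checkStatefulSetReplicas(&zero))
}
```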
Observability Enhancements
Added structured logging at all critical decision points, so that operators can debug scale-to-zero prevention in production.
Comprehensive Test Coverage
Added three new test functions:
- TestValidateScalingSettingsRejectsZeroMinReplicas (`elasticsearch_test.go`): `minReplicas=0` with autoscaling enabled is rejected
- TestAutoscalerEnforcesMinimumOneReplica (`autoscaler_test.go`): `ensureBoundsNodeReplicas()` never returns 0
- TestEDSReplicasEnforcesMinimumOne (`elasticsearch_test.go`): `edsReplicas()` enforces minimum of 1
Updated existing test documentation to clarify that `minReplicas=0` should be rejected by validation.
Key API Changes
Before:
After:
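A minimal sketch of the signature change described in Part 1, using a simplified hypothetical `edsSpec` type rather than the operator's actual definitions. The "before" variant collapses an unset value into 0, while the "after" variant keeps `nil` distinguishable so callers can bail out:

```go
package main

import "fmt"

// edsSpec is a hypothetical, simplified stand-in for the EDS spec.
type edsSpec struct {
	Replicas *int32
}

// replicasBefore mirrors the old int32-returning style: nil becomes 0,
// which callers cannot tell apart from an explicit 0.
func replicasBefore(s edsSpec) int32 {
	if s.Replicas == nil {
		return 0
	}
	return *s.Replicas
}

// replicasAfter mirrors the new *int32-returning style: nil is preserved
// and means "no safe/defined value yet".
func replicasAfter(s edsSpec) *int32 {
	return s.Replicas
}

// scaleIfDefined applies the desired replica count only when it is defined,
// and reports whether it did anything.
func scaleIfDefined(desired *int32, apply func(int32)) bool {
	if desired == nil {
		return false // bail out early: don't touch replicas
	}
	apply(*desired)
	return true
}

func main() {
	unset := edsSpec{}
	fmt.Println(replicasBefore(unset))        // nil silently became a number
	fmt.Println(replicasAfter(unset) == nil)  // nil stays visible to the caller
	fmt.Println(scaleIfDefined(replicasAfter(unset), func(int32) {}))
}
```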
Edge Cases Addressed
The defense-in-depth approach protects against:
- Configuration errors: `minReplicas=0` rejected by validation
- Manual patches: `kubectl patch eds --type=merge -p '{"spec":{"replicas":0}}'` → caught by Layer 5
- Autoscaler edge cases: `excludeSystemIndices=true` + only system indices → caught by Layers 2-4
Testing
Key test results:
- `TestValidateScalingSettings/test_minReplicas_=_0_with_autoscaling_enabled_(scale-to-zero_prevention)`
- `TestAutoscalerEnforcesMinimumOneReplica`
- `TestEDSReplicasEnforcesMinimumOne`
Migration & Backward Compatibility
Existing EDS resources: No migration needed
- Resources with `minReplicas >= 1` continue working unchanged
- Resources with `minReplicas = 0` will fail validation on next update (intended behavior)
Recommended action for operators:
- Check for affected resources: `kubectl get eds -A -o json | jq '.items[] | select(.spec.scaling.minReplicas < 1)'`
- Change `minReplicas=0` to `minReplicas=1` or higher
Related Issues