Autoscaler fails to update spec.replicas when scaling boundaries change #511
Merged
56f4698 to
8c6000c
Compare
… spec.scaling.minReplicas Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
8c6000c to
c53f570
Compare
Collaborator
This could be covered by a test case, e.g. one as described in the example.
Contributor
See #512 for a unit test.
Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
Member
Author
@adrobisch I cherry-picked the commit from your PR and signed off.
Collaborator
Ideally we would have a test covering scaleEDS to test the bug explained in the description, but this is of course more complicated to test.
Collaborator
👍
Member
Author
👍
3eaae66 to
7f54560
Compare
otrosien
added a commit
that referenced
this pull request
Dec 12, 2025
This fixes a critical regression introduced in PR #511 (commit c53f570) where an EDS could scale to 0 replicas despite having minReplicas set.

Bug Scenario:
1. kubectl patch clears spec.replicas to nil
2. edsReplicas() returns 0 (per c53f570 change)
3. scaleEDS() writes eds.Spec.Replicas = &0 at line 935
4. Autoscaler returns no-op (e.g., excludeSystemIndices filters all indices)
5. No-op means scalingOperation.NodeReplicas = nil
6. Line 945 condition fails, so line 959 doesn't execute
7. Result: spec.replicas stays at 0, violating minReplicas

Root Cause:
The c53f570 change made edsReplicas() return 0 for autoscaling-enabled EDS with nil replicas, intending to let the autoscaler calculate the initial value. However, when the autoscaler returns a no-op (nil NodeReplicas), there's no fallback to prevent spec.replicas from being set to 0.

Fix:
Add defensive check in scaleEDS() before writing to spec.replicas:
- When currentReplicas == 0 and minReplicas > 0:
  - First try status.replicas (reflects actual StatefulSet state)
  - Fall back to minReplicas for new/uninitialized EDS
- This prevents writing 0 when it would violate bounds

Test Coverage:
- Added TestScaleToZeroPrevention with 3 test cases
- Validates the defensive logic for all scenarios
- All existing tests continue to pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
otrosien
added a commit
that referenced
this pull request
Dec 12, 2025
This fixes a critical regression introduced in PR #511 (commit c53f570) where an EDS could scale to 0 replicas despite having minReplicas set.

Bug Scenario:
1. kubectl patch clears spec.replicas to nil
2. edsReplicas() returns 0 (per c53f570 change)
3. scaleEDS() writes eds.Spec.Replicas = &0 at line 935
4. Autoscaler returns no-op (e.g., excludeSystemIndices filters all indices)
5. No-op means scalingOperation.NodeReplicas = nil
6. Line 945 condition fails, so line 959 doesn't execute
7. Result: spec.replicas stays at 0, violating minReplicas

Root Cause:
The c53f570 change made edsReplicas() return 0 for autoscaling-enabled EDS with nil replicas, intending to let the autoscaler calculate the initial value. However, when the autoscaler returns a no-op (nil NodeReplicas), there's no fallback to prevent spec.replicas from being set to 0.

Fix:
Add defensive check in scaleEDS() before writing to spec.replicas:
- When currentReplicas == 0 and minReplicas > 0:
  - First try status.replicas (reflects actual StatefulSet state)
  - Fall back to minReplicas for new/uninitialized EDS
- This prevents writing 0 when it would violate bounds

Test Coverage:
- Added TestScaleToZeroPrevention with 3 test cases
- Validates the defensive logic for all scenarios
- All existing tests continue to pass

Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
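The fallback order described in the commit message above (status.replicas first, then minReplicas) might look roughly like the sketch below. This is not the code from the commit: the helper name effectiveReplicas, its plain int32 parameters, and the choice to treat a status.replicas of 0 as "not available" are all assumptions made for illustration.

```go
package main

import "fmt"

// effectiveReplicas is a hypothetical helper (not taken from the operator)
// that mirrors the fallback order described in the commit message.
// Assumption: a statusReplicas value of 0 means "no usable status yet".
func effectiveReplicas(currentReplicas, statusReplicas, minReplicas int32) int32 {
	// Only intervene when writing 0 would violate the lower bound.
	if currentReplicas == 0 && minReplicas > 0 {
		// Prefer the observed state when it is known.
		if statusReplicas > 0 {
			return statusReplicas
		}
		// New or uninitialized EDS: fall back to the lower bound.
		return minReplicas
	}
	// Any other value is passed through unchanged.
	return currentReplicas
}

func main() {
	fmt.Println(effectiveReplicas(0, 3, 2)) // status.replicas is used
	fmt.Println(effectiveReplicas(0, 0, 2)) // falls back to minReplicas
	fmt.Println(effectiveReplicas(5, 3, 2)) // non-zero value is left alone
}
```

The three calls in main are illustrative inputs only and are not claimed to match the cases in TestScaleToZeroPrevention.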
Description
Problem
When patching an ElasticsearchDataSet (EDS) to update scaling boundaries (spec.scaling.minReplicas or spec.scaling.maxReplicas), the autoscaler correctly scales the cluster but fails to persist the new replica count to spec.replicas. This causes a mismatch between the actual running state and the desired state in the spec.
Root Cause
The bug is in the edsReplicas() function at operator/elasticsearch.go:827-838. This function had two critical issues:
- Premature boundary enforcement: The function applied math.Max(currentReplicas, minReplicas), which artificially inflated the current replica count before the autoscaler could evaluate it.
- Update condition bypass: When edsReplicas() returned a value that already matched what the autoscaler wanted, the condition at line 940 (scalingOperation.NodeReplicas != currentReplicas) would evaluate to false, causing the spec update logic (lines 940-955) to be skipped entirely.
Example Scenario
Before patch:
- spec.replicas = 3
- spec.scaling.minReplicas = 2
Apply patch (raising spec.scaling.minReplicas to 4):
With the bug:
- edsReplicas() returns max(3, 4) = 4 (premature enforcement)
- 4 != 4 → false, spec update skipped
- spec.replicas still shows 3
With the fix:
- edsReplicas() returns 3 (actual current value)
- 4 != 3 → true, spec update triggered
- spec.replicas set to 4 at line 954
- spec.replicas correctly shows 4
Impact
This bug affects any operation that changes scaling boundaries:
- minReplicas
- maxReplicas
- spec.replicas reflecting the actual desired state
Changes
1. Remove premature boundary enforcement (operator/elasticsearch.go:827-838)
Before:
After:
Key changes:
- Removed math.Max() enforcement: The function now returns the actual current value from spec.replicas without applying boundary constraints. This allows the autoscaler to detect when scaling is needed.
- Return 0 for nil case: When spec.replicas is nil and autoscaling is enabled, return 0 instead of minReplicas. This ensures the autoscaler always makes an explicit scaling decision rather than assuming the current state is already at minReplicas.
- Removed unused math import: Since math.Max() is no longer used, the import is removed.
Why this works:
The autoscaler (GetScalingOperation()) is responsible for enforcing scaling boundaries and making scaling decisions. The edsReplicas() function should only report the current state, not apply policy. By removing the premature boundary enforcement, we allow the autoscaler to work with the actual spec.replicas value.
The boundary enforcement still happens, but at the right place: in the autoscaler's decision logic, not in the state-reading function.
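To make the effect on the update condition concrete, here is a minimal, self-contained sketch of the behavior before and after the change. It is an illustration under stated assumptions, not the operator's code: the function names edsReplicasBefore, edsReplicasAfter and needsSpecUpdate are hypothetical, autoscaling is assumed to be enabled, and the real functions take the EDS object rather than bare values.

```go
package main

import (
	"fmt"
	"math"
)

// edsReplicasBefore is a hypothetical reconstruction of the old behavior:
// the minReplicas bound is applied while reading the current state.
func edsReplicasBefore(replicas *int32, minReplicas int32) int32 {
	if replicas == nil {
		return minReplicas
	}
	return int32(math.Max(float64(*replicas), float64(minReplicas)))
}

// edsReplicasAfter sketches the new behavior: report spec.replicas as-is
// and return 0 when it is unset, leaving policy to the autoscaler.
func edsReplicasAfter(replicas *int32) int32 {
	if replicas == nil {
		return 0
	}
	return *replicas
}

// needsSpecUpdate models the update condition from the description
// (scalingOperation.NodeReplicas != currentReplicas). A nil value stands
// for an autoscaler no-op; that modeling is an assumption of this sketch.
func needsSpecUpdate(nodeReplicas *int32, currentReplicas int32) bool {
	return nodeReplicas != nil && *nodeReplicas != currentReplicas
}

func main() {
	current := int32(3) // spec.replicas before the patch
	desired := int32(4) // autoscaler result once minReplicas is 4

	before := edsReplicasBefore(&current, 4)
	after := edsReplicasAfter(&current)

	fmt.Printf("before fix: current=%d update=%t\n", before, needsSpecUpdate(&desired, before))
	fmt.Printf("after fix:  current=%d update=%t\n", after, needsSpecUpdate(&desired, after))
}
```

With the values from the example scenario, the old variant reports 4 and the condition is false, so the spec update is skipped; the new variant reports 3 and the condition is true, so the spec update runs.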
Types of Changes