Autoscaler fails to update spec.replicas when scaling boundaries change by otrosien · Pull Request #511 · zalando-incubator/es-operator

otrosien · 2025-12-09T17:48:02Z

Description

Problem

When patching an ElasticsearchDataSet (EDS) to update scaling boundaries (spec.scaling.minReplicas or spec.scaling.maxReplicas), the autoscaler correctly scales the cluster but fails to persist the new replica count to spec.replicas. This causes a mismatch between the actual running state and the desired state in the spec.

Root Cause

The bug is in the edsReplicas() function at operator/elasticsearch.go:827-838. This function had two critical issues:

Premature boundary enforcement: The function applied math.Max(currentReplicas, minReplicas) which artificially inflated the current replica count before the autoscaler could evaluate it.
Update condition bypass: When edsReplicas() returned a value that already matched what the autoscaler wanted, the condition at line 940 (scalingOperation.NodeReplicas != currentReplicas) would evaluate to false, causing the spec update logic (lines 940-955) to be skipped entirely.

Example Scenario

Before patch:

spec.replicas = 3
spec.scaling.minReplicas = 2
Cluster running with 3 nodes

Apply patch:

kubectl patch eds my-cluster --type=json -p '[
  {"op": "replace", "path": "/spec/scaling/minReplicas", "value": 4},
  {"op": "replace", "path": "/spec/scaling/maxReplicas", "value": 4}
]'

With the bug:

edsReplicas() returns max(3, 4) = 4 (premature enforcement)
Autoscaler calculates target replicas = 4
Check 4 != 4 → false, spec update skipped
Cluster scales to 4 nodes (via StatefulSet reconciliation)
Result: Cluster has 4 nodes but spec.replicas still shows 3

With the fix:

edsReplicas() returns 3 (actual current value)
Autoscaler calculates target replicas = 4
Check 4 != 3 → true, spec update triggered
spec.replicas set to 4 at line 954
Update persisted at line 968
Result: Both cluster and spec.replicas correctly show 4

Impact

This bug affects any operation that changes scaling boundaries:

Scaling up by increasing minReplicas
Scaling down by decreasing maxReplicas
Any automation that relies on spec.replicas reflecting the actual desired state

Changes

1. Remove premature boundary enforcement (`operator/elasticsearch.go:827-838`)

Before:

func edsReplicas(eds *zv1.ElasticsearchDataSet) int32 {
    scaling := eds.Spec.Scaling
    if scaling == nil || !scaling.Enabled {
        if eds.Spec.Replicas == nil {
            return 1
        }
        return *eds.Spec.Replicas
    }
    // initialize with minReplicas
    minReplicas := eds.Spec.Scaling.MinReplicas
    if eds.Spec.Replicas == nil {
        return minReplicas
    }
    currentReplicas := *eds.Spec.Replicas
    return int32(math.Max(float64(currentReplicas), float64(scaling.MinReplicas)))
}

After:

func edsReplicas(eds *zv1.ElasticsearchDataSet) int32 {
    scaling := eds.Spec.Scaling
    if scaling == nil || !scaling.Enabled {
        if eds.Spec.Replicas == nil {
            return 1
        }
        return *eds.Spec.Replicas
    }
    // initialize with 0
    if eds.Spec.Replicas == nil {
        return 0
    }
    return *eds.Spec.Replicas
}

Key changes:

Removed math.Max() enforcement: The function now returns the actual current value from spec.replicas without applying boundary constraints. This allows the autoscaler to detect when scaling is needed.
Return 0 for nil case: When spec.replicas is nil and autoscaling is enabled, return 0 instead of minReplicas. This ensures the autoscaler always makes an explicit scaling decision rather than assuming the current state is already at minReplicas.
Removed unused math import: Since math.Max() is no longer used, the import is removed.

Why this works:

The autoscaler (GetScalingOperation()) is responsible for enforcing scaling boundaries and making scaling decisions. The edsReplicas() function should only report the current state, not apply policy. By removing the premature boundary enforcement, we allow the autoscaler to:

See the actual current state (e.g., 3 replicas)
Compare it against the new boundaries (e.g., minReplicas = 4)
Detect that scaling is needed (3 < 4)
Trigger the update logic that persists the new spec.replicas value

The boundary enforcement still happens - but at the right place: in the autoscaler's decision logic, not in the state-reading function.

Types of Changes

Bug fix (non-breaking change which fixes an issue)

… spec.scaling.minReplicas Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>

mikkeloscar · 2025-12-10T09:09:43Z

This could be covered by a test case e.g. one as described in the example

adrobisch · 2025-12-10T09:33:42Z

see #512 for a unit test

Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>

otrosien · 2025-12-10T09:38:39Z

@adrobisch I cherry-picked the commit from your PR and signed off.

mikkeloscar · 2025-12-10T10:08:28Z

ideally we would have a test covering scaleEDS to test the bug explained in the description, but this is ofc. more complicated to test.

mikkeloscar · 2025-12-10T10:08:33Z

👍

otrosien · 2025-12-10T11:10:09Z

👍

This fixes a critical regression introduced in PR #511 (commit c53f570) where an EDS could scale to 0 replicas despite having minReplicas set. Bug Scenario: 1. kubectl patch clears spec.replicas to nil 2. edsReplicas() returns 0 (per c53f570 change) 3. scaleEDS() writes eds.Spec.Replicas = &0 at line 935 4. Autoscaler returns no-op (e.g., excludeSystemIndices filters all indices) 5. No-op means scalingOperation.NodeReplicas = nil 6. Line 945 condition fails, so line 959 doesn't execute 7. Result: spec.replicas stays at 0, violating minReplicas Root Cause: The c53f570 change made edsReplicas() return 0 for autoscaling-enabled EDS with nil replicas, intending to let the autoscaler calculate the initial value. However, when the autoscaler returns a no-op (nil NodeReplicas), there's no fallback to prevent spec.replicas from being set to 0. Fix: Add defensive check in scaleEDS() before writing to spec.replicas: - When currentReplicas == 0 and minReplicas > 0: - First try status.replicas (reflects actual StatefulSet state) - Fall back to minReplicas for new/uninitialized EDS - This prevents writing 0 when it would violate bounds Test Coverage: - Added TestScaleToZeroPrevention with 3 test cases - Validates the defensive logic for all scenarios - All existing tests continue to pass 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude <noreply@anthropic.com>

This fixes a critical regression introduced in PR #511 (commit c53f570) where an EDS could scale to 0 replicas despite having minReplicas set. Bug Scenario: 1. kubectl patch clears spec.replicas to nil 2. edsReplicas() returns 0 (per c53f570 change) 3. scaleEDS() writes eds.Spec.Replicas = &0 at line 935 4. Autoscaler returns no-op (e.g., excludeSystemIndices filters all indices) 5. No-op means scalingOperation.NodeReplicas = nil 6. Line 945 condition fails, so line 959 doesn't execute 7. Result: spec.replicas stays at 0, violating minReplicas Root Cause: The c53f570 change made edsReplicas() return 0 for autoscaling-enabled EDS with nil replicas, intending to let the autoscaler calculate the initial value. However, when the autoscaler returns a no-op (nil NodeReplicas), there's no fallback to prevent spec.replicas from being set to 0. Fix: Add defensive check in scaleEDS() before writing to spec.replicas: - When currentReplicas == 0 and minReplicas > 0: - First try status.replicas (reflects actual StatefulSet state) - Fall back to minReplicas for new/uninitialized EDS - This prevents writing 0 when it would violate bounds Test Coverage: - Added TestScaleToZeroPrevention with 3 test cases - Validates the defensive logic for all scenarios - All existing tests continue to pass Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>

otrosien added the bugfix Bug fixes and patches, e.g. fixing of a production issue that is affecting the customer experience. label Dec 9, 2025

otrosien changed the title ~~Fix spec.replicas screwup when patching an EDS~~ Autoscaler fails to update spec.replicas when scaling boundaries change Dec 9, 2025

otrosien force-pushed the patch-eds-fix branch from 56f4698 to 8c6000c Compare December 10, 2025 05:42

otrosien marked this pull request as ready for review December 10, 2025 05:42

otrosien requested a review from mikkeloscar as a code owner December 10, 2025 05:42

Fix condition where due to patching spec.replicas became smaller than…

c53f570

… spec.scaling.minReplicas Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>

otrosien force-pushed the patch-eds-fix branch from 8c6000c to c53f570 Compare December 10, 2025 05:53

add test for edsReplicas

7f54560

Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>

otrosien mentioned this pull request Dec 10, 2025

test: test for edsReplicas #512

Closed

otrosien force-pushed the patch-eds-fix branch 2 times, most recently from 3eaae66 to 7f54560 Compare December 10, 2025 11:24

otrosien merged commit 74d94cf into master Dec 10, 2025
12 of 14 checks passed

otrosien deleted the patch-eds-fix branch December 10, 2025 11:25

girishc13 mentioned this pull request Dec 11, 2025

Revert "Autoscaler fails to update spec.replicas when scaling boundaries change" #514

Closed

otrosien mentioned this pull request Dec 12, 2025

Fix scale-to-zero bug when spec.replicas is nil #515

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Autoscaler fails to update spec.replicas when scaling boundaries change#511

Autoscaler fails to update spec.replicas when scaling boundaries change#511
otrosien merged 2 commits intomasterfrom
patch-eds-fix

otrosien commented Dec 9, 2025 •

edited

Loading

Uh oh!

mikkeloscar commented Dec 10, 2025

Uh oh!

adrobisch commented Dec 10, 2025

Uh oh!

otrosien commented Dec 10, 2025

Uh oh!

mikkeloscar commented Dec 10, 2025

Uh oh!

mikkeloscar commented Dec 10, 2025

Uh oh!

otrosien commented Dec 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

otrosien commented Dec 9, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Problem

Root Cause

Example Scenario

Impact

Changes

1. Remove premature boundary enforcement (operator/elasticsearch.go:827-838)

Key changes:

Why this works:

Types of Changes

Uh oh!

mikkeloscar commented Dec 10, 2025

Uh oh!

adrobisch commented Dec 10, 2025

Uh oh!

otrosien commented Dec 10, 2025

Uh oh!

mikkeloscar commented Dec 10, 2025

Uh oh!

mikkeloscar commented Dec 10, 2025

Uh oh!

otrosien commented Dec 10, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

otrosien commented Dec 9, 2025 •

edited

Loading

1. Remove premature boundary enforcement (`operator/elasticsearch.go:827-838`)