Skip to content

Autoscaler fails to update spec.replicas when scaling boundaries change#511

Merged
otrosien merged 2 commits intomasterfrom
patch-eds-fix
Dec 10, 2025
Merged

Autoscaler fails to update spec.replicas when scaling boundaries change#511
otrosien merged 2 commits intomasterfrom
patch-eds-fix

Conversation

@otrosien
Copy link
Member

@otrosien otrosien commented Dec 9, 2025

Description

Problem

When patching an ElasticsearchDataSet (EDS) to update scaling boundaries (spec.scaling.minReplicas or spec.scaling.maxReplicas), the autoscaler correctly scales the cluster but fails to persist the new replica count to spec.replicas. This causes a mismatch between the actual running state and the desired state in the spec.

Root Cause

The bug is in the edsReplicas() function at operator/elasticsearch.go:827-838. This function had two critical issues:

  1. Premature boundary enforcement: The function applied math.Max(currentReplicas, minReplicas) which artificially inflated the current replica count before the autoscaler could evaluate it.

  2. Update condition bypass: When edsReplicas() returned a value that already matched what the autoscaler wanted, the condition at line 940 (scalingOperation.NodeReplicas != currentReplicas) would evaluate to false, causing the spec update logic (lines 940-955) to be skipped entirely.

Example Scenario

Before patch:

  • spec.replicas = 3
  • spec.scaling.minReplicas = 2
  • Cluster running with 3 nodes

Apply patch:

kubectl patch eds my-cluster --type=json -p '[
  {"op": "replace", "path": "/spec/scaling/minReplicas", "value": 4},
  {"op": "replace", "path": "/spec/scaling/maxReplicas", "value": 4}
]'

With the bug:

  1. edsReplicas() returns max(3, 4) = 4 (premature enforcement)
  2. Autoscaler calculates target replicas = 4
  3. Check 4 != 4 → false, spec update skipped
  4. Cluster scales to 4 nodes (via StatefulSet reconciliation)
  5. Result: Cluster has 4 nodes but spec.replicas still shows 3

With the fix:

  1. edsReplicas() returns 3 (actual current value)
  2. Autoscaler calculates target replicas = 4
  3. Check 4 != 3 → true, spec update triggered
  4. spec.replicas set to 4 at line 954
  5. Update persisted at line 968
  6. Result: Both cluster and spec.replicas correctly show 4

Impact

This bug affects any operation that changes scaling boundaries:

  • Scaling up by increasing minReplicas
  • Scaling down by decreasing maxReplicas
  • Any automation that relies on spec.replicas reflecting the actual desired state

Changes

1. Remove premature boundary enforcement (operator/elasticsearch.go:827-838)

Before:

func edsReplicas(eds *zv1.ElasticsearchDataSet) int32 {
    scaling := eds.Spec.Scaling
    if scaling == nil || !scaling.Enabled {
        if eds.Spec.Replicas == nil {
            return 1
        }
        return *eds.Spec.Replicas
    }
    // initialize with minReplicas
    minReplicas := eds.Spec.Scaling.MinReplicas
    if eds.Spec.Replicas == nil {
        return minReplicas
    }
    currentReplicas := *eds.Spec.Replicas
    return int32(math.Max(float64(currentReplicas), float64(scaling.MinReplicas)))
}

After:

func edsReplicas(eds *zv1.ElasticsearchDataSet) int32 {
    scaling := eds.Spec.Scaling
    if scaling == nil || !scaling.Enabled {
        if eds.Spec.Replicas == nil {
            return 1
        }
        return *eds.Spec.Replicas
    }
    // initialize with 0
    if eds.Spec.Replicas == nil {
        return 0
    }
    return *eds.Spec.Replicas
}

Key changes:

  • Removed math.Max() enforcement: The function now returns the actual current value from spec.replicas without applying boundary constraints. This allows the autoscaler to detect when scaling is needed.

  • Return 0 for nil case: When spec.replicas is nil and autoscaling is enabled, return 0 instead of minReplicas. This ensures the autoscaler always makes an explicit scaling decision rather than assuming the current state is already at minReplicas.

  • Removed unused math import: Since math.Max() is no longer used, the import is removed.

Why this works:

The autoscaler (GetScalingOperation()) is responsible for enforcing scaling boundaries and making scaling decisions. The edsReplicas() function should only report the current state, not apply policy. By removing the premature boundary enforcement, we allow the autoscaler to:

  1. See the actual current state (e.g., 3 replicas)
  2. Compare it against the new boundaries (e.g., minReplicas = 4)
  3. Detect that scaling is needed (3 < 4)
  4. Trigger the update logic that persists the new spec.replicas value

The boundary enforcement still happens - but at the right place: in the autoscaler's decision logic, not in the state-reading function.

Types of Changes

  • Bug fix (non-breaking change which fixes an issue)

@otrosien otrosien added the bugfix Bug fixes and patches, e.g. fixing of a production issue that is affecting the customer experience. label Dec 9, 2025
@otrosien otrosien changed the title Fix spec.replicas screwup when patching an EDS Autoscaler fails to update spec.replicas when scaling boundaries change Dec 9, 2025
@otrosien otrosien marked this pull request as ready for review December 10, 2025 05:42
… spec.scaling.minReplicas

Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
@mikkeloscar
Copy link
Collaborator

This could be covered by a test case e.g. one as described in the example

@adrobisch
Copy link
Contributor

see #512 for a unit test

Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
@otrosien
Copy link
Member Author

@adrobisch I cherry-picked the commit from your PR and signed off.

@mikkeloscar
Copy link
Collaborator

ideally we would have a test covering scaleEDS to test the bug explained in the description, but this is ofc. more complicated to test.

@mikkeloscar
Copy link
Collaborator

👍

1 similar comment
@otrosien
Copy link
Member Author

👍

@otrosien otrosien force-pushed the patch-eds-fix branch 2 times, most recently from 3eaae66 to 7f54560 Compare December 10, 2025 11:24
@otrosien otrosien merged commit 74d94cf into master Dec 10, 2025
12 of 14 checks passed
@otrosien otrosien deleted the patch-eds-fix branch December 10, 2025 11:25
otrosien added a commit that referenced this pull request Dec 12, 2025
This fixes a critical regression introduced in PR #511 (commit c53f570)
where an EDS could scale to 0 replicas despite having minReplicas set.

Bug Scenario:
1. kubectl patch clears spec.replicas to nil
2. edsReplicas() returns 0 (per c53f570 change)
3. scaleEDS() writes eds.Spec.Replicas = &0 at line 935
4. Autoscaler returns no-op (e.g., excludeSystemIndices filters all indices)
5. No-op means scalingOperation.NodeReplicas = nil
6. Line 945 condition fails, so line 959 doesn't execute
7. Result: spec.replicas stays at 0, violating minReplicas

Root Cause:
The c53f570 change made edsReplicas() return 0 for autoscaling-enabled EDS
with nil replicas, intending to let the autoscaler calculate the initial
value. However, when the autoscaler returns a no-op (nil NodeReplicas),
there's no fallback to prevent spec.replicas from being set to 0.

Fix:
Add defensive check in scaleEDS() before writing to spec.replicas:
- When currentReplicas == 0 and minReplicas > 0:
  - First try status.replicas (reflects actual StatefulSet state)
  - Fall back to minReplicas for new/uninitialized EDS
- This prevents writing 0 when it would violate bounds

Test Coverage:
- Added TestScaleToZeroPrevention with 3 test cases
- Validates the defensive logic for all scenarios
- All existing tests continue to pass

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
otrosien added a commit that referenced this pull request Dec 12, 2025
This fixes a critical regression introduced in PR #511 (commit c53f570)
where an EDS could scale to 0 replicas despite having minReplicas set.

Bug Scenario:
1. kubectl patch clears spec.replicas to nil
2. edsReplicas() returns 0 (per c53f570 change)
3. scaleEDS() writes eds.Spec.Replicas = &0 at line 935
4. Autoscaler returns no-op (e.g., excludeSystemIndices filters all indices)
5. No-op means scalingOperation.NodeReplicas = nil
6. Line 945 condition fails, so line 959 doesn't execute
7. Result: spec.replicas stays at 0, violating minReplicas

Root Cause:
The c53f570 change made edsReplicas() return 0 for autoscaling-enabled EDS
with nil replicas, intending to let the autoscaler calculate the initial
value. However, when the autoscaler returns a no-op (nil NodeReplicas),
there's no fallback to prevent spec.replicas from being set to 0.

Fix:
Add defensive check in scaleEDS() before writing to spec.replicas:
- When currentReplicas == 0 and minReplicas > 0:
  - First try status.replicas (reflects actual StatefulSet state)
  - Fall back to minReplicas for new/uninitialized EDS
- This prevents writing 0 when it would violate bounds

Test Coverage:
- Added TestScaleToZeroPrevention with 3 test cases
- Validates the defensive logic for all scenarios
- All existing tests continue to pass

Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
otrosien added a commit that referenced this pull request Dec 12, 2025
This fixes a critical regression introduced in PR #511 (commit c53f570)
where an EDS could scale to 0 replicas despite having minReplicas set.

Bug Scenario:
1. kubectl patch clears spec.replicas to nil
2. edsReplicas() returns 0 (per c53f570 change)
3. scaleEDS() writes eds.Spec.Replicas = &0 at line 935
4. Autoscaler returns no-op (e.g., excludeSystemIndices filters all indices)
5. No-op means scalingOperation.NodeReplicas = nil
6. Line 945 condition fails, so line 959 doesn't execute
7. Result: spec.replicas stays at 0, violating minReplicas

Root Cause:
The c53f570 change made edsReplicas() return 0 for autoscaling-enabled EDS
with nil replicas, intending to let the autoscaler calculate the initial
value. However, when the autoscaler returns a no-op (nil NodeReplicas),
there's no fallback to prevent spec.replicas from being set to 0.

Fix:
Add defensive check in scaleEDS() before writing to spec.replicas:
- When currentReplicas == 0 and minReplicas > 0:
  - First try status.replicas (reflects actual StatefulSet state)
  - Fall back to minReplicas for new/uninitialized EDS
- This prevents writing 0 when it would violate bounds

Test Coverage:
- Added TestScaleToZeroPrevention with 3 test cases
- Validates the defensive logic for all scenarios
- All existing tests continue to pass

Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bugfix Bug fixes and patches, e.g. fixing of a production issue that is affecting the customer experience.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants