Refactor replica handling to prevent scale-to-zero scenarios by adrobisch · Pull Request #518 · zalando-incubator/es-operator

adrobisch · 2026-01-12T14:18:06Z

One-line summary

Refactors how the EDS replica values and their related fallbacks/interfaces are handled, and implements defense-in-depth protection against scale-to-zero scenarios.

Description

Problem

After #516, we identified a regression where the operator would scale the StatefulSet of an existing EDS to zero replicas under specific conditions:

Regression scenario:

spec.scaling.enabled: true
spec.excludeSystemIndices: true
Only system indices exist in the cluster
Autoscaler.calculateScalingOperation() returns early with noop because len(managedIndices) == 0
edsReplicas() = Replicas() returns 0 because scaling is enabled and spec.replicas is nil
operatePods scales to 0, causing complete cluster unavailability

Root Cause

The issue stemmed from ambiguous handling of nil vs. zero replica values:

The autoscaler couldn't distinguish between "not yet initialized" (nil) and "explicitly set to zero" (0)
When spec.replicas was nil, different code paths made conflicting assumptions
No validation prevented minReplicas=0 when autoscaling was enabled
Fallback logic could result in zero replicas under edge cases
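
The ambiguity can be illustrated with a minimal sketch. The `edsSpec` type and `replicasByValue` function below are hypothetical stand-ins, not the operator's actual code; they only assume that a value-returning accessor maps a nil `spec.replicas` to 0, as described in the regression scenario.

```go
package main

import "fmt"

// edsSpec is a hypothetical, trimmed-down stand-in for the EDS spec.
type edsSpec struct {
	Replicas       *int32 // nil until a user or the operator sets it
	ScalingEnabled bool
}

// replicasByValue mimics the pre-refactor shape (an assumption for
// illustration): a nil spec.replicas collapses to 0, so callers cannot
// tell "not set yet" apart from "explicitly zero".
func replicasByValue(s edsSpec) int32 {
	if s.Replicas == nil {
		return 0
	}
	return *s.Replicas
}

func main() {
	zero := int32(0)
	unset := edsSpec{ScalingEnabled: true}
	explicitZero := edsSpec{Replicas: &zero, ScalingEnabled: true}

	// Both calls yield 0, so the two states look identical to the caller.
	fmt.Println(replicasByValue(unset), replicasByValue(explicitZero))
}
```

Any caller that writes the returned value straight to the StatefulSet would treat the uninitialized case as a request for zero pods.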

Solution

Part 1: Pointer-based Replicas API (commits `c8daefd`, `89e1a58`, `f1d176d`)

Key changes:

Replicas() and edsReplicas() now return *int32 (pointer) instead of int32
nil explicitly means "no safe/defined value yet" - don't touch replicas
Only scaleEDS() initializes spec.replicas with defensive fallback logic
rescaleStatefulSet() and operatePods() bail out early if desired replicas are nil
Split rescaleStatefulSet() into smaller functions to reduce cyclomatic complexity

Benefits:

Clear semantic distinction: nil = uninitialized, non-nil = explicit value
Prevents accidental writes of zero during initialization
Makes control flow more explicit and easier to reason about
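
As a rough sketch of the pointer-based control flow, assuming simplified signatures: `desiredReplicas` and `rescale` are hypothetical names chosen for illustration and do not mirror the real `edsReplicas()` or `rescaleStatefulSet()` signatures.

```go
package main

import "fmt"

// desiredReplicas is a hypothetical accessor in the post-refactor style:
// nil means "no safe value known yet", non-nil is an explicit value.
func desiredReplicas(specReplicas *int32) *int32 {
	if specReplicas == nil {
		return nil
	}
	v := *specReplicas // copy so callers cannot mutate the spec
	return &v
}

// rescale is a hypothetical reconcile step. It returns the replica count
// that would be applied and whether a write would happen at all.
func rescale(current int32, desired *int32) (int32, bool) {
	if desired == nil {
		// Bail out early: leave the current replica count untouched.
		return current, false
	}
	return *desired, true
}

func main() {
	three := int32(3)

	n, wrote := rescale(5, desiredReplicas(nil))
	fmt.Println(n, wrote) // 5 false

	n, wrote = rescale(5, desiredReplicas(&three))
	fmt.Println(n, wrote) // 3 true
}
```

The point of the design is that "do nothing" becomes a distinct outcome instead of being encoded as a numeric value.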

Part 2: Defense-in-Depth Scale-to-Zero Prevention (commit 008d647)

To ensure scale-to-zero never happens under any circumstance, this PR implements five layers of protection:

Layer 1: API Validation

File: operator/elasticsearch.go:1046-1095

Added validation to require minReplicas >= 1 when autoscaling is enabled:

if scaling.MinReplicas < 1 {
    return fmt.Errorf(
        "minReplicas must be at least 1 when autoscaling is enabled (got %d)",
        scaling.MinReplicas,
    )
}

Protects against: Configuration errors at resource creation/update time
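
A self-contained version of this check might look like the following sketch. The `scalingSpec` type and the `validateScaling` name are assumptions for illustration; only the `minReplicas >= 1` rule and the error message come from the excerpt above.

```go
package main

import "fmt"

// scalingSpec is a hypothetical, reduced form of the EDS scaling section.
type scalingSpec struct {
	Enabled     bool
	MinReplicas int32
}

// validateScaling rejects minReplicas < 1, but only when autoscaling is
// enabled. A nil or disabled scaling section is left alone.
func validateScaling(scaling *scalingSpec) error {
	if scaling == nil || !scaling.Enabled {
		return nil
	}
	if scaling.MinReplicas < 1 {
		return fmt.Errorf(
			"minReplicas must be at least 1 when autoscaling is enabled (got %d)",
			scaling.MinReplicas,
		)
	}
	return nil
}

func main() {
	fmt.Println(validateScaling(&scalingSpec{Enabled: true, MinReplicas: 0}))
	fmt.Println(validateScaling(&scalingSpec{Enabled: true, MinReplicas: 2}))
	fmt.Println(validateScaling(&scalingSpec{Enabled: false, MinReplicas: 0}))
}
```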

Layer 2: Autoscaler Bounds Enforcement

File: operator/autoscaler.go:248-258

Enhanced ensureBoundsNodeReplicas() to enforce an absolute minimum of 1:

// Enforce absolute minimum of 1 replica to prevent scale-to-zero
if newDesiredNodeReplicas < 1 {
    as.logger.Warnf("EDS %s/%s: Requested to scale to %d, enforcing minimum of 1 replica.",
        as.eds.Namespace, as.eds.Name, newDesiredNodeReplicas)
    return 1
}

// Enforce minReplicas with absolute floor of 1
effectiveMinReplicas := scalingSpec.MinReplicas
if effectiveMinReplicas < 1 {
    effectiveMinReplicas = 1
}

Protects against: Autoscaler calculations that result in zero (edge cases, bugs)
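
The lower-bound logic can be sketched as a standalone function. Note that this is a simplification and not the operator's `ensureBoundsNodeReplicas()`: the hypothetical `enforceFloor` below folds both checks from the excerpt into a single effective floor rather than returning early, and it ignores any upper bound.

```go
package main

import "fmt"

// enforceFloor is a hypothetical helper: it clamps a requested replica
// count to max(minReplicas, 1), so the result is never below 1.
func enforceFloor(requested, minReplicas int32) int32 {
	floor := minReplicas
	if floor < 1 {
		floor = 1 // absolute floor, even if minReplicas is misconfigured
	}
	if requested < floor {
		return floor
	}
	return requested
}

func main() {
	// Same inputs as the test described in the PR: -1, 0, 1, 5.
	for _, requested := range []int32{-1, 0, 1, 5} {
		fmt.Println(requested, "->", enforceFloor(requested, 0))
	}
}
```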

Layer 3: Fallback Logic Safety

File: operator/elasticsearch.go:938-952

Updated scaleEDS() fallback to never allow 0:

if currentReplicasPtr == nil {
    // Default fallback value (minimum 1)
    desired := int32(1)

    if scaling != nil && scaling.MinReplicas > 0 {
        desired = scaling.MinReplicas
        if eds.Status.Replicas > 0 && eds.Status.Replicas > desired {
            desired = eds.Status.Replicas
        }
    } else if eds.Status.Replicas > 0 {
        desired = eds.Status.Replicas
    }

    // Absolute safety check
    if desired < 1 {
        log.Infof("EDS %s/%s: Fallback calculation resulted in %d, enforcing minimum of 1",
            eds.Namespace, eds.Name, desired)
        desired = 1
    }
}

Protects against: Initialization with zero when status and minReplicas are both zero
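
To make the fallback easier to reason about, here is the same decision logic extracted into a pure function. The `fallbackReplicas` name and the `scalingSpec` type are hypothetical; the branching follows the excerpt above.

```go
package main

import "fmt"

// scalingSpec is a hypothetical, reduced form of the EDS scaling section.
type scalingSpec struct {
	MinReplicas int32
}

// fallbackReplicas picks an initial replica count when spec.replicas is
// nil: the larger of minReplicas and the status replicas, never below 1.
func fallbackReplicas(scaling *scalingSpec, statusReplicas int32) int32 {
	desired := int32(1) // default fallback value

	if scaling != nil && scaling.MinReplicas > 0 {
		desired = scaling.MinReplicas
		if statusReplicas > desired {
			desired = statusReplicas
		}
	} else if statusReplicas > 0 {
		desired = statusReplicas
	}
	return desired
}

func main() {
	fmt.Println(fallbackReplicas(nil, 0))                          // 1
	fmt.Println(fallbackReplicas(&scalingSpec{MinReplicas: 3}, 0)) // 3
	fmt.Println(fallbackReplicas(&scalingSpec{MinReplicas: 2}, 5)) // 5
	fmt.Println(fallbackReplicas(nil, 4))                          // 4
}
```

With this shape, every branch starts from or exceeds 1, so the result cannot be zero even when both inputs are.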

Layer 4: Replica Calculation Safety

File: operator/elasticsearch.go:826-845

Updated edsReplicas() to enforce minimum of 1:

// Use max(minReplicas, 1) as base to prevent scale-to-zero
desired := scaling.MinReplicas
if desired < 1 {
    desired = 1
}

Protects against: Calculation logic returning zero in any scenario

Layer 5: Operator Reconciliation Safety

File: operator/operator.go:187-198

Added validation before creating/updating StatefulSet:

// Safety check to prevent scale-to-zero
if desiredReplicas != nil && *desiredReplicas < 1 {
    return nil, fmt.Errorf(
        "refusing to scale StatefulSet %s/%s to %d replicas (minimum is 1)",
        sr.Namespace(), sr.Name(), *desiredReplicas,
    )
}

Protects against: Any code path that attempts to write zero to StatefulSet
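
A minimal sketch of this last guard in isolation, assuming a hypothetical `checkDesiredReplicas` helper that takes the namespace and name as plain strings instead of reading them from the resource:

```go
package main

import "fmt"

// checkDesiredReplicas is a hypothetical helper: nil passes through
// (nothing will be written), values below 1 are refused with an error.
func checkDesiredReplicas(namespace, name string, desired *int32) error {
	if desired != nil && *desired < 1 {
		return fmt.Errorf(
			"refusing to scale StatefulSet %s/%s to %d replicas (minimum is 1)",
			namespace, name, *desired,
		)
	}
	return nil
}

func main() {
	zero, two := int32(0), int32(2)
	fmt.Println(checkDesiredReplicas("es", "data", &zero))
	fmt.Println(checkDesiredReplicas("es", "data", &two))
	fmt.Println(checkDesiredReplicas("es", "data", nil))
}
```

Returning an error rather than silently clamping makes the refusal visible in the reconcile loop's error handling.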

Observability Enhancements

Added structured logging at all critical decision points:

Warning when autoscaler enforces minimum of 1 replica
Info when fallback logic enforces minimum of 1
Debug when rescaleStatefulSet bails out due to nil replicas

This ensures operators can debug scale-to-zero prevention in production.

Comprehensive Test Coverage

Added three new test functions:

TestValidateScalingSettingsRejectsZeroMinReplicas (elasticsearch_test.go)
- Validates that minReplicas=0 with autoscaling enabled is rejected
TestAutoscalerEnforcesMinimumOneReplica (autoscaler_test.go)
- Tests that ensureBoundsNodeReplicas() never returns 0
- Tests with inputs: -1, 0, 1, 5 → all return >= 1
TestEDSReplicasEnforcesMinimumOne (elasticsearch_test.go)
- Verifies edsReplicas() enforces minimum of 1
- Tests edge cases: status=0, status=5, etc.

Updated existing test documentation to clarify that minReplicas=0 should be rejected by validation.

Key API Changes

Before:

type StatefulResource interface {
    Replicas() int32  // Ambiguous: 0 could mean "not set" or "scale to zero"
}

After:

type StatefulResource interface {
    Replicas() *int32  // Explicit: nil = "not set", non-nil = explicit value
}
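
For callers, the change means a nil check before dereferencing. The sketch below uses a trimmed-down `StatefulResource` with only the `Replicas()` method, plus a hypothetical `fakeResource` implementation and `describe` helper; the real interface likely has more methods.

```go
package main

import "fmt"

// StatefulResource is reduced here to the single method shown above.
type StatefulResource interface {
	Replicas() *int32
}

// fakeResource is a hypothetical implementation for illustration.
type fakeResource struct {
	replicas *int32
}

func (f fakeResource) Replicas() *int32 { return f.replicas }

// describe shows the caller-side pattern: check for nil first.
func describe(sr StatefulResource) string {
	r := sr.Replicas()
	if r == nil {
		return "replicas not set"
	}
	return fmt.Sprintf("replicas = %d", *r)
}

func main() {
	four := int32(4)
	fmt.Println(describe(fakeResource{}))                // replicas not set
	fmt.Println(describe(fakeResource{replicas: &four})) // replicas = 4
}
```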

Edge Cases Addressed

The defense-in-depth approach protects against:

Configuration errors: minReplicas=0 rejected by validation
Manual kubectl patches: kubectl patch eds --type=merge -p '{"spec":{"replicas":0}}' → caught by Layer 5
Autoscaler edge cases: excludeSystemIndices=true + only system indices → caught by Layers 2-4
Status initialization: New EDS with status=0 → fallback ensures >= 1
Nil replicas during reconciliation: Early bailout prevents writes
Concurrent updates: Kubernetes API optimistic concurrency handles conflicts

Testing

Key test results:

TestValidateScalingSettings/test_minReplicas_=_0_with_autoscaling_enabled_(scale-to-zero_prevention)
TestAutoscalerEnforcesMinimumOneReplica
TestEDSReplicasEnforcesMinimumOne
All existing autoscaler and operator tests

Migration & Backward Compatibility

Existing EDS resources: No migration needed

Resources with minReplicas >= 1 continue working unchanged
Resources with minReplicas = 0 will fail validation on next update (intended behavior)

Recommended action for operators:

Audit existing EDS resources (the enabled check skips resources that do not use autoscaling): kubectl get eds -A -o json | jq '.items[] | select(.spec.scaling.enabled == true and .spec.scaling.minReplicas < 1)'
Update any with minReplicas=0 to minReplicas=1 or higher
Deploy updated operator
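
The audit step could also be scripted. The following is a sketch only: it assumes the JSON shape of `kubectl get eds -A -o json` with `spec.scaling.enabled` and `spec.scaling.minReplicas` fields, and the `findUnsafe` function, the types, and the sample data are hypothetical. In practice the input would come from the kubectl output rather than the embedded sample.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// edsList models only the fields needed for the audit (assumed shape).
type edsList struct {
	Items []edsItem `json:"items"`
}

type edsItem struct {
	Metadata struct {
		Namespace string `json:"namespace"`
		Name      string `json:"name"`
	} `json:"metadata"`
	Spec struct {
		Scaling *struct {
			Enabled     bool  `json:"enabled"`
			MinReplicas int32 `json:"minReplicas"`
		} `json:"scaling"`
	} `json:"spec"`
}

// findUnsafe returns namespace/name of every EDS that has autoscaling
// enabled and minReplicas below 1.
func findUnsafe(data []byte) ([]string, error) {
	var list edsList
	if err := json.Unmarshal(data, &list); err != nil {
		return nil, err
	}
	var flagged []string
	for _, item := range list.Items {
		s := item.Spec.Scaling
		if s != nil && s.Enabled && s.MinReplicas < 1 {
			flagged = append(flagged, item.Metadata.Namespace+"/"+item.Metadata.Name)
		}
	}
	return flagged, nil
}

// sample is made-up data standing in for real kubectl output.
const sample = `{"items":[
 {"metadata":{"namespace":"es","name":"data-a"},"spec":{"scaling":{"enabled":true,"minReplicas":0}}},
 {"metadata":{"namespace":"es","name":"data-b"},"spec":{"scaling":{"enabled":true,"minReplicas":2}}},
 {"metadata":{"namespace":"es","name":"data-c"},"spec":{}}
]}`

func main() {
	flagged, err := findUnsafe([]byte(sample))
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(flagged) // [es/data-a]
}
```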

Related Issues

Fixes regression after "Fix scale-to-zero bug when spec.replicas is nil" (#515) and "autoscaler: enforce bounds when scaling hint is NONE" (#516)
Addresses scale-to-zero scenarios discussed with @otrosien
Implements recommendations from code review analysis

Signed-off-by: Andreas Drobisch <dro@unkonstant.de>

Implements multiple layers of protection to ensure ElasticsearchDataSets never scale to zero replicas, which would cause complete cluster unavailability:

- Layer 1: Add validation requiring minReplicas >= 1 when autoscaling enabled
- Layer 2: Enforce absolute minimum of 1 in autoscaler bounds enforcement
- Layer 3: Update scaleEDS fallback logic to never allow 0
- Layer 4: Ensure edsReplicas enforces minimum of 1 in calculations
- Layer 5: Add safety check in reconcileStatefulset before StatefulSet ops

Also adds structured logging at key decision points and comprehensive test coverage to verify scale-to-zero prevention works across all code paths.

This strengthens the existing scale-to-zero regression fix by adding multiple safety nets that catch edge cases including:

- Configuration errors (minReplicas=0)
- Manual kubectl patches (spec.replicas=0)
- Autoscaler edge cases (excludeSystemIndices filtering all indices)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Oliver Trosien <oliver.trosien@zalando.de>

Signed-off-by: Andreas Drobisch <dro@unkonstant.de>

otrosien · 2026-01-19T17:47:06Z

👍

adrobisch · 2026-01-19T17:49:54Z

👍

adrobisch added 2 commits January 12, 2026 14:36

refactor: make desired replicas as pointer and adjust usages

c8daefd

Signed-off-by: Andreas Drobisch <dro@unkonstant.de>

remove replicas wrapper function

89e1a58

Signed-off-by: Andreas Drobisch <dro@unkonstant.de>

adrobisch requested review from mikkeloscar and otrosien as code owners January 12, 2026 14:18

split rescaleStatefulSet to reduce cyclomatic complexity

f1d176d

Signed-off-by: Andreas Drobisch <dro@unkonstant.de>

otrosien self-assigned this Jan 15, 2026

otrosien added the major label Jan 15, 2026

otrosien changed the title ~~Refactor desired replicas~~ Refactor replica handling to prevent scale-to-zero scenarios Jan 17, 2026

refactor(elasticsearch): extract update method

cf1ece5

Signed-off-by: Andreas Drobisch <dro@unkonstant.de>

otrosien merged commit 1d2772e into zalando-incubator:master Jan 19, 2026
10 checks passed

otrosien merged 5 commits into zalando-incubator:master from adrobisch:refactor-desired-replicas

adrobisch commented Jan 12, 2026 •

edited by otrosien

Loading

Uh oh!

otrosien commented Jan 19, 2026

Uh oh!

adrobisch commented Jan 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

adrobisch commented Jan 12, 2026 • edited by otrosien Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

One-line summary

Description

Problem

Root Cause

Solution

Part 1: Pointer-based Replicas API (commits c8daefd, 89e1a58, f1d176d)

Part 2: Defense-in-Depth Scale-to-Zero Prevention (commit 008d647)

Layer 1: API Validation

Layer 2: Autoscaler Bounds Enforcement

Layer 3: Fallback Logic Safety

Layer 4: Replica Calculation Safety

Layer 5: Operator Reconciliation Safety

Observability Enhancements

Comprehensive Test Coverage

Key API Changes

Edge Cases Addressed

Testing

Migration & Backward Compatibility

Related Issues

Uh oh!

otrosien commented Jan 19, 2026

Uh oh!

adrobisch commented Jan 19, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

adrobisch commented Jan 12, 2026 •

edited by otrosien

Loading

Part 1: Pointer-based Replicas API (commits `c8daefd`, `89e1a58`, `f1d176d`)