
Splunk Operator: Scaling Operations Too Restrictive #1646

@ductrung-nguyen

Description


Please select the type of request

Enhancement

Tell us more

Describe the request

The current implementation requires ALL pods to be in a ready state before performing any scaling operation - scale up or scale down. Honestly, this is very frustrating in real-world scenarios.

Here's what keeps happening to us:

Scale-down is completely blocked when a pod is unhealthy

Imagine a simple case: we had 5 indexers running and wanted to scale down to 3 to cut costs. One indexer was in a bad state (splunkd had crashed). The operator just sat there, doing nothing. We can't remove pods because one pod is sick? That logic doesn't make sense - removing pods doesn't put any additional load on the cluster.

Scale-up waits forever

Same story, the other way around. A traffic spike hit us hard and we needed more search heads ASAP, but one existing pod was having issues, so the operator refused to scale up. No timeout, no override - just "wait until everything is perfect". In production, during a traffic spike.

There's no escape hatch. We've had to manually edit the StatefulSet or restart the operator just to get things moving again.

Expected behavior

For scale-down:

  • Should proceed whenever the current replica count exceeds desiredReplicas, period (a rough sketch of the check follows this list)
  • The pods being removed don't need to be healthy - we're removing them anyway
  • This is consistent with how K8s StatefulSets normally work
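
To make that concrete, here's a minimal sketch of the gate we have in mind. The function and field names are made up for illustration; this is not the operator's actual reconcile code.

// Hypothetical sketch of the proposed scale-down gate, not the operator's
// actual logic. The idea: when the CR asks for fewer replicas than the
// StatefulSet currently has, shrink immediately instead of waiting for
// every pod to become ready.
package main

import "fmt"

// shouldScaleDown returns true whenever fewer replicas are desired than
// currently exist, regardless of how many of them are ready.
func shouldScaleDown(currentReplicas, desiredReplicas int32) bool {
    return desiredReplicas < currentReplicas
}

func main() {
    // The scenario from above: 5 indexers, one unhealthy, target of 3.
    current, desired := int32(5), int32(3)
    if shouldScaleDown(current, desired) {
        fmt.Printf("scale down from %d to %d without waiting for readiness\n", current, desired)
    }
}

The point is that the scale-down decision would depend only on the replica counts, never on how many pods happen to be ready.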

For scale-up:

  • Would be nice to have a configurable timeout instead of waiting indefinitely
  • Something like an annotation: operator.splunk.com/scale-up-ready-wait-timeout: "5m"
  • Default behavior could stay the same (wait forever) for backward compatibility
  • But give us an option to say "wait X minutes then just do it"

Example usage:

apiVersion: enterprise.splunk.com/v4
kind: IndexerCluster
metadata:
  name: idxc
  annotations:
    operator.splunk.com/scale-up-ready-wait-timeout: "5m"
spec:
  replicas: 5

If no annotation is set, keep the current behavior. If set to "0s", skip waiting entirely. If set to "5m", wait up to 5 minutes and then proceed.
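
To make those semantics concrete, here's a rough sketch of the decision the operator could make on the scale-up path. Everything here is a suggestion: the annotation key is the one proposed above, and canScaleUp / waitedSoFar are hypothetical names, not existing operator code.

// Hypothetical sketch of the proposed scale-up timeout handling; none of
// these names exist in the operator today.
package main

import (
    "fmt"
    "time"
)

// Proposed annotation key from the example above.
const scaleUpTimeoutAnnotation = "operator.splunk.com/scale-up-ready-wait-timeout"

// canScaleUp decides whether a scale-up may proceed while some pods are not
// ready. waitedSoFar is how long the operator has already been waiting for
// the cluster to become fully ready.
func canScaleUp(annotations map[string]string, allPodsReady bool, waitedSoFar time.Duration) (bool, error) {
    if allPodsReady {
        return true, nil
    }
    raw, ok := annotations[scaleUpTimeoutAnnotation]
    if !ok {
        // Annotation absent: keep today's behavior and wait indefinitely.
        return false, nil
    }
    timeout, err := time.ParseDuration(raw) // accepts "0s", "5m", "1h30m", ...
    if err != nil {
        return false, fmt.Errorf("invalid %s: %w", scaleUpTimeoutAnnotation, err)
    }
    // "0s" skips waiting entirely; otherwise proceed once the timeout elapses.
    return waitedSoFar >= timeout, nil
}

func main() {
    ann := map[string]string{scaleUpTimeoutAnnotation: "5m"}
    ok, _ := canScaleUp(ann, false, 7*time.Minute)
    fmt.Println("proceed with scale-up:", ok) // true: we've already waited past 5m
}

Using standard Go duration strings would keep the parsing trivial and reject typos with a clear error.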

Splunk setup on K8S

Any StatefulSet-based CR is affected:

  • IndexerCluster
  • SearchHeadCluster
  • ClusterManager
  • Standalone
  • LicenseManager
  • MonitoringConsole

Reproduction/Testing steps

Reproduce stuck scale-down:

  1. Create an IndexerCluster with 5 replicas
  2. Wait for all pods to be ready
  3. Manually kill splunkd on one pod (or just delete the pod)
  4. Try to scale down to 3 replicas by patching spec.replicas (a sketch follows this list)
  5. Watch nothing happen - the operator just waits
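
For step 4, any client works (kubectl patch/edit, GitOps, etc.); as one self-contained illustration, a small Go program using client-go's dynamic client to patch the CR's replica count could look like this. The namespace and kubeconfig path are assumptions.

// Illustration of step 4: patch the IndexerCluster "idxc" down to 3 replicas.
// kubectl patch does the same thing; this is just a self-contained example.
package main

import (
    "context"
    "fmt"
    "os"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    home, _ := os.UserHomeDir()
    cfg, err := clientcmd.BuildConfigFromFlags("", filepath.Join(home, ".kube", "config"))
    if err != nil {
        panic(err)
    }
    dyn, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // GVR for the IndexerCluster custom resource used in the example above.
    gvr := schema.GroupVersionResource{
        Group:    "enterprise.splunk.com",
        Version:  "v4",
        Resource: "indexerclusters",
    }

    // Scale down from 5 to 3 while one pod is unhealthy; with the current
    // operator behavior the underlying StatefulSet never shrinks.
    patch := []byte(`{"spec":{"replicas":3}}`)
    _, err = dyn.Resource(gvr).Namespace("default").Patch(
        context.TODO(), "idxc", types.MergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("patched idxc to spec.replicas=3; now watch the operator wait")
}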

Reproduce stuck scale-up:

  1. Have a cluster with at least one unhealthy pod
  2. Try to increase replicas
  3. Operator waits indefinitely for the unhealthy pod to recover

K8s environment

  • Tested on EKS 1.28; the behavior should affect all K8s versions
  • Same behavior on OpenShift 4.12
  • Operator version: 2.6.1

Additional context

We've worked around this by:

  • Manually editing the StatefulSet (not great, operator fights back)
  • Restarting the operator pod (also not great)
  • Fixing the unhealthy pod first (sometimes not possible quickly)

Would really appreciate having more control over this. The current "all or nothing" approach is too rigid for production use.
