
Splunk Operator: Scaling Operations Too Restrictive #1646

@ductrung-nguyen

Description


Please select the type of request

Enhancement

Tell us more

Describe the request

The current implementation requires ALL pods to be in a ready state before performing any scaling operation - scale up or scale down. Honestly, this is very frustrating in real-world scenarios.

Here's what keeps happening to us:

Scale-down is completely blocked when a pod is unhealthy

Imagine a simple case: we had 5 indexers running and wanted to scale down to 3 to cut costs. One indexer was in a bad state (splunkd had crashed). The operator just sat there, doing nothing. We can't remove pods because one pod is sick? That logic doesn't make sense - removing pods doesn't put any additional load on the cluster.

Scale-up waits forever

Same story, the other way around. A traffic spike hit us hard and we needed more search heads ASAP, but one existing pod was having issues, so the operator refused to scale up. No timeout, no override - just "wait until everything is perfect". In production, during a traffic spike.

There's no escape hatch. We've had to manually edit the StatefulSet or restart the operator just to get things moving again.

Expected behavior

For scale-down:

  • Should proceed whenever the current replica count exceeds desiredReplicas, period (a rough sketch of the check follows this list)
  • The pods being removed don't need to be healthy - we're removing them anyway
  • This is consistent with how K8s StatefulSets normally work
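
To make that concrete, here's a minimal sketch of the gate we have in mind. The function and field names are made up for illustration; this is not the operator's actual reconcile code.

// Hypothetical sketch of the proposed scale-down gate, not the operator's
// actual logic. The idea: when the CR asks for fewer replicas than the
// StatefulSet currently has, shrink immediately instead of waiting for
// every pod to become ready.
package main

import "fmt"

// shouldScaleDown returns true whenever fewer replicas are desired than
// currently exist, regardless of how many of them are ready.
func shouldScaleDown(currentReplicas, desiredReplicas int32) bool {
    return desiredReplicas < currentReplicas
}

func main() {
    // The scenario from above: 5 indexers, one unhealthy, target of 3.
    current, desired := int32(5), int32(3)
    if shouldScaleDown(current, desired) {
        fmt.Printf("scale down from %d to %d without waiting for readiness\n", current, desired)
    }
}

The point is that the scale-down decision would depend only on the replica counts, never on how many pods happen to be ready.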

For scale-up:

  • Would be nice to have a configurable timeout instead of waiting indefinitely
  • Something like an annotation: operator.splunk.com/scale-up-ready-wait-timeout: "5m"
  • Default behavior could stay the same (wait forever) for backward compatibility
  • But give us an option to say "wait X minutes then just do it"

Example usage:

apiVersion: enterprise.splunk.com/v4
kind: IndexerCluster
metadata:
  name: idxc
  annotations:
    operator.splunk.com/scale-up-ready-wait-timeout: "5m"
spec:
  replicas: 5

If no annotation is set, keep the current behavior. If set to "0s", skip waiting entirely. If set to "5m", wait up to 5 minutes and then proceed.
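
To make those semantics concrete, here's a rough sketch of the decision the operator could make on the scale-up path. Everything here is a suggestion: the annotation key is the one proposed above, and canScaleUp / waitedSoFar are hypothetical names, not existing operator code.

// Hypothetical sketch of the proposed scale-up timeout handling; none of
// these names exist in the operator today.
package main

import (
    "fmt"
    "time"
)

// Proposed annotation key from the example above.
const scaleUpTimeoutAnnotation = "operator.splunk.com/scale-up-ready-wait-timeout"

// canScaleUp decides whether a scale-up may proceed while some pods are not
// ready. waitedSoFar is how long the operator has already been waiting for
// the cluster to become fully ready.
func canScaleUp(annotations map[string]string, allPodsReady bool, waitedSoFar time.Duration) (bool, error) {
    if allPodsReady {
        return true, nil
    }
    raw, ok := annotations[scaleUpTimeoutAnnotation]
    if !ok {
        // Annotation absent: keep today's behavior and wait indefinitely.
        return false, nil
    }
    timeout, err := time.ParseDuration(raw) // accepts "0s", "5m", "1h30m", ...
    if err != nil {
        return false, fmt.Errorf("invalid %s: %w", scaleUpTimeoutAnnotation, err)
    }
    // "0s" skips waiting entirely; otherwise proceed once the timeout elapses.
    return waitedSoFar >= timeout, nil
}

func main() {
    ann := map[string]string{scaleUpTimeoutAnnotation: "5m"}
    ok, _ := canScaleUp(ann, false, 7*time.Minute)
    fmt.Println("proceed with scale-up:", ok) // true: we've already waited past 5m
}

Using standard Go duration strings would keep the parsing trivial and reject typos with a clear error.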

Splunk setup on K8S

Any StatefulSet-based CR is affected:

  • IndexerCluster
  • SearchHeadCluster
  • ClusterManager
  • Standalone
  • LicenseManager
  • MonitoringConsole

Reproduction/Testing steps

Reproduce stuck scale-down:

  1. Create an IndexerCluster with 5 replicas
  2. Wait for all pods to be ready
  3. Manually kill splunkd on one pod (or just delete the pod)
  4. Try to scale down to 3 replicas by patching spec.replicas (a sketch follows this list)
  5. Watch nothing happen - the operator just waits
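
For step 4, any client works (kubectl patch/edit, GitOps, etc.); as one self-contained illustration, a small Go program using client-go's dynamic client to patch the CR's replica count could look like this. The namespace and kubeconfig path are assumptions.

// Illustration of step 4: patch the IndexerCluster "idxc" down to 3 replicas.
// kubectl patch does the same thing; this is just a self-contained example.
package main

import (
    "context"
    "fmt"
    "os"
    "path/filepath"

    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/runtime/schema"
    "k8s.io/apimachinery/pkg/types"
    "k8s.io/client-go/dynamic"
    "k8s.io/client-go/tools/clientcmd"
)

func main() {
    home, _ := os.UserHomeDir()
    cfg, err := clientcmd.BuildConfigFromFlags("", filepath.Join(home, ".kube", "config"))
    if err != nil {
        panic(err)
    }
    dyn, err := dynamic.NewForConfig(cfg)
    if err != nil {
        panic(err)
    }

    // GVR for the IndexerCluster custom resource used in the example above.
    gvr := schema.GroupVersionResource{
        Group:    "enterprise.splunk.com",
        Version:  "v4",
        Resource: "indexerclusters",
    }

    // Scale down from 5 to 3 while one pod is unhealthy; with the current
    // operator behavior the underlying StatefulSet never shrinks.
    patch := []byte(`{"spec":{"replicas":3}}`)
    _, err = dyn.Resource(gvr).Namespace("default").Patch(
        context.TODO(), "idxc", types.MergePatchType, patch, metav1.PatchOptions{})
    if err != nil {
        panic(err)
    }
    fmt.Println("patched idxc to spec.replicas=3; now watch the operator wait")
}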

Reproduce stuck scale-up:

  1. Have a cluster with at least one unhealthy pod
  2. Try to increase replicas
  3. Operator waits indefinitely for the unhealthy pod to recover

K8s environment

  • Tested on EKS 1.28; the behavior should affect all K8s versions
  • Same behavior on OpenShift 4.12
  • Operator version: 2.6.1

Additional context

We've worked around this by:

  • Manually editing the StatefulSet (not great, operator fights back)
  • Restarting the operator pod (also not great)
  • Fixing the unhealthy pod first (sometimes not possible quickly)

Would really appreciate having more control over this. The current "all or nothing" approach is too rigid for production use.
