Description
Please select the type of request
Enhancement
Tell us more
Describe the request
The current implementation requires ALL pods to be in a ready state before performing any scaling action - scale up or scale down. Honestly, this is very frustrating in real-world scenarios.
Here's what keeps happening to us:
Scale-down is completely blocked when a pod is unhealthy
Take a simple case: we had 5 indexers running and wanted to scale down to 3 to cut costs. One indexer was in a bad state (splunkd had crashed). The operator just sat there, doing nothing. We can't remove pods because one pod is sick? That logic doesn't really make sense - removing pods doesn't put any additional load on the cluster.
Scale-up waits forever
Same story, the other way around. A traffic spike hit us hard and we needed more search heads ASAP, but one existing pod was having issues, so the operator refused to scale up. No timeout, no override - just "wait until everything is perfect". In production. During a traffic spike.
There's literally no escape hatch. We've had to manually mess with the StatefulSet or restart the operator just to get things moving again.
Expected behavior
For scale-down:
- Should proceed whenever replicas > desiredReplicas, period
- The pods being removed don't need to be healthy - we're removing them anyway (see the sketch after this list)
- This is consistent with how K8s StatefulSets normally work
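For illustration, here is a minimal Go sketch of the decision rule we're asking for. The function name and signature are hypothetical, not the operator's actual reconcile code - it just shows that the scale-down branch only needs to compare replica counts, while readiness only matters for scale-up:

```go
package main

import "fmt"

// decideScaleAction returns the action a reconciler would take for a
// StatefulSet that currently has `current` replicas, of which `ready`
// are ready, when the CR asks for `desired` replicas.
func decideScaleAction(current, ready, desired int32) string {
	switch {
	case desired < current:
		// Proposed behavior: always allow scale-down, even if ready < current.
		// The pods being removed are going away regardless of their health.
		return "scale down"
	case desired > current && ready == current:
		// All existing pods ready: scale-up can proceed (current behavior).
		return "scale up"
	case desired > current:
		// Some existing pods not ready: today the operator waits here forever.
		return "wait for existing pods to become ready"
	default:
		return "no change"
	}
}

func main() {
	// 5 indexers, 1 unhealthy, target 3: should scale down immediately.
	fmt.Println(decideScaleAction(5, 4, 3))
	// 3 indexers, 1 unhealthy, target 5: still blocked under today's rules.
	fmt.Println(decideScaleAction(3, 2, 5))
}
```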
For scale-up:
- Would be nice to have a configurable timeout instead of waiting indefinitely
- Something like an annotation: operator.splunk.com/scale-up-ready-wait-timeout: "5m"
- Default behavior could stay the same (wait forever) for backward compatibility
- But give us an option to say "wait X minutes then just do it"
Example usage:
```yaml
apiVersion: enterprise.splunk.com/v4
kind: IndexerCluster
metadata:
  name: idxc
  annotations:
    operator.splunk.com/scale-up-ready-wait-timeout: "5m"
spec:
  replicas: 5
```
If no annotation is set, keep the current behavior. If set to "0s", skip waiting entirely. If set to "5m", wait up to 5 minutes, then proceed.
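To make those semantics concrete, here is a hedged Go sketch of how such an annotation could be evaluated. The annotation key is the one proposed above, but the helper function and its signature are made up for illustration - nothing like this exists in the operator today:

```go
package main

import (
	"fmt"
	"time"
)

// Proposed annotation key (part of this request, not an existing operator API).
const scaleUpTimeoutAnnotation = "operator.splunk.com/scale-up-ready-wait-timeout"

// shouldProceedWithScaleUp reports whether a reconciler may scale up even
// though some pods are not ready, given the CR's annotations and how long
// it has already been waiting for readiness.
func shouldProceedWithScaleUp(annotations map[string]string, waitingSince time.Time) (bool, error) {
	raw, ok := annotations[scaleUpTimeoutAnnotation]
	if !ok || raw == "" {
		// No annotation: keep current behavior and wait indefinitely.
		return false, nil
	}
	timeout, err := time.ParseDuration(raw) // e.g. "5m", "0s"
	if err != nil {
		return false, fmt.Errorf("invalid %s value %q: %w", scaleUpTimeoutAnnotation, raw, err)
	}
	if timeout == 0 {
		// "0s": never block scale-up on readiness.
		return true, nil
	}
	// Otherwise wait up to the configured duration, then proceed anyway.
	return time.Since(waitingSince) >= timeout, nil
}

func main() {
	ann := map[string]string{scaleUpTimeoutAnnotation: "5m"}
	proceed, err := shouldProceedWithScaleUp(ann, time.Now().Add(-6*time.Minute))
	fmt.Println(proceed, err) // true <nil>: already waited longer than 5m
}
```

Returning an error for an unparseable value (rather than silently falling back to waiting forever) would make a misconfigured duration easy to surface in the CR's status or events.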
Splunk setup on K8S
Any StatefulSet-based CR is affected:
- IndexerCluster
- SearchHeadCluster
- ClusterManager
- Standalone
- LicenseManager
- MonitoringConsole
Reproduction/Testing steps
Reproduce stuck scale-down:
- Create an IndexerCluster with 5 replicas
- Wait for all pods to be ready
- Manually kill splunkd on one pod (or just delete the pod)
- Try to scale down to 3 replicas
- Watch nothing happen - the operator just waits
Reproduce stuck scale-up:
- Have a cluster with at least one unhealthy pod
- Try to increase replicas
- Operator waits indefinitely for the unhealthy pod to recover
K8s environment
- Tested on EKS 1.28; this should affect all K8s versions
- Same behavior on OpenShift 4.12
- Operator version: 2.6.1
Additional context
We've worked around this by:
- Manually editing the StatefulSet (not great - the operator reconciles it back)
- Restarting the operator pod (also not great)
- Fixing the unhealthy pod first (sometimes not possible quickly)
We would really appreciate having more control over this. The current "all or nothing" approach is too rigid for production use.