We recently experienced a cluster control-plane outage with wave as part of the event chain.
Root cause: A separate controller was constantly updating a Secret (due to a bug). Shortly after each update, wave would update the Deployment, which would in turn create a new ReplicaSet. Because wave updated the Deployment faster than the pods could become ready, Kubernetes kept all of the ReplicaSets. At around 6k ReplicaSets our Kubernetes API servers went OOM and everything stopped working.
Obviously wave did not cause the issue, but it facilitated/amplified it. It should be less aggressive in updating resources.
Potential solutions:
- Add an annotation to Deployments/DaemonSets that tracks the last update. Based on that we could enforce that at least X time has passed (see the sketch after this list).
- Track the last update inside the operator. This would reduce the number of API calls to the Kubernetes API. However, it might cause issues when delays are long and the operator restarts.
- We could also look at the number of ReplicaSets, but that would be specific to Deployments and would not work for other objects.
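A minimal sketch of the annotation-based option, assuming a hypothetical annotation key (`wave.pusher.com/last-update`) and a hypothetical minimum interval; neither exists in wave today, this is just to illustrate the shape of the guard the reconciler could run before touching the workload:

```go
package throttle

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Hypothetical annotation recording when wave last updated the object.
	lastUpdateAnnotation = "wave.pusher.com/last-update"
	// Hypothetical minimum time that must pass between wave-triggered updates.
	minUpdateInterval = 10 * time.Second
)

// shouldUpdate reports whether enough time has passed since the last
// wave-triggered update recorded on the object.
func shouldUpdate(obj metav1.Object, now time.Time) bool {
	raw, ok := obj.GetAnnotations()[lastUpdateAnnotation]
	if !ok {
		return true // never updated by wave before
	}
	last, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return true // unparsable timestamp: fail open rather than block forever
	}
	return now.Sub(last) >= minUpdateInterval
}

// markUpdated stamps the object so the next reconcile can enforce the delay.
func markUpdated(obj metav1.Object, now time.Time) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[lastUpdateAnnotation] = now.Format(time.RFC3339)
	obj.SetAnnotations(annotations)
}
```

Because the timestamp lives on the object itself, the delay survives operator restarts, unlike purely in-memory tracking.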
How do we delay?
- Static delay of X seconds (e.g. 10s) -> we would only have to store the last restart timestamp.
- Exponential backoff (e.g. 8s, 16s, 32s, 64s, ...) -> we would have to store the last restart timestamp plus a backoff count (see the sketch after this list).
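A sketch of how the backoff step could be computed from a stored restart count; the base delay and cap are assumptions, not anything wave defines:

```go
package throttle

import (
	"math"
	"time"
)

const (
	baseDelay = 8 * time.Second // first backoff step (assumption)
	maxDelay  = 5 * time.Minute // cap so a noisy Secret cannot delay updates forever (assumption)
)

// backoffDelay returns the wait enforced after `restarts` consecutive
// wave-triggered restarts: 8s, 16s, 32s, 64s, ... capped at maxDelay.
func backoffDelay(restarts int) time.Duration {
	if restarts <= 0 {
		return 0
	}
	d := time.Duration(float64(baseDelay) * math.Pow(2, float64(restarts-1)))
	if d > maxDelay || d < 0 { // d < 0 guards against overflow for large counts
		return maxDelay
	}
	return d
}
```

The reconciler would reset the count once a rollout completes (e.g. the Deployment's available replicas match the spec), so the backoff only grows while updates arrive faster than pods become ready. Alternatively, client-go's in-memory per-item rate limiter (`workqueue.NewItemExponentialFailureRateLimiter`) could serve the same purpose, at the cost of losing state when the operator restarts.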