We recently experienced a cluster control-plane outage with wave as part of the event chain.
Root cause: A separate controller was constantly updating a Secret (due to a bug). Shortly after each update, wave would update the Deployment, which would in turn create a new ReplicaSet. Because wave updated the Deployment faster than the pods could become ready, Kubernetes kept all of the ReplicaSets. At around 6k ReplicaSets our Kubernetes API servers went OOM and everything stopped working.
Obviously wave did not cause the issue, but it facilitated/amplified it. It should be less aggressive in updating resources.
Potential solutions:
- Add an annotation to Deployments/DaemonSets that tracks the last update. Based on that we could enforce that at least X time has passed (see the sketch after this list).
- Track the last update inside the operator. This would reduce the number of API calls to the Kubernetes API. However, it might cause issues when delays are long and the operator restarts.
- We could also look at the number of ReplicaSets, but that would be specific to Deployments and would not work for other objects.
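A minimal sketch of the annotation-based option, assuming a hypothetical annotation key (`wave.pusher.com/last-update`) and a hypothetical minimum interval; neither exists in wave today, this is just to illustrate the shape of the guard the reconciler could run before touching the workload:

```go
package throttle

import (
	"time"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

const (
	// Hypothetical annotation recording when wave last updated the object.
	lastUpdateAnnotation = "wave.pusher.com/last-update"
	// Hypothetical minimum time that must pass between wave-triggered updates.
	minUpdateInterval = 10 * time.Second
)

// shouldUpdate reports whether enough time has passed since the last
// wave-triggered update recorded on the object.
func shouldUpdate(obj metav1.Object, now time.Time) bool {
	raw, ok := obj.GetAnnotations()[lastUpdateAnnotation]
	if !ok {
		return true // never updated by wave before
	}
	last, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return true // unparsable timestamp: fail open rather than block forever
	}
	return now.Sub(last) >= minUpdateInterval
}

// markUpdated stamps the object so the next reconcile can enforce the delay.
func markUpdated(obj metav1.Object, now time.Time) {
	annotations := obj.GetAnnotations()
	if annotations == nil {
		annotations = map[string]string{}
	}
	annotations[lastUpdateAnnotation] = now.Format(time.RFC3339)
	obj.SetAnnotations(annotations)
}
```

Because the timestamp lives on the object itself, the delay survives operator restarts, unlike purely in-memory tracking.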
How do we delay?
- Static delay of X seconds (e.g. 10s) -> we would only have to store the last restart timestamp.
- Exponential backoff (e.g. 8s, 16s, 32s, 64s, ...) -> we would have to store the last restart timestamp plus a backoff count (see the sketch after this list).
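A sketch of how the backoff step could be computed from a stored restart count; the base delay and cap are assumptions, not anything wave defines:

```go
package throttle

import (
	"math"
	"time"
)

const (
	baseDelay = 8 * time.Second // first backoff step (assumption)
	maxDelay  = 5 * time.Minute // cap so a noisy Secret cannot delay updates forever (assumption)
)

// backoffDelay returns the wait enforced after `restarts` consecutive
// wave-triggered restarts: 8s, 16s, 32s, 64s, ... capped at maxDelay.
func backoffDelay(restarts int) time.Duration {
	if restarts <= 0 {
		return 0
	}
	d := time.Duration(float64(baseDelay) * math.Pow(2, float64(restarts-1)))
	if d > maxDelay || d < 0 { // d < 0 guards against overflow for large counts
		return maxDelay
	}
	return d
}
```

The reconciler would reset the count once a rollout completes (e.g. the Deployment's available replicas match the spec), so the backoff only grows while updates arrive faster than pods become ready. Alternatively, client-go's in-memory per-item rate limiter (`workqueue.NewItemExponentialFailureRateLimiter`) could serve the same purpose, at the cost of losing state when the operator restarts.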