Wave updates Deployments very quickly and can DDoS the Kubernetes API with a large number of ReplicaSets #182

@jabdoa2

Description

We recently experienced a cluster control-plane outage with wave as part of the event chain.

Root cause: A separate controller was constantly updating a Secret (due to a bug). Shortly afterwards, wave would update the Deployment, which would in turn create a new ReplicaSet. Because wave updated the Deployment faster than the pods could become ready, Kubernetes kept all of the ReplicaSets. At around 6k ReplicaSets our Kubernetes API servers went OOM and everything stopped working.

Obviously wave did not cause the issue, but it facilitated and amplified it. It should be less aggressive when updating resources.

Potential solutions:

  • Add an annotation to Deployments/DaemonSets that tracks the last update. Based on that, we could enforce that at least X amount of time has passed before updating again (see the sketch after this list).
  • Track the last update inside the operator. This would reduce the number of calls to the Kubernetes API. However, it might cause issues when delays are long and the operator restarts.
  • We could also look at the number of ReplicaSets, but that would be specific to Deployments and would not work for other objects.
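
A rough sketch of the annotation-based option, assuming a hypothetical annotation key (`wave.example.com/last-update`) and a fixed minimum interval; Wave's actual controller wiring and naming will differ:

```go
package updatelimit

import (
	"time"

	appsv1 "k8s.io/api/apps/v1"
)

// Hypothetical annotation key and interval, not part of Wave's current API.
const (
	lastUpdateAnnotation = "wave.example.com/last-update"
	minUpdateInterval    = 10 * time.Second
)

// shouldUpdate reports whether enough time has passed since the last
// Wave-triggered update of this Deployment. If not, it returns how long
// the reconciler should wait before trying again.
func shouldUpdate(d *appsv1.Deployment, now time.Time) (bool, time.Duration) {
	raw, ok := d.Annotations[lastUpdateAnnotation]
	if !ok {
		return true, 0 // never updated by us, go ahead
	}
	last, err := time.Parse(time.RFC3339, raw)
	if err != nil {
		return true, 0 // unreadable timestamp, treat as stale
	}
	if elapsed := now.Sub(last); elapsed < minUpdateInterval {
		return false, minUpdateInterval - elapsed
	}
	return true, 0
}

// markUpdated stamps the annotation after the Deployment has actually
// been modified, so the next reconcile can enforce the interval.
func markUpdated(d *appsv1.Deployment, now time.Time) {
	if d.Annotations == nil {
		d.Annotations = map[string]string{}
	}
	d.Annotations[lastUpdateAnnotation] = now.Format(time.RFC3339)
}
```

In a controller-runtime reconciler the `false` branch would map naturally onto `ctrl.Result{RequeueAfter: wait}`, so the pending update is retried once the interval has elapsed rather than dropped.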

How do we delay?

  • Static delay of X seconds (e.g. 10s) -> We would only have to store the last restart timestamp.
  • Exponential backoff (e.g. 8s, 16s, 32s, 64s, ...) -> We would have to store the last restart timestamp plus a backoff count (see the sketch below).
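
A similar sketch for the exponential-backoff variant, again using hypothetical annotation keys; the count would be reset once the workload settles:

```go
package updatelimit

import (
	"strconv"
	"time"
)

// Hypothetical annotation keys, not part of Wave's current API.
const (
	lastRestartAnnotation  = "wave.example.com/last-restart"
	backoffCountAnnotation = "wave.example.com/backoff-count"
	baseDelay              = 8 * time.Second
	maxDelay               = 5 * time.Minute
)

// nextDelay returns how long to wait before the next Wave-triggered update,
// doubling the base delay for every consecutive restart recorded so far.
func nextDelay(annotations map[string]string) time.Duration {
	count, _ := strconv.Atoi(annotations[backoffCountAnnotation])
	delay := baseDelay << uint(count) // 8s, 16s, 32s, 64s, ...
	if delay <= 0 || delay > maxDelay {
		delay = maxDelay // cap the backoff (and guard against shift overflow)
	}
	return delay
}

// recordRestart stamps the restart time and bumps the backoff count.
// A reconcile that finds the rollout healthy again would clear both keys.
func recordRestart(annotations map[string]string, now time.Time) {
	count, _ := strconv.Atoi(annotations[backoffCountAnnotation])
	annotations[lastRestartAnnotation] = now.Format(time.RFC3339)
	annotations[backoffCountAnnotation] = strconv.Itoa(count + 1)
}
```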
