
helm: add optional VerticalPodAutoscaler for the operator Deployment#12565

Closed
QuentinBisson wants to merge 1 commit into strimzi:main from QuentinBisson:feat/vpa

Conversation

@QuentinBisson
Contributor

Problem

The cluster operator has fixed resources.requests/limits in values.yaml, but the optimal values vary significantly with the number of Kafka clusters, topics, and users being managed. Without VPA, operators must tune resources manually based on observation, and OOMKill events are common in larger deployments.

Changes

Add an opt-in VerticalPodAutoscaler resource targeting the strimzi-cluster-operator Deployment.

  • Disabled by default (verticalPodAutoscaler.enabled: false) — no VPA CRDs are required in environments that don't use VPA
  • Configurable updateMode (Auto | Recreate | Initial | Off)
  • Controls both CPU and memory via controlledResources
Example configuration:

  verticalPodAutoscaler:
    enabled: true
    updateMode: "Auto"
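Under these values, the chart template gating the new resource might look like the following sketch (the file path and label values are assumptions for illustration, not the actual chart contents):

```yaml
# templates/vpa.yaml (hypothetical sketch of the opt-in template)
{{- if .Values.verticalPodAutoscaler.enabled }}
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: strimzi-cluster-operator
  labels:
    app: strimzi
spec:
  # Point the VPA at the operator Deployment
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: strimzi-cluster-operator
  updatePolicy:
    updateMode: {{ .Values.verticalPodAutoscaler.updateMode | quote }}
  resourcePolicy:
    containerPolicies:
      - containerName: strimzi-cluster-operator
        controlledResources: ["cpu", "memory"]
{{- end }}
```

Because the whole manifest sits inside the `if` block, clusters without the VPA CRDs installed never see the resource.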

Backwards compatibility

Disabled by default — no impact on existing installations.

The cluster operator has fixed resource requests and limits, but the
optimal values vary significantly depending on the number of Kafka
clusters and topics being managed. Without VPA, operators must tune
resources manually and reactively.

Add an opt-in VPA resource (disabled by default) that targets the
strimzi-cluster-operator Deployment. When enabled, VPA recommends and
optionally applies CPU/memory adjustments automatically.

Configuration:
  verticalPodAutoscaler:
    enabled: true          # requires VPA CRDs on the cluster
    updateMode: "Auto"     # Auto | Recreate | Initial | Off

Disabled by default to avoid requiring VPA CRDs in all environments.
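When enabled with updateMode: "Auto", the rendered resource would look roughly like this (a sketch based on the autoscaling.k8s.io/v1 VPA API; any field not named in the PR description is an assumption):

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: strimzi-cluster-operator
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: strimzi-cluster-operator
  updatePolicy:
    updateMode: "Auto"   # VPA may evict pods to apply new requests
  resourcePolicy:
    containerPolicies:
      - containerName: strimzi-cluster-operator
        controlledResources: ["cpu", "memory"]
```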

Signed-off-by: QuentinBisson <quentin@giantswarm.io>
Member

@scholzj scholzj left a comment


Thanks for the PR. However - as mentioned in the previous PR(s) - before spamming us with 7 PRs for the same thing, you should ideally first talk with us about whether the things are actually considered useful and if we are interested in them. You should also check whether they should be opened as separate PRs or not.

Also, please make sure to use the PR template we use! And make sure to update the Helm Chart's README.md file with the new options!


I'm not sure this should be included. Vertical Pod Autoscaling of the Strimzi operator requires proper knowledge and experience of the operator. If you need it, I think you should add it yourself.

@QuentinBisson
Contributor Author

The concern about operator-specific VPA knowledge is valid, which is exactly why I propose shipping this disabled by default (verticalPodAutoscaler.enabled: false). Users who enable it are opting in knowingly. This is the same pattern used for PodDisruptionBudget and NetworkPolicy in this very chart — both are off by default and both require platform knowledge to use correctly.

Happy to update the PR with clearer documentation around the risks if that helps.

@im-konge
Member

im-konge commented Apr 2, 2026

A few things on this. I haven't used the Vertical Pod Autoscaler that much, but I'm not sure we want to have something like this in the Helm charts. It also creates another path that is "supported" by us but not maintained and tested. If someone has such a desire and need, they can create it without it having to be in the Helm chart. We are not using this in the regular YAML manifests, and I think our Helm charts should just follow what we have there. So those are actually two things that would be, from my side, against adding it to the Helm charts.

> Kafka clusters, topics, and users being managed.

Okay, I take Kafka clusters, but KafkaTopics and KafkaUsers are managed by two different operators, so this autoscaling would not help in those areas.

> OOMKill events are common in larger deployments

Are they common? Users who know they will need more resources can change the values themselves. Users should know what they are doing and what is needed.

Anyway, as I said, I don't much like the idea of having something that is not really maintained on our side (as we do not use it regularly) and not tested at all.

@scholzj
Copy link
Copy Markdown
Member

scholzj commented Apr 2, 2026

> > OOMKill events are common in larger deployments
>
> Are they common? I mean, users that know they will need more resources, they can change the value. The users should know what they are doing and what is needed.
>
> Anyway, as I said, I don't like much the idea of having something that is not that maintained from our side (as we are not using it regularly) and not tested at all.

I'm not sure VPA fixes any OOM, given Java's tendency to use any memory you throw at it. OOM is, in general, the result of some misconfiguration.

@QuentinBisson
Contributor Author

I understand your concern, and I can definitely add it to my umbrella chart. I'll close this pull request then :)
