⚠️ Split Helm chart into operator and providers charts with optional dependency #832

Open: kahirokunn wants to merge 5 commits into main from fix-flaky-install2

Conversation

kahirokunn
Member

@kahirokunn kahirokunn commented Jun 19, 2025

Fixes: #534

What type of PR is this?

/kind bug

What this PR does / why we need it:

This PR fixes the flaky Helm installation issue where provider Custom Resources fail to install due to webhook validation errors. The root cause is that provider CRs are being applied at the same time as the operator deployment, before the webhook service is ready.

Problem:
When installing the cluster-api-operator Helm chart, users frequently encounter errors like:

Error: failed post-install: warning: Hook post-install cluster-api-operator/templates/core-conditions.yaml failed: 1 error occurred:
        * Internal error occurred: failed calling webhook "vcoreprovider.kb.io": failed to call webhook: Post "https://capi-operator-webhook-service.cluster-api-operator-docker.svc:443/mutate-operator-cluster-x-k8s-io-v1alpha2-coreprovider?timeout=10s": no endpoints available for service "capi-operator-webhook-service"
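The error above is raised while the webhook Service still has no ready endpoints. As a rough illustration of the timing problem (not part of this PR), a script could poll for endpoints before any provider CR is applied. The helper name `wait_for_webhook_endpoints`, the default of 30 polls, and the 2-second pause are assumptions; only the Service name comes from the error message.

```sh
#!/bin/sh
# Sketch only: poll until a Service has at least one ready endpoint address.
# wait_for_webhook_endpoints is a hypothetical helper, not part of the chart.
wait_for_webhook_endpoints() {
  namespace="$1"
  service="$2"
  attempts="${3:-30}"   # assumed default: 30 polls

  i=0
  while [ "$i" -lt "$attempts" ]; do
    # Ready endpoint IPs are listed under .subsets[].addresses[].ip
    ips="$(kubectl -n "$namespace" get endpoints "$service" \
      -o jsonpath='{.subsets[*].addresses[*].ip}' 2>/dev/null)"
    if [ -n "$ips" ]; then
      echo "webhook endpoints ready: $ips"
      return 0
    fi
    i=$((i + 1))
    sleep 2   # assumed pause between polls
  done

  echo "timed out waiting for endpoints of $service" >&2
  return 1
}
```

It could be called as `wait_for_webhook_endpoints capi-operator-system capi-operator-webhook-service` before installing provider CRs, with the namespace adjusted to wherever the operator runs.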

Solution:
Split the Helm chart into two separate charts:

  1. cluster-api-operator - Contains only the operator deployment and its resources (providers removed)
  2. cluster-api-operator-providers - Contains all provider Custom Resources with optional operator dependency

Which issue(s) this PR fixes:

Fixes #534

Special notes for your reviewer:

⚠️ BREAKING CHANGE: The cluster-api-operator chart no longer includes provider CRs. Users must now use the cluster-api-operator-providers chart to install providers.

For new users - Two-step installation (recommended, no errors):

helm install capi-operator capi-operator/cluster-api-operator \
  --create-namespace -n capi-operator-system --wait

helm install capi-providers capi-operator/cluster-api-operator-providers \
  -n capi-operator-system --set cluster-api-operator.install=false \
  --set infrastructure.docker.enabled=true

Single-step installation (backward compatibility, may require retry):

helm install capi-providers capi-operator/cluster-api-operator-providers \
  --create-namespace -n capi-operator-system \
  --set infrastructure.docker.enabled=true

# If above fails, retry with upgrade
helm upgrade --install capi-providers capi-operator/cluster-api-operator-providers \
  -n capi-operator-system --set infrastructure.docker.enabled=true
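Because `helm upgrade --install` can be re-run safely, the manual retry above can be wrapped in a small loop. This is a minimal sketch, assuming a hypothetical `retry` helper; the attempt count and delay are illustrative choices, not values from this PR.

```sh
#!/bin/sh
# Sketch only: run a command until it succeeds or the attempt limit is hit.
# retry is a hypothetical helper; callers pick the limit and the delay.
retry() {
  max_attempts="$1"
  delay_seconds="$2"
  shift 2

  attempt=1
  while :; do
    # Run the remaining arguments as the command to retry.
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "command failed after $attempt attempts: $*" >&2
      return 1
    fi
    attempt=$((attempt + 1))
    sleep "$delay_seconds"
  done
}

# Example call (not executed here):
#   retry 5 10 helm upgrade --install capi-providers \
#     capi-operator/cluster-api-operator-providers \
#     --create-namespace -n capi-operator-system \
#     --set infrastructure.docker.enabled=true
```

The two-step installation remains the more predictable option, since it avoids the race instead of retrying through it.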

Release note:

BREAKING CHANGE: The cluster-api-operator Helm chart has been split into two charts. Provider CRs have been moved from `cluster-api-operator` to a new `cluster-api-operator-providers` chart. Existing users must migrate to use the new providers chart. The providers chart includes an optional dependency on the operator chart for easier installation.

@k8s-ci-robot k8s-ci-robot added the do-not-merge/invalid-commit-message Indicates that a PR should not merge because it has an invalid commit message. label Jun 19, 2025

netlify bot commented Jun 19, 2025

Deploy Preview for kubernetes-sigs-cluster-api-operator ready!

Name Link
🔨 Latest commit 30820f6
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-cluster-api-operator/deploys/6881f75e709d1400084a6b0c
😎 Deploy Preview https://deploy-preview-832--kubernetes-sigs-cluster-api-operator.netlify.app

@k8s-ci-robot k8s-ci-robot added the cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. label Jun 19, 2025
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign alexander-demicev for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jun 19, 2025
@kahirokunn kahirokunn changed the title fix: split provider CRs from operator deployment 🐛 Fix flaky Helm installations by separating provider CRs from operator deployment Jun 19, 2025
@kahirokunn kahirokunn marked this pull request as draft June 19, 2025 07:37
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2025
@kahirokunn kahirokunn marked this pull request as ready for review June 19, 2025 07:41
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2025
@kahirokunn kahirokunn marked this pull request as draft June 19, 2025 09:03
@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2025
@kahirokunn kahirokunn marked this pull request as ready for review June 19, 2025 15:47
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jun 19, 2025
@k8s-ci-robot k8s-ci-robot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jun 20, 2025
@kahirokunn kahirokunn force-pushed the fix-flaky-install2 branch 9 times, most recently from 2ee1de5 to 51174e4 Compare June 20, 2025 05:35
@Danil-Grigorev
Member

Hey @kahirokunn, can you PTAL at failing tests? Thanks

@kahirokunn
Member Author

Thanks for taking a look!

Before I dive into fixing the failing tests, I wanted to check if this feature/approach is something you'd be interested in merging. I don't want to spend time on test fixes if the overall direction isn't what the project needs.

Would love to hear your thoughts on the concept first - any initial feedback would be super helpful!

@Danil-Grigorev
Member

Danil-Grigorev commented Jul 22, 2025

It makes sense in general, but the providers chart seems to be a better fit as an operator chart dependency. Depending on the configuration, this may make the change non-breaking.

@kahirokunn kahirokunn force-pushed the fix-flaky-install2 branch from 18080c7 to b54f2ad Compare July 23, 2025 04:00
@kahirokunn kahirokunn changed the title ⚠ Split Helm chart into operator and providers charts ⚠️ Split Helm chart into operator and providers charts with optional dependency Jul 23, 2025
@kahirokunn kahirokunn force-pushed the fix-flaky-install2 branch 8 times, most recently from b5a4c5c to 679bf1f Compare July 23, 2025 13:04
@kahirokunn
Member Author

@Danil-Grigorev I tried changing it. How does it look? 👀

@kahirokunn kahirokunn force-pushed the fix-flaky-install2 branch 5 times, most recently from 5b0b9a5 to 86cb427 Compare July 24, 2025 06:10
@k8s-ci-robot k8s-ci-robot added size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 24, 2025
@kahirokunn kahirokunn force-pushed the fix-flaky-install2 branch from 86cb427 to 1d115b9 Compare July 24, 2025 06:40
Separate cluster-api-operator into two charts to fix webhook timing issues:
- cluster-api-operator: operator deployment only
- cluster-api-operator-providers: provider Custom Resources with optional operator dependency

This ensures webhook readiness before applying provider CRs, preventing
"no endpoints available" errors during installation.

The providers chart includes cluster-api-operator as a conditional dependency
(install: true by default), maintaining backward compatibility while allowing
flexible deployment scenarios:

**Recommended: Two-step installation (no errors):**
```sh
helm install capi-operator capi-operator/cluster-api-operator \
  --create-namespace -n capi-operator-system --wait

helm install capi-providers capi-operator/cluster-api-operator-providers \
  -n capi-operator-system --set cluster-api-operator.install=false \
  --set infrastructure.docker.enabled=true
```

**Backward compatibility: Single-step installation (may require retry):**
```sh
helm install capi-providers capi-operator/cluster-api-operator-providers \
  --create-namespace -n capi-operator-system \
  --set infrastructure.docker.enabled=true

helm upgrade --install capi-providers capi-operator/cluster-api-operator-providers \
  -n capi-operator-system --set infrastructure.docker.enabled=true
```

Signed-off-by: kahirokunn <[email protected]>
Update hack/chart-update/main.go to process both cluster-api-operator
and cluster-api-operator-providers charts when updating index.yaml.
This ensures all charts are properly registered in the helm repository
index during the release process.

Signed-off-by: kahirokunn <[email protected]>
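For readers unfamiliar with how several charts end up in one repository index, the idea in the commit above can be approximated with the Helm CLI. This sketch does not reproduce what `hack/chart-update/main.go` does; the helper name `package_and_index`, the chart directories, and the repository URL are all hypothetical.

```sh
#!/bin/sh
# Sketch only: package every given chart directory, then build one index.yaml.
# package_and_index is a hypothetical helper, not the project's release tooling.
package_and_index() {
  out_dir="$1"
  repo_url="$2"
  shift 2

  # Package each chart into the shared output directory.
  for chart_dir in "$@"; do
    helm package "$chart_dir" --destination "$out_dir" || return 1
  done

  # Generate index.yaml covering every packaged chart in out_dir.
  helm repo index "$out_dir" --url "$repo_url"
}

# Example call with placeholder paths (not executed here):
#   package_and_index ./out https://example.invalid/charts \
#     ./charts/cluster-api-operator ./charts/cluster-api-operator-providers
```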
Providers can now define their own configSecret:
  core:
    cluster-api:
      configSecret:
        name: core-secret
        namespace: capi-system

If not specified, providers will use the global configSecret.

Signed-off-by: kahirokunn <[email protected]>
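The per-provider `configSecret` override in the commit above could also be passed on the command line. This is a minimal sketch, assuming a hypothetical wrapper `install_providers_with_core_secret`; the value keys mirror the commit message, while the release name and namespace arguments are illustrative. Whether further values are needed to enable the core provider is not stated in this PR.

```sh
#!/bin/sh
# Sketch only: install the providers chart with a core-provider configSecret.
# install_providers_with_core_secret is a hypothetical wrapper.
install_providers_with_core_secret() {
  release="$1"
  namespace="$2"

  # Keys follow the commit message: core.cluster-api.configSecret.{name,namespace}
  helm upgrade --install "$release" capi-operator/cluster-api-operator-providers \
    -n "$namespace" \
    --set cluster-api-operator.install=false \
    --set core.cluster-api.configSecret.name=core-secret \
    --set core.cluster-api.configSecret.namespace=capi-system
}

# Example call (not executed here):
#   install_providers_with_core_secret capi-providers capi-operator-system
```

The referenced secret is assumed to exist already in the `capi-system` namespace.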
Add a new GitHub Actions workflow for smoke testing.

Signed-off-by: kahirokunn <[email protected]>
@kahirokunn kahirokunn force-pushed the fix-flaky-install2 branch from 1d115b9 to 486dade Compare July 24, 2025 06:52
…sues

This commit fixes intermittent test failures in the helm e2e tests caused by
inconsistent whitespace handling between Helm output and expected manifest files.

Signed-off-by: kahirokunn <[email protected]>
@kahirokunn
Member Author

/test pull-cluster-api-operator-test-main

appVersion: "0.0.0"
dependencies:
  - name: cluster-api-operator
    repository: file://../cluster-api-operator
Member

It probably won't work outside of a locally built chart. On second thought, keeping the charts separate would be simpler from a release perspective.

Labels
cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Helm Chart CRDs Placement Causes Flaky Installations
3 participants