🐛 Fix panic when OpenStackCluster.Status.Network is nil in HCP scenarios #2635

Draft
wants to merge 10 commits into
base: main
from

Conversation

bnallapeta
Contributor

What this PR does / why we need it:
Fixes a nil pointer panic in OpenStackMachineReconciler when OpenStackCluster.Status.Network is nil, which occurs in Hosted Control Plane (HCP) scenarios. The controller now checks for nil before accessing the cluster network and returns a terminal error instead of panicking when no network can be determined. Also adds an HCP E2E test suite.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #2380

TODOs:

  • squashed commits
  • if necessary:
    • includes documentation
    • adds unit tests

/hold

bnallapeta and others added 9 commits July 7, 2025 16:35
Signed-off-by: Bharath Nallapeta <[email protected]>
Signed-off-by: Bharath Nallapeta <[email protected]>
Signed-off-by: Bharath Nallapeta <[email protected]>
…e/data/ccm/cloud-controller-manager.yaml to use cloud.conf with [Global] and [LoadBalancer] sections, addressing "expected section header" error.

- Modified test/e2e/suites/hcp/hcp_helpers.go to align with HCP test setup.
- Updated Makefile to support HCP test execution.
- Ensured control plane provider is named "kubeadm" to avoid "invalid config: control-plane-provider should be named kubeadm" error for Kamaji (v0.15.3).
Signed-off-by: Bharath Nallapeta <[email protected]>
@k8s-ci-robot k8s-ci-robot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Aug 3, 2025
@k8s-ci-robot k8s-ci-robot requested a review from lentzi90 August 3, 2025 05:22

netlify bot commented Aug 3, 2025

Deploy Preview for kubernetes-sigs-cluster-api-openstack ready!

Name Link
🔨 Latest commit 0ab4e3c
🔍 Latest deploy log https://app.netlify.com/projects/kubernetes-sigs-cluster-api-openstack/deploys/6890c78c90ddb2000887d121
😎 Deploy Preview https://deploy-preview-2635--kubernetes-sigs-cluster-api-openstack.netlify.app

@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please assign justinsb for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot requested a review from mdbooth August 3, 2025 05:22
@bnallapeta bnallapeta marked this pull request as draft August 3, 2025 05:22

linux-foundation-easycla bot commented Aug 3, 2025

CLA Missing ID CLA Not Signed

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. cncf-cla: no Indicates the PR's author has not signed the CNCF CLA. labels Aug 3, 2025
@k8s-ci-robot
Contributor

PR needs rebase.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 3, 2025
@k8s-ci-robot
Contributor

Hi @bnallapeta. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


@k8s-ci-robot k8s-ci-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Aug 3, 2025
@bnallapeta
Contributor Author

@EmilienM @mdbooth @lentzi90

Marking it as Draft as we are working on the e2e tests. This is the approach being taken for e2e. Let me know your thoughts.

  1. Spin up a kind cluster and turn it into a management cluster by installing CAPI/CAPO on it.
  2. Use this kind cluster to deploy a k8s cluster on OpenStack. Let's call this Cluster A.
  3. Turn Cluster A into a management cluster by installing CAPI/CAPO on it.
  4. Use Cluster A to deploy a k8s cluster with a hosted control plane via the Kamaji provider. Let's call this Cluster B.
  5. Run the actual e2e tests on Cluster B to exercise the panic case.

Right now we are somewhere between steps 3 and 4 and are facing the following challenges:

First, when we specify Kamaji as a control plane provider in e2e_conf.yaml and name it kamaji, config validation fails with this error:

[FAILED] The e2e test config file is not valid
Expected success, but got an error:
    <*errors.fundamental | 0xc000c00228>:
    invalid argument: invalid config: control-plane-provider should be named kubeadm

Second, if we try to work around this by naming the Kamaji provider kubeadm, Kamaji itself does not get installed, and later on the test fails because the Kamaji CRDs cannot be found:

[FAILED] in [It] - /root/cluster-api-provider-openstack/test/e2e/suites/hcp/hcp_helpers.go:148 @ 07/31/25 11:01:25.32
< Exit [It] should create and manage HCP-capable cluster @ 07/31/25 11:01:25.32 (7m12.465s)
<< Timeline

[FAILED] Timed out after 180.000s.
Expected success, but got an error:
    <*apiutil.ErrResourceDiscoveryFailed | 0xc000123428>:
    unable to retrieve the complete list of server APIs: kamaji.clastix.io/v1alpha1: no matches for kamaji.clastix.io/v1alpha1, Resource=
    {
        {
            Group: "kamaji.clastix.io",
            Version: "v1alpha1",
        }: <*meta.NoResourceMatchError | 0xc0012143c0>{
            PartialResource: {
                Group: "kamaji.clastix.io",
                Version: "v1alpha1",
                Resource: "",
            },
        },
    }
In [It] at: /root/cluster-api-provider-openstack/test/e2e/suites/hcp/hcp_helpers.go:148 @ 07/31/25 11:01:25.32

Would really appreciate your help on this to move forward.

cc @orkhan-os

Contributor

@mdbooth mdbooth left a comment

Appreciate this is just a draft, but I had a quick look over it anyway.

Comment on lines 588 to 596
var defaultNetworkID string
if openStackCluster.Status.Network != nil {
	defaultNetworkID = openStackCluster.Status.Network.ID
}

// If no cluster network is available AND the machine spec did not define any ports with a network, we cannot choose a network.
if defaultNetworkID == "" && len(openStackMachine.Spec.Ports) == 0 {
	return nil, capoerrors.Terminal(infrav1.InvalidMachineSpecReason, "no network configured: cluster network is missing and machine spec does not define ports with a network")
}
Contributor

nit: I feel like this splits this logic across this function and openStackMachineSpecToOpenStackServerSpec. Did you consider putting this logic in openStackMachineSpecToOpenStackServerSpec and modifying its signature to return an error?

Contributor Author

Done.

Comment on lines 213 to 216
defaultNetID := ""
if openStackCluster.Status.Network != nil {
	defaultNetID = openStackCluster.Status.Network.ID
}
Contributor

More evidence of what I was saying above: this duplicates part of the functionality in the test. If we did this in openStackMachineSpecToOpenStackServerSpec we could just test it.
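
As a rough illustration of testing the logic directly instead of re-deriving the default network ID in the test, here is a table-driven sketch. The helper `resolveDefaultNetworkID` and the types are hypothetical simplifications; in the repository this would presumably be a `testing`-based table test against the real conversion function.

```go
package main

import (
	"errors"
	"fmt"
)

// network is a simplified stand-in for the cluster network status.
type network struct{ ID string }

// resolveDefaultNetworkID is a hypothetical helper that mirrors the nil
// check and validation discussed in this thread.
func resolveDefaultNetworkID(clusterNetwork *network, machinePorts int) (string, error) {
	id := ""
	if clusterNetwork != nil {
		id = clusterNetwork.ID
	}
	if id == "" && machinePorts == 0 {
		return "", errors.New("no network configured")
	}
	return id, nil
}

type testCase struct {
	name           string
	clusterNetwork *network
	machinePorts   int
	wantID         string
	wantErr        bool
}

// cases lists the input combinations the production code must handle.
func cases() []testCase {
	return []testCase{
		{"cluster network set, no ports", &network{ID: "net-1"}, 0, "net-1", false},
		{"nil cluster network, ports defined", nil, 1, "", false},
		{"nil cluster network, no ports", nil, 0, "", true},
		{"empty network ID, no ports", &network{}, 0, "", true},
	}
}

// runCases calls the helper for every case and returns the names of the
// cases whose result does not match the expectation.
func runCases() []string {
	var failed []string
	for _, tc := range cases() {
		id, err := resolveDefaultNetworkID(tc.clusterNetwork, tc.machinePorts)
		if (err != nil) != tc.wantErr || id != tc.wantID {
			failed = append(failed, tc.name)
		}
	}
	return failed
}

func main() {
	fmt.Println("failed cases:", len(runCases()))
}
```

The test only states inputs and expected outcomes; the nil handling itself stays in the function under test.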

Contributor Author

Done.

Comment on lines +568 to +572
// Handle HTTP 409 (SecurityGroupRuleExists) as success - the rule already exists
if strings.Contains(err.Error(), "SecurityGroupRuleExists") || strings.Contains(err.Error(), "already exists") {
	s.scope.Logger().V(4).Info("Security group rule already exists, treating as success", "description", r.Description, "securityGroupID", securityGroupID)
	return nil
}
Contributor

This looks unrelated? Separate PR, perhaps?

Contributor Author

I came across this issue while I was testing the PR. It won't work without this change.

But sure, I can open another one and then rebase this later on.

Labels
  • cncf-cla: no: Indicates the PR's author has not signed the CNCF CLA.
  • do-not-merge/hold: Indicates that a PR should not merge because someone has issued a /hold command.
  • do-not-merge/work-in-progress: Indicates that a PR should not merge because it is a work in progress.
  • needs-ok-to-test: Indicates a PR that requires an org member to verify it is safe to test.
  • needs-rebase: Indicates a PR cannot be merged because it has merge conflicts with HEAD.
  • size/XXL: Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
Status: Inbox
Development

Successfully merging this pull request may close these issues.

Panic in OpenStackMachineReconciler if OpenStackCluster.Status.Network is nil (Hosted Control Plane scenario)
3 participants