Conversation

**@w13915984028** (Member) commented Oct 28, 2025

Problem:

The LB on a guest cluster sometimes does not work as expected.

Solution:

Document the root cause and the workaround.

Related Issue(s):

harvester/harvester#8072

Test plan:

Additional documentation or context

After the review on main is done, I will copy the changes to the v1.5 and v1.6 branches to save review and update time, thanks.

**@github-actions** (bot) commented

🔨 Latest commit: c28b38c
😎 Deploy Preview: https://6900edf345c8fa1134eb1777--harvester-preview.netlify.app

**@ihcsim** (Contributor) left a comment

Content LGTM.

Suggested change
Modifying the `IPAM` mode isn't allowed. You must create a new service if you intend to change the `IPAM` mode.
- Modifying the `IPAM` mode isn't allowed. You must create a new service if you intend to change the `IPAM` mode.
- Refer to [Guest Cluster Loadbalancer IP is not reachable](../troubleshooting/rancher.md#guest-cluster-loadbalancer-ip-is-not-reachable).
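To make the note concrete, here is a minimal sketch of a replacement `LoadBalancer` service with a different IPAM mode. The `cloudprovider.harvesterhci.io/ipam` annotation follows the Harvester cloud provider docs; the service name, selector, and ports are placeholders:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-app-lb-dhcp            # hypothetical replacement service
  annotations:
    # The IPAM mode is fixed at creation time; to switch modes (e.g. pool -> dhcp),
    # create a new service like this rather than editing the existing one.
    cloudprovider.harvesterhci.io/ipam: dhcp
spec:
  type: LoadBalancer
  selector:
    app: my-app                   # placeholder selector
  ports:
    - port: 80
      targetPort: 8080
```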
A contributor commented:
Can we state which version is affected by this? AIUI, it's cloud provider 107.0.1+up0.2.10 on Rancher 2.12 where kube-vip < v0.9.1, right?

**@irishgordo** (Contributor) commented Oct 28, 2025

As far as I know, this has definitely been seen on:

- harvester-cloud-provider:0.2.1000
- harvester-cloud-provider:0.2.1100

x-ref:

It may... "may"... have also briefly presented itself on an older version, maybe 0.2.900, but I can't find breadcrumbs at the moment that would lead me to a definitive "yes" on that.

**@w13915984028** (Member, Author) commented:

The bug in Calico is causing this issue. I will check the Calico code to identify the affected guest cluster versions, like RKE2 v1.33.5+rke2r1.

**@martindekov** (Contributor) left a comment

I don't have much context, so no intuition about what is what. I think reviewing with that lens is helpful: as if I don't know what is going on and try to fix it. I added comments mainly around the gray areas where it wasn't clear to me and I was asking myself: where should we run the command? Where would this pop up so I can edit it?

Overall LGTM though, thanks for the work, Jian! Will do a follow-up once you respond.

### Root Cause
In the below example, the guest cluster node's (Harvester VM's) IP is `10.115.1.46`, and later a new Loadbalancer IP `10.115.6.200` is added to a new interface like `vip-fd8c28ce (@enp1s0)`. However, the Loadbalancer IP is taken over by the `calico` controller, which causes the Loadbalancer IP to be unreachable.
A contributor commented:

At the end of it, where we run the command `ip -d link show dev vxlan.calico`: can we explain in the last sentence from what context the command should be executed? The Harvester VM IP is 10.115.1.46; do you get a shell session inside it before running `ip`? The Loadbalancer IP is unreachable, so I'd suggest elaborating:

Suggested change
In the below example, the guest cluster node's (Harvester VM's) IP is `10.115.1.46`, and later a new Loadbalancer IP `10.115.6.200` is added to a new interface like `vip-fd8c28ce (@enp1s0)`. However, the Loadbalancer IP is taken over by the `calico` controller, which causes the Loadbalancer IP to be unreachable.
In the below example, the guest cluster node's (Harvester VM's) IP is `10.115.1.46`, and later a new Loadbalancer IP `10.115.6.200` is added to a new interface like `vip-fd8c28ce (@enp1s0)`. However, the Loadbalancer IP is taken over by the `calico` controller, which causes the Loadbalancer IP to be unreachable. Through a shell session using the original IP, run the following.

This might not be right, but stating the context from which `ip` would be valid would make things clear.
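As an illustration of the missing context, a sketch of how the inspection might be run; it assumes a shell on the guest cluster node itself (e.g. SSH to the VM's IP `10.115.1.46`), and the device names other than `vxlan.calico` come from the example above:

```shell
# On the guest cluster node (Harvester VM), e.g. after `ssh user@10.115.1.46`:
ip addr show dev enp1s0           # the node IP, 10.115.1.46 in the example
ip addr show                      # look for the vip-* interface holding 10.115.6.200
ip -d link show dev vxlan.calico  # shows which local IP calico picked for VXLAN
```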


For existing clusters, run the command `$ kubectl edit installation`, go to `.spec.calicoNetwork.nodeAddressAutodetectionV4`, remove any existing line like `firstFound: true`, add the new line `skipInterface: vip.*`, and save (a sketch of the result follows below).

Wait a while; the daemonset `calico-system/calico-node` is rolling-updated, and the related pods then use the node IP for VXLAN.
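For reference, a minimal sketch of what the `Installation` object might look like after the edit described above; field names follow the Calico operator (`operator.tigera.io/v1`) API, and only the relevant fields are shown:

```yaml
# After `kubectl edit installation` on the guest cluster:
apiVersion: operator.tigera.io/v1
kind: Installation
metadata:
  name: default
spec:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      # was: firstFound: true
      skipInterface: vip.*   # ignore kube-vip's vip-* interfaces
```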
A contributor commented:

Similar to the above: from what context would I run the `ip` command below? A small sentence at the end would be helpful.

The Loadbalancer IP is reachable again.


When creating new clusters on `Rancher Manager`, click **Add-on: Calico** and add the following two lines to `.installation.calicoNetwork`. The `calico` controller then won't take over the Loadbalancer IP accidentally.
A contributor commented:

Would a YAML window pop up when clicking **Add-on: Calico**, in which YAML we edit the below suggestion?

Suggested change
When creating new clusters on `Rancher Manager`, click **Add-on: Calico** and add the following two lines to `.installation.calicoNetwork`. The `calico` controller then won't take over the Loadbalancer IP accidentally.
When creating new clusters on `Rancher Manager`, click **Add-on: Calico**; a YAML configuration window will appear. Add the following two lines to `.installation.calicoNetwork`. The `calico` controller then won't take over the Loadbalancer IP accidentally.

Not sure whether a YAML config window will appear or we edit the object through the kubectl client against the k8s API, so adding a short sentence would be helpful IMHO.
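To make the gray area concrete, a hedged sketch of the two lines as they might appear in the Add-on: Calico values (assuming the editor exposes the chart values with `installation` at the top level, as recent RKE2 Calico charts do):

```yaml
installation:
  calicoNetwork:
    # the two lines the doc refers to:
    nodeAddressAutodetectionV4:
      skipInterface: vip.*
```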

**@derhornspieler** commented Nov 13, 2025

Having a similar issue here, to add another use case for study: harvester/harvester#9479. I can't get the LB on Harvester to assign a different CIDR, Virtual Machine Network, or VLAN; it always defaults to the Cluster Management network.
