Add troubleshooting of guest cluster LB IP is not reachable #909
base: main
Conversation
Force-pushed e3df4c3 to da9a827
Signed-off-by: Jian Wang <[email protected]>
ihcsim
left a comment
Content LGTM.
Modifying the `IPAM` mode isn't allowed. You must create a new service if you intend to change the `IPAM` mode.
- Modifying the `IPAM` mode isn't allowed. You must create a new service if you intend to change the `IPAM` mode.
- Refer to [Guest Cluster Loadbalancer IP is not reachable](../troubleshooting/rancher.md#guest-cluster-loadbalancer-ip-is-not-reachable).
Can we state which version is affected by this? AIUI, it's cloud provider 107.0.1+up0.2.10 on Rancher 2.12 where kube-vip < v0.9.1, right?
As far as I know this has been seen definitely on:
- harvester-cloud-provider:0.2.1000
- harvester-cloud-provider:0.2.1100
x-ref:
It may, and I stress "may", have also briefly presented itself on an older version, possibly 0.2.900, but I can't find breadcrumbs at the moment that would lead me to a definitive "yes" on that.
The bug in Calico is causing the issue. I will check the Calico code and mention the affected guest cluster versions, such as RKE2 v1.33.5+rke2r1.
martindekov
left a comment
I don't have much context, so I have no intuition on what is what; I think reviewing with that lens is helpful, as if I don't know what is going on and am trying to fix it. I added comments mainly around gray areas where I was asking myself: where should we run this command? Where would this pop up so I can edit it? That is, in places where it wasn't clear to me.
Overall LGTM though, thanks for the work Jian! Will do a follow up once you respond.
### Root Cause
In the below example, the guest cluster node (Harvester VM)'s IP is `10.115.1.46`, and later a new Loadbalancer IP `10.115.6.200` is added to a new interface like `vip-fd8c28ce (@enp1s0)`. However, the Loadbalancer IP is taken over by the `calico` controller, which causes the Loadbalancer IP to be unreachable.
At the end of it, for the command below where we run `ip -d link show dev vxlan.calico`, can we explain in the last sentence from what context the command should be executed? The Harvester VM IP is 10.115.1.46; do you get a shell session inside it before running `ip`? The Loadbalancer IP is unreachable, so I'd suggest elaborating:
Suggested change:
- In the below example, the guest cluster node (Harvester VM)'s IP is `10.115.1.46`, and later a new Loadbalancer IP `10.115.6.200` is added to a new interface like `vip-fd8c28ce (@enp1s0)`. However, the Loadbalancer IP is taken over by the `calico` controller, which causes the Loadbalancer IP to be unreachable.
+ In the below example, the guest cluster node (Harvester VM)'s IP is `10.115.1.46`, and later a new Loadbalancer IP `10.115.6.200` is added to a new interface like `vip-fd8c28ce (@enp1s0)`. However, the Loadbalancer IP is taken over by the `calico` controller, which causes the Loadbalancer IP to be unreachable. Through a shell session using the original IP, run the following.
This might not be right, but stating the context from which `ip` would be run would make things clear.
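For illustration, here is a minimal sketch of how that context could look, assuming SSH access to the guest cluster node via its original node IP; the `rancher` user name is an assumption, and the remaining commands run on the node itself:

```bash
# Hypothetical sketch: open a shell on the guest cluster node (Harvester VM)
# via its original node IP; the "rancher" user is an assumption.
ssh rancher@10.115.1.46

# On the node, list addresses; the Loadbalancer IP should show up on a vip-* interface.
ip addr show | grep -A 2 'vip-'

# Inspect the Calico VXLAN device. If Calico auto-detected the vip interface,
# the "local" address printed here is the Loadbalancer IP instead of the node IP.
ip -d link show dev vxlan.calico
```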
For existing clusters, run the command `$ kubectl edit installation`, go to `.spec.calicoNetwork.nodeAddressAutodetectionV4`, remove any existing line like `firstFound: true`, add the new line `skipInterface: vip.*`, and save.
Wait a while; the daemonset `calico-system/calico-node` is rolling-updated, and then the related Pods take the node IP for VXLAN use.
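A minimal sketch of that edit, run with the guest cluster's kubeconfig; the `Installation` resource name `default` is an assumption (check with `kubectl get installation`):

```bash
# Hypothetical sketch: edit the Calico Installation resource on the guest cluster.
kubectl edit installation default

# In the editor, the relevant section would end up looking roughly like:
#
#   spec:
#     calicoNetwork:
#       nodeAddressAutodetectionV4:
#         skipInterface: vip.*
#
# After saving, wait for the calico-node daemonset to finish rolling out.
kubectl rollout status daemonset calico-node -n calico-system
```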
Similar to the above, from what context would I run the `ip` command below? A small sentence at the end would be helpful.
The Loadbalancer IP is reachable again.
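A quick, hedged way to verify that, using the example addresses above (adjust port and protocol to whatever the Service actually exposes):

```bash
# Hypothetical verification sketch; run from any machine that can reach the LB network.
ping -c 3 10.115.6.200
curl -kv https://10.115.6.200

# On the guest cluster node, confirm the VXLAN device is bound to the node IP
# (10.115.1.46) again rather than the Loadbalancer IP.
ip -d link show dev vxlan.calico
```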
When creating new clusters on `Rancher Manager`, click **Add-on: Calico** and add the following two lines to `.installation.calicoNetwork`. The `calico` controller won't take over the Loadbalancer IP accidentally.
Would a YAML window pop up when clicking **Add-on: Calico**, in which we would edit the suggestion below?
Suggested change:
- When creating new clusters on `Rancher Manager`, click **Add-on: Calico** and add the following two lines to `.installation.calicoNetwork`. The `calico` controller won't take over the Loadbalancer IP accidentally.
+ When creating new clusters on `Rancher Manager`, click **Add-on: Calico**; a YAML configuration window will appear. Add the following two lines to `.installation.calicoNetwork`. The `calico` controller won't take over the Loadbalancer IP accidentally.
Not sure whether a YAML config window will appear or whether we edit the object through the kubectl client against Kubernetes, so adding a short sentence would be helpful IMHO.
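For reference, a hedged guess at what those two lines look like, assuming they mirror the `skipInterface` workaround for existing clusters above; in Rancher Manager the snippet would go under `.installation.calicoNetwork` in the Add-on: Calico YAML:

```bash
# Hypothetical sketch only; the two added lines are assumed to be the
# nodeAddressAutodetectionV4 / skipInterface pair shown in the docs change.
cat <<'EOF'
installation:
  calicoNetwork:
    nodeAddressAutodetectionV4:
      skipInterface: vip.*
EOF
```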
Having a similar issue here, to add another use case for study: harvester/harvester#9479. I can't get the LB on Harvester to assign a different CIDR, Virtual Machine Network, or VLAN. It always defaults to the Cluster Management network.
Problem:
The LB on a guest cluster sometimes does not work as expected.
Solution:
Document the related root cause and workaround.
Related Issue(s):
harvester/harvester#8072
Test plan:
Additional documentation or context
After the review on main is done, I will copy the changes to the v1.5 and v1.6 branches to save reviewing and updating time. Thanks.