✨ Bring your own network #1472

Open
wants to merge 10 commits into
base: main

johannesfrey
Contributor

@johannesfrey johannesfrey commented Aug 31, 2024

What this PR does / why we need it:
This PR makes it possible to "adopt" a pre-existing network by passing its ID to hetznerCluster.spec.hcloudNetwork.id instead of having the network created during cluster creation. Furthermore, during cluster deletion, the attached network is only deleted if it has the owned label attached to it (currently the only way here to discriminate between a CAPH-managed network and an unmanaged one).
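
The managed/unmanaged distinction described here can be sketched as a plain label lookup. This is only an illustration: the label key format `caph-cluster-<name>`, the value `owned`, and the helper names are assumptions, not taken from the PR's diff.

```go
package main

import "fmt"

// ownedValue is an assumed label value marking a provider-created network.
const ownedValue = "owned"

// clusterLabelKey builds a hypothetical per-cluster label key.
func clusterLabelKey(clusterName string) string {
	return "caph-cluster-" + clusterName
}

// isManaged reports whether the network labels mark the network as
// created by the provider for the given cluster. A network without the
// label is treated as unmanaged (brought by the user).
func isManaged(labels map[string]string, clusterName string) bool {
	return labels[clusterLabelKey(clusterName)] == ownedValue
}

func main() {
	created := map[string]string{"caph-cluster-demo": "owned"}
	adopted := map[string]string{"team": "platform"}

	fmt.Println(isManaged(created, "demo"))
	fmt.Println(isManaged(adopted, "demo"))
}
```

Because a missing map key yields the empty string in Go, a network with no labels at all is classified as unmanaged without a separate nil check.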

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #762

Special notes for your reviewer:
This has been lingering untouched on my fork for a while, and I decided to rebase it onto the current main branch. Please consider this a first attempt to approach this topic as a whole. I also already tried to add some unit tests. I guess it might also require some e2e tests!? I have no idea whether this is the desired way to do this, or whether there are side effects I did not see. So I'm looking forward to feedback or any pointers. Also, feel free to push changes to the PR, as I'll be pretty occupied with other things for almost all of September. I just wanted to push this out there already for you to take a look at 🙂

The most controversial changes so far:

  • Making hcloudNetwork.id mutually exclusive with cidrBlock, subnetCidrBlock and networkZone
  • Replacing kubebuilder defaulting/validation special tags with custom defaulting/validation, which makes the webhook more complex
  • Changing cidrBlock, subnetCidrBlock and networkZone to be pointers (I guess this could also be done with empty strings, but pointers allow the fields to be omitted entirely when not provided)
  • Labels are shown in the NetworkStatus in order to check if it's managed or unmanaged
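
The mutual exclusivity and the move to pointer fields listed above could look roughly like the following. This is a minimal sketch, not the PR's webhook code: `NetworkSpec`, `validateNetworkSpec`, `defaultNetworkSpec`, and the default values are hypothetical stand-ins.

```go
package main

import (
	"errors"
	"fmt"
)

// NetworkSpec is a simplified, hypothetical stand-in for the hcloudNetwork
// spec. Pointer fields distinguish "not provided" (nil) from "set".
type NetworkSpec struct {
	ID              *int64
	CIDRBlock       *string
	SubnetCIDRBlock *string
	NetworkZone     *string
}

var errIDExclusive = errors.New(
	"id is mutually exclusive with cidrBlock, subnetCidrBlock and networkZone")

// validateNetworkSpec rejects specs that adopt an existing network by ID
// while also setting fields that only make sense when creating one.
func validateNetworkSpec(s NetworkSpec) error {
	if s.ID != nil && (s.CIDRBlock != nil || s.SubnetCIDRBlock != nil || s.NetworkZone != nil) {
		return errIDExclusive
	}
	return nil
}

// defaultNetworkSpec fills creation defaults only when no network is
// adopted. The values below are placeholders chosen for illustration.
func defaultNetworkSpec(s *NetworkSpec) {
	if s.ID != nil {
		return // adopted network: leave creation fields unset
	}
	setDefault := func(field **string, value string) {
		if *field == nil {
			*field = &value
		}
	}
	setDefault(&s.CIDRBlock, "10.0.0.0/16")
	setDefault(&s.SubnetCIDRBlock, "10.0.0.0/24")
	setDefault(&s.NetworkZone, "eu-central")
}

func main() {
	id := int64(42)
	cidr := "10.0.0.0/16"

	fmt.Println(validateNetworkSpec(NetworkSpec{ID: &id}))
	fmt.Println(validateNetworkSpec(NetworkSpec{ID: &id, CIDRBlock: &cidr}))

	created := NetworkSpec{}
	defaultNetworkSpec(&created)
	fmt.Println(*created.CIDRBlock, *created.SubnetCIDRBlock, *created.NetworkZone)
}
```

Defaulting has to move into code in this design because a static default tag would populate cidrBlock even when an ID is given, which would then trip the exclusivity check.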

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

TODOs:

  • squash commits
  • include documentation
  • add unit tests

Contributor

@janiskemper janiskemper left a comment

In general this approach seems good to me. Thanks a lot for this contribution!

@guettli @batistein what's your opinion?

Contributor

@janiskemper janiskemper left a comment

Thanks a lot again for this PR @johannesfrey. I think we can merge it if you follow the suggestions I gave. It's really good work!

@johannesfrey johannesfrey force-pushed the bring-your-own-network branch from 3f6b8f6 to b27b859 Compare November 13, 2024 17:07
@johannesfrey johannesfrey marked this pull request as ready for review November 13, 2024 17:08
@johannesfrey
Contributor Author

johannesfrey commented Nov 13, 2024

> Thanks a lot again for this PR @johannesfrey. I think we can merge it if you follow the suggestions I gave. It's really good work!

Sorry for the long delay 🙏. Thx for the reviews! I hope I addressed your suggestions correctly. PTAL. Thx!

@syself-bot syself-bot bot added area/test Changes made in the test directory area/code Changes made in the code directory area/api Changes made in the api directory size/L Denotes a PR that changes 200-800 lines, ignoring generated files. labels Nov 13, 2024
Contributor

@janiskemper janiskemper left a comment

Thanks @johannesfrey! I went over the details another time and found a few things.

@syself-bot syself-bot bot added size/XL Denotes a PR that changes 800-2000 lines, ignoring generated files. and removed size/L Denotes a PR that changes 200-800 lines, ignoring generated files. labels Nov 24, 2024
@johannesfrey johannesfrey force-pushed the bring-your-own-network branch from bdf6ca4 to 4806de2 Compare November 24, 2024 19:03
@janiskemper
Contributor

thanks @johannesfrey! Do you think anything is missing right now? If not, I'd propose the following path:

  • I will do a final review
  • You squash and rebase
  • Someone from our team will again do a functionality test (which I haven't done at all) that the actual behavior is as expected

@johannesfrey
Contributor Author

johannesfrey commented Nov 26, 2024

> thanks @johannesfrey! Do you think anything is missing right now? If not, I'd propose the following path:
>
> * I will do a final review
> * You squash and rebase
> * Someone from our team will again do a functionality test (which I haven't done at all) that the actual behavior is as expected

That sounds awesome. Thx!
One thing that would be great to take a look at is my (probably too naive) way of testing the feature in controllers/hetznercluster_controller_test.go. There seems to be a data race, especially in the first test, where the cluster should attach to the previously created network. Most of the time it succeeds, but there are cases where the reconciler errors because the requested network cannot be found, even though its id is passed in. I'm struggling to see why this happens, because the line of execution should look like this (ginkgo running serially):
create network with fake client -> check that there is no error -> add the id to the hetznercluster spec -> trigger a create of the hetznercluster -> wait until the hetznercluster is ready and has the correct network condition and status

The reconciler should then use its internal client (which should be the shared fake one from before) to find the network. But it cannot find it, so either something along the way deleted it, or there is a race when using the fake client, possibly in its usage of mutexes?! I tried some variations of changing the locks in there, but to no avail, so I wasn't able to really deflake the test. It would be really cool if you could take another look there. And if the test does more harm than good, we could also think about removing/changing it. WDYT?

@janiskemper
Contributor

mmh that's an important observation, thanks @johannesfrey. We will have a look. I'm not able to see anything in the code right now.

@bitnik

bitnik commented Feb 5, 2025

Hello, what is the plan here? Is this PR still considered?

@batistein
Contributor

we just moved this PR into testing.

@tcldr

tcldr commented Mar 27, 2025

Looking forward to trying this out. I'm planning to use an existing private network with a NAT Gateway/Bastion Host.

One big advantage of a private network is that it doesn't incur overage fees when you take the 10Gbps option.

@batistein
Contributor

@tcldr this implementation only covers hcloud, not vswitch. Also, internal routing, even via public IPs, does not incur any costs.

@tcldr

tcldr commented Mar 31, 2025

@batistein good to know, thanks.

Is there something in particular blocking vswitch-enabled networks and subnets?

@batistein
Contributor

They are basically unstable, which is why we removed support for private networks for all our customers 3 years ago. The instability on Hetzner's side has not been resolved since then, so we never invested time in supporting them via CAPH. At Syself, we switched to a zero-trust architecture, which also aligns better with our future plans. See: https://syself.com/docs/hetzner/apalla/platform/zero-trust

@tcldr

tcldr commented Mar 31, 2025

> They are basically unstable, which is why we removed support for private networks for all our customers 3 years ago.

That's concerning, I'm only just experimenting with this feature now. Do you have particular examples? Would you be open to PRs that add support, so that CAPH provides the option for those who are self-managing?

@batistein
Contributor

Yes, of course we are open to PRs. These need to be E2E tested, and the logic needs to be separated so it doesn't affect the current code.
But believe me, this is wasted time. I spoke to a customer yesterday who is migrating to Syself with hundreds of servers, and they have network outages on a daily basis because they are using vswitch in their self-managed Kubernetes environment... I don't think this is just a coincidence; without vswitch and a private network there is no noticeable problem at all. Also, the server limitations of the Hetzner network make it unsuitable for enterprise use. I've seen the team do the craziest things with connecting multiple networks to get around the limitations, ending up spending months on a topology only to realise it doesn't work.
If you're worried about privacy, use mTLS and strict network policies.

@bitnik

bitnik commented Apr 1, 2025

We were also waiting for this PR, but meanwhile we tried out the Hetzner private network and experienced issues similar to those @batistein described. So we decided not to use it either and to go the zero-trust way.

@tcldr

tcldr commented Apr 2, 2025

Thanks for the heads-up @batistein and @bitnik ! Lots to consider there. Appreciate you both taking the time to share your thoughts.

@lkt82

lkt82 commented May 9, 2025

Any news on this PR?

@johannesfrey
Contributor Author

johannesfrey commented May 11, 2025

@batistein Thx for all the context around your past experiences with private networks.

At the time I initially created this PR, I was not aware that you had already dropped support for those 3 years ago (at least for the customers you have been interacting with). So I'm a bit reluctant to "burden" you with even more private network functionality. If the recommended way is zero-trust, I would be totally fine with closing this PR. So I guess we would "just" need a decision on how to proceed here (with the option to "close and forget" being totally fine from my side 😉). And I guess this would also relieve all the others waiting for this to eventually be merged.

Also @janiskemper thx so much for your effort reviewing all of this.

@janiskemper
Contributor

@johannesfrey IMO everyone should decide that on their own. We don't use Hetzner's private networks and have reasons not to. However, others might come to different conclusions. I'd very much like to get this merged. We are very busy internally right now, though, which is why the testing hasn't happened yet. I recognize the interest in this and have put it on our agenda. It will still take some time, I'm afraid. However, that's a good thing, because this way we will be able to ensure the quality of CAPH, even if it is a bit slower.

Labels
area/api Changes made in the api directory area/code Changes made in the code directory area/test Changes made in the test directory size/XL Denotes a PR that changes 800-2000 lines, ignoring generated files.
Development

Successfully merging this pull request may close these issues.

Make it possible to use a pre-created private network
7 participants