[DRAFT LFX] Refactor Webhook Certs to Secrets & Persist Network Allocations via CRD #231
ballista01 wants to merge 1 commit into openkruise:master
feat(network): Introduce NetworkAllocation CRD for port persistence
[APPROVALNOTIFIER] This PR is NOT APPROVED.
Welcome @ballista01! It looks like this is your first PR to openkruise/kruise-game 🎉
@ballista01: PR needs rebase.
@ballista01 Thank you for your contribution. Your proposal is inspiring. However, CRDs are not suitable for network plugins because different cloud providers have different parameters. I appreciate your PR. If you are still interested in OKG, you can work on other issues. Please leave your email and I will contact you.
@chrisliu1995 Hi, thanks for the explanation. Here's my email: ballista01@outlook.com.
Addresses:
Motivation / Problem Statement:
This PR tackles two key challenges to enhance the robustness and scalability of the Kruise Game controller manager:
- In a multi-replica kruise-game-controller-manager deployment, each replica generating its own self-signed certificate can lead to TLS verification errors (x509: certificate signed by unknown authority). This occurs because the Kubernetes API server, when calling the webhook service, might be routed to different pods with different, untrusted CAs.
Proposed Changes:
This PR introduces the following key changes:
Webhook Certificate Management via Kubernetes Secrets:
- Webhook certificate handling moves from the file-based writer (FSCertWriter) to a SecretCertWriter. This new writer ensures that the TLS certificate and private key for the webhook server are generated and stored within a Kubernetes Secret (e.g., webhook-server-cert in the kruise-game-system namespace).
- All replicas of kruise-game-controller-manager will use the exact same TLS certificate. This allows the Kubernetes API server's caBundle (for the webhook configurations) to consistently trust all webhook server instances.
Introduction of NetworkAllocation CRD for Port Persistence:
- A new CRD, NetworkAllocation (game.kruise.io/v1alpha1), has been introduced.
- NetworkAllocationSpec defines the desired allocation, including LbID, Port, Protocol, and a PodRef linking to the Pod that owns the allocation.
- A NetworkAllocation CR instance is created when a port is successfully allocated to a Pod.
- The NetworkAllocation CR is removed when the port is deallocated.
- Allocations can be listed with kubectl get networkallocations.
Key Benefits:
[...] kruise-game-controller-manager.
Limitations and Known Issues (Work in Progress):
Network Allocation Decision Logic (Multi-Replica Safety):
- Even with the NetworkAllocation CRD, the port allocation decision-making logic within the cloud provider plugins (e.g., the in-memory c.cache in jdcloud/nlb.go and tencentcloud/clb.go) is not yet fully multi-replica safe.
NetworkAllocation CRD Scalability:
- The current design creates one NetworkAllocation CR per allocated port. For Pods requiring numerous ports or in large-scale deployments, this could result in a high volume of CRs, potentially impacting API server and etcd performance.
Certificate Lifecycle Management:
- The SecretCertWriter ensures consistent certificate generation and storage in a Secret. However, for full lifecycle management, including automated rotation and integration with trusted CAs (like Let's Encrypt), further integration with a tool like cert-manager would be beneficial.
Future Work and Proposed Next Steps:
- Consult existing NetworkAllocation CRs (e.g., by listing them) to determine port availability before making an allocation.
- NetworkAllocation CRD: Allow multiple ports per NetworkAllocation CR (e.g., by making spec.ports a list) to reduce the overall number of CR instances.
- OwnerReferences: Ensure NetworkAllocation CRs have OwnerReferences set to their respective Pod objects to enable automatic garbage collection when Pods are deleted. (Note: PodRef is in the spec, but explicit metadata.ownerReferences is needed for GC.)
- cert-manager Integration: Move from the SecretCertWriter to leveraging cert-manager for managing the lifecycle of webhook TLS certificates, including automated provisioning and rotation.
- NetworkAllocationStatus: Use NetworkAllocationStatus to reflect the actual state, potential conflicts, or last validation time.
Request for Feedback & Collaboration Offer:
This is a draft PR intended to showcase my understanding of the issues and a potential direction for solutions, particularly in the context of my LFX Mentorship application for OpenKruiseGame.
I have put thought into this approach and its implications. If the community or mentors are interested, I would be very happy to draft a more detailed design document covering the architecture, API considerations, alternative approaches considered, and a long-term roadmap for these features. I'm eager to share this in any relevant working group or community forum to gather broader feedback and refine the solution collaboratively.
I welcome all feedback, critiques, and suggestions on this initial approach. Thank you for your consideration!