## Summary
We are self-hosting KubeFleet and have encountered critical issues affecting pod
creation and upgrade reliability. As an organization in a regulated industry,
these issues hinder our ability to meet operational requirements. Below, we
outline the challenges and propose potential solutions.
## Issues and Observations
### 1. Pod Creation Blocked Outside `fleet-*` and `kube-*` Namespaces
- The `ValidatingWebhookConfiguration` blocks the creation of pods in namespaces
  other than `fleet-*` and `kube-*`.
- As an organization in a regulated industry, we require agents and other critical
  workloads to run in custom namespaces, but the webhook prevents this.
- Manually patching the webhook (e.g., setting `failurePolicy: Ignore`) is required
  to unblock pod creation, which is not sustainable.

As a workaround, we have resorted to creating more namespaces with `fleet-*` and
`kube-*` prefixes. However, this approach is not scalable and introduces
operational complexity. For instance, `ClusterResourcePlacement` (CRP) objects fail
to be created in the hub namespace (e.g., `hub-namespace`) but are successfully
created in member namespaces (e.g., `member-namespace-1`). Alternatively, if we use
a `fleet-*` namespace, the CRP creation is rejected with an error. This
inconsistency leads to:

- Increased namespace sprawl, making cluster management more challenging.
- Potential conflicts with existing naming conventions and policies.
- Difficulty in maintaining compliance with regulatory requirements.
What do you propose as a sustainable solution to address this limitation, and what
are the potential implications of continuing with this workaround?
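
As one possible direction for the namespace limitation above (a sketch only, not
current KubeFleet behavior — the webhook name, service name, and label key are all
hypothetical), the webhook could scope itself with a `namespaceSelector` instead of
hard-coded name prefixes:

```yaml
# Sketch: exempt namespaces by label rather than by name prefix.
# Namespaces carrying the (hypothetical) exemption label are skipped,
# so workloads in custom namespaces are not blocked.
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingWebhookConfiguration
metadata:
  name: fleet-guard-rail            # hypothetical name
webhooks:
  - name: pod.validate.example.com  # hypothetical name
    namespaceSelector:
      matchExpressions:
        - key: fleet.example.com/webhook-exempt  # hypothetical label key
          operator: NotIn
          values: ["true"]
    rules:
      - apiGroups: [""]
        apiVersions: ["v1"]
        operations: ["CREATE"]
        resources: ["pods"]
    clientConfig:
      service:
        name: hub-agent-webhook     # hypothetical service name
        namespace: fleet-system
        path: /validate-pod
    admissionReviewVersions: ["v1"]
    sideEffects: None
    failurePolicy: Fail
```

Namespaces that need custom workloads would then be labeled once, e.g.
`kubectl label namespace my-agents fleet.example.com/webhook-exempt=true`,
instead of being renamed to a `fleet-*` or `kube-*` prefix.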
### 2. Upgrade Failures for Node Image and Kubernetes Upgrades
- During node image and Kubernetes upgrades, we observe failures because the
  hub-agent and other pods cannot come up due to the
  `ValidatingWebhookConfiguration`.
- We tried the following two alternatives:
**Webhook Configuration:**

- The `ValidatingWebhookConfiguration` must be manually patched after every
  upgrade or node pool eviction to unblock pod creation.
- Manual changes are lost because the controller deletes and recreates the webhook
  configuration using a delete-and-create pattern:

  ```go
  if err := w.mgr.GetClient().Create(ctx, &validatingWebhookConfig); err != nil {
      if !apierrors.IsAlreadyExists(err) {
          return err
      }
      // Overwrite: delete and recreate
      if err := w.mgr.GetClient().Delete(ctx, &validatingWebhookConfig); err != nil {
          return err
      }
      if err = w.mgr.GetClient().Create(ctx, &validatingWebhookConfig); err != nil {
          return err
      }
  }
  ```
**Multi-Replica Hub-Agent:**

- To improve high availability, we added a rolling update strategy, set replicas
  to 5, and configured a PodDisruptionBudget.
- Certificates are hardcoded in the webhook server setup, as seen in
  `cmd/hubagent/main.go`, where `FleetWebhookCertDir` and `FleetWebhookPort` are
  used without dynamic synchronization.
- Each replica must use the same TLS certificate stored in `FleetWebhookCertDir`.
  If the certificates are not synchronized across replicas, the API server might
  receive different certificates from different replicas, leading to TLS handshake
  failures.
- When multiple replicas sit behind a Kubernetes `Service`, the API server's
  requests are load-balanced across them. If the replicas do not share the same
  certificate, the API server may encounter inconsistent certificates on
  successive requests.
- The API server logs errors like `x509: certificate signed by unknown authority`.
  This disrupts the admission process, potentially causing delays or failures in
  pod creation.
- The `caBundle` field in the `ValidatingWebhookConfiguration` is critical for
  ensuring the API server trusts the webhook service. For more details, refer to
  the Kubernetes documentation on Admission Webhooks.
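
One way to keep the certificate consistent across replicas (a sketch under assumed
names — the image, Secret name, and mount path are hypothetical; the mount path is
the controller-runtime default and is assumed to match `FleetWebhookCertDir`) is to
store the serving certificate in a single Kubernetes Secret and mount it into every
replica:

```yaml
# Sketch: all hub-agent replicas mount the same TLS Secret, so the API
# server sees one consistent serving certificate regardless of which
# replica answers an admission request.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: hub-agent
  namespace: fleet-system
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
  selector:
    matchLabels:
      app: hub-agent
  template:
    metadata:
      labels:
        app: hub-agent
    spec:
      containers:
        - name: hub-agent
          image: example.com/hub-agent:tag   # hypothetical image
          volumeMounts:
            - name: webhook-certs
              mountPath: /tmp/k8s-webhook-server/serving-certs  # assumed FleetWebhookCertDir
              readOnly: true
      volumes:
        - name: webhook-certs
          secret:
            secretName: hub-agent-webhook-certs  # single shared Secret
```

With one shared Secret as the source of truth, the `caBundle` in the webhook
configuration only needs to trust a single certificate chain, which avoids the
`x509: certificate signed by unknown authority` mismatches between replicas.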
Proposed Solutions
-
Persistent Webhook Configuration:
- Support persistent configuration for webhook fields (e.g.,
failurePolicy, ) to
avoid manual patching. - Can other namespaces be ignored to simplify webhook configuration and reduce
operational overhead?
- Support persistent configuration for webhook fields (e.g.,
- **Multi-Replica Hub-Agent:**
  - Address TLS handshake errors by using a shared certificate source (e.g.,
    Kubernetes Secrets) to prevent mismatches.
  - Wire the `replicas` and `strategy` fields into the hub-agent deployment and
    add PodDisruptionBudgets for upgrade reliability.
- **Namespace Flexibility:**
  - Allow pod creation in custom namespaces by updating the webhook configuration
    to support broader namespace scopes.
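
The PodDisruptionBudget requested above could look like the following minimal
sketch (the name and pod label are assumed, not taken from the chart):

```yaml
# Sketch: with replicas=5, allow at most one hub-agent replica to be
# evicted at a time during voluntary disruptions such as node image
# upgrades, so the webhook endpoint stays reachable.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: hub-agent-pdb     # hypothetical name
  namespace: fleet-system
spec:
  maxUnavailable: 1
  selector:
    matchLabels:
      app: hub-agent      # assumed pod label
```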
## Request and Questions
- Can you assist in resolving the TLS issues with multiple replicas to ensure
  upgrades do not fail?
- Would you be open to supporting persistent webhook configuration and namespace
  flexibility?
