
[BUG] Upgrade Failures: Lack of Persistent Webhook and Multi-Replica Support #1224

@Divya1388

Description


Summary

We are self-hosting Kubefleet and have encountered critical issues impacting pod
creation and upgrade reliability. These issues hinder our ability to meet
operational requirements as an organization in a regulated industry. Below, we
outline the challenges and propose potential solutions.

Issues and Observations

1. Pod Creation Blocked Outside fleet-* and kube-* Namespaces

  • The ValidatingWebhookConfiguration blocks the creation of pods in namespaces
    other than fleet-* and kube-*.

  • As an organization in a regulated industry, we require agents and other critical
    workloads to run in custom namespaces, but the webhook prevents this.

  • Manual patching of the webhook (e.g., setting failurePolicy: Ignore) is required
    to unblock pod creation, which is not sustainable.

    As a workaround, we have resorted to creating more namespaces with fleet-* and
    kube-* prefixes. However, this approach is not scalable and introduces
    operational complexity. For instance, ClusterResourcePlacement (CRP) objects fail
    to be created in the hub namespace (e.g., hub-namespace) but are successfully
    created in member namespaces (e.g., member-namespace-1).

    Alternatively, if we use a fleet-* namespace, the CRP throws the following error:

    • (error message shown in the attached screenshot)

    This inconsistency leads to:

    • Increased namespace sprawl, making cluster management more challenging.
    • Potential conflicts with existing naming conventions and policies.
    • Difficulty in maintaining compliance with regulatory requirements.

    What do you propose as a sustainable solution to address this limitation, and what
    are the potential implications of continuing with this workaround?

2. Upgrade Failures for Node Image and Kubernetes Upgrades

  • During node image and Kubernetes upgrades, we observe failures because the
    hub-agent and other pods cannot come up; their admission is blocked by the
    ValidatingWebhookConfiguration.

  • The two alternatives we tried are as follows:

    1. Webhook Configuration:

      • The ValidatingWebhookConfiguration must be manually patched after every
        upgrade or node pool eviction to unblock pod creation.

      • Manual changes are lost because the controller rebuilds the webhook
        configuration with a delete-and-create pattern (a patch-in-place sketch is
        included after this list):

        if err := w.mgr.GetClient().Create(ctx, &validatingWebhookConfig); err != nil {
            if !apierrors.IsAlreadyExists(err) {
                return err
            }
            // Overwrite the existing configuration: delete it, then recreate it.
            if err := w.mgr.GetClient().Delete(ctx, &validatingWebhookConfig); err != nil {
                return err
            }
            if err := w.mgr.GetClient().Create(ctx, &validatingWebhookConfig); err != nil {
                return err
            }
        }
    2. Multi-Replica Hub-Agent:

      • To improve high availability, we added a rolling update strategy, set replicas
        to 5, and configured a PodDisruptionBudget.

      • Certificates are hardcoded in the webhook server setup, as seen in
        cmd/hubagent/main.go,
        where FleetWebhookCertDir and FleetWebhookPort are used without dynamic
        synchronization.

      • Each replica must use the same TLS certificate stored in FleetWebhookCertDir.
        If the certificates are not synchronized across replicas, the API server might
        receive different certificates from different replicas, leading to TLS handshake
        failures.

      • When multiple replicas are behind a Kubernetes Service, the API server's
        requests are load-balanced across the replicas. If the replicas do not share the
        same certificate, the API server might encounter inconsistent certificates
        during successive requests.

      • The API server logs errors like:

        x509: certificate signed by unknown authority
        

        This disrupts the admission process, potentially causing delays or failures in
        pod creation.

      • The caBundle field in the ValidatingWebhookConfiguration is critical for
        ensuring the API server trusts the webhook service. For more details, refer to
        the Kubernetes documentation on
        Admission Webhooks.
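
For reference, here is a minimal sketch of one alternative to the delete-and-create
pattern quoted under alternative 1, assuming the controller uses controller-runtime.
It reconciles the ValidatingWebhookConfiguration in place with
controllerutil.CreateOrUpdate, so the configuration is never deleted during
reconciles, and it deliberately carries over any failurePolicy an operator has
patched. The function and configuration names are placeholders, not the current
implementation.

    // Sketch only (assumed names): update the webhook configuration in place
    // instead of deleting and recreating it.
    package webhook

    import (
        "context"

        admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "sigs.k8s.io/controller-runtime/pkg/client"
        "sigs.k8s.io/controller-runtime/pkg/controller/controllerutil"
    )

    func ensureWebhookConfig(ctx context.Context, c client.Client, desired []admissionregistrationv1.ValidatingWebhook) error {
        cfg := &admissionregistrationv1.ValidatingWebhookConfiguration{
            // The configuration name below is an assumption for illustration.
            ObjectMeta: metav1.ObjectMeta{Name: "fleet-validating-webhook-configuration"},
        }
        _, err := controllerutil.CreateOrUpdate(ctx, c, cfg, func() error {
            // Remember any failurePolicy already set on the live object (e.g.,
            // patched by an operator) so the update does not revert it.
            existing := map[string]*admissionregistrationv1.FailurePolicyType{}
            for i := range cfg.Webhooks {
                existing[cfg.Webhooks[i].Name] = cfg.Webhooks[i].FailurePolicy
            }
            cfg.Webhooks = desired
            for i := range cfg.Webhooks {
                if p, ok := existing[cfg.Webhooks[i].Name]; ok && p != nil {
                    cfg.Webhooks[i].FailurePolicy = p
                }
            }
            return nil
        })
        return err
    }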

Proposed Solutions

  1. Persistent Webhook Configuration:

    • Support persistent configuration for webhook fields (e.g., failurePolicy) to
      avoid manual patching.
    • Can additional namespaces be exempted from the webhook to simplify its
      configuration and reduce operational overhead?
  2. Multi-Replica Hub-Agent:

    • Address TLS handshake errors by:
      • Using a shared certificate source (e.g., a Kubernetes Secret mounted into
        every replica) to prevent mismatches; a sketch follows this list.
    • Wire the replicas and strategy fields into the hub-agent deployment and add
      PodDisruptionBudgets for upgrade reliability.
  3. Namespace Flexibility:

    • Allow pod creation in custom namespaces by updating the webhook configuration
      to support broader namespace scopes (see the namespaceSelector sketch after
      this list).
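
Regarding proposed solution 2, the following is a minimal sketch of what a shared
certificate source could look like on the server side, assuming a recent
controller-runtime and assuming every hub-agent replica mounts the same TLS Secret
at the directory passed via --webhook-cert-dir (the flag names, default path, and
Secret are illustrative, not the current cmd/hubagent/main.go). Because each replica
serves the same tls.crt/tls.key, the API server receives an identical certificate
regardless of which replica the Service routes to, and the caBundle only needs to
trust that one issuer.

    // Sketch only (assumed flag names, paths, and Secret): every replica reads
    // the same serving certificate from a shared mounted Secret.
    package main

    import (
        "flag"
        "os"

        ctrl "sigs.k8s.io/controller-runtime"
        "sigs.k8s.io/controller-runtime/pkg/log/zap"
        crwebhook "sigs.k8s.io/controller-runtime/pkg/webhook"
    )

    func main() {
        // The Secret (e.g., fleet-webhook-cert) would be mounted at certDir in
        // every replica of the hub-agent Deployment.
        certDir := flag.String("webhook-cert-dir", "/tmp/k8s-webhook-server/serving-certs", "directory containing tls.crt and tls.key from the shared Secret")
        port := flag.Int("webhook-port", 9443, "webhook serving port")
        flag.Parse()

        ctrl.SetLogger(zap.New())

        mgr, err := ctrl.NewManager(ctrl.GetConfigOrDie(), ctrl.Options{
            // All replicas present the same certificate, so TLS handshakes do
            // not depend on which replica answers the API server's request.
            WebhookServer: crwebhook.NewServer(crwebhook.Options{
                CertDir: *certDir,
                Port:    *port,
            }),
        })
        if err != nil {
            os.Exit(1)
        }
        if err := mgr.Start(ctrl.SetupSignalHandler()); err != nil {
            os.Exit(1)
        }
    }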
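
Regarding proposed solution 3, below is a minimal sketch of how the webhook could
exempt arbitrary namespaces with a namespaceSelector instead of relying on name
prefixes. The webhook name, label key (admission.fleet.io/ignore), and rule set are
assumptions for illustration, not the current configuration: any namespace carrying
the opt-out label would be skipped by the webhook, so pods there are never blocked.

    // Sketch only (assumed names and label): skip validation for namespaces
    // that carry an opt-out label, regardless of their name prefix.
    package webhook

    import (
        admissionregistrationv1 "k8s.io/api/admissionregistration/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    func podValidatingWebhook(clientConfig admissionregistrationv1.WebhookClientConfig) admissionregistrationv1.ValidatingWebhook {
        failurePolicy := admissionregistrationv1.Fail
        sideEffects := admissionregistrationv1.SideEffectClassNone
        return admissionregistrationv1.ValidatingWebhook{
            Name:                    "pod.validation.fleet.io", // illustrative name
            ClientConfig:            clientConfig,
            FailurePolicy:           &failurePolicy,
            SideEffects:             &sideEffects,
            AdmissionReviewVersions: []string{"v1"},
            // Namespaces labeled admission.fleet.io/ignore (any value) are
            // excluded, so custom namespaces can opt out without a fleet-* or
            // kube-* prefix.
            NamespaceSelector: &metav1.LabelSelector{
                MatchExpressions: []metav1.LabelSelectorRequirement{{
                    Key:      "admission.fleet.io/ignore",
                    Operator: metav1.LabelSelectorOpDoesNotExist,
                }},
            },
            Rules: []admissionregistrationv1.RuleWithOperations{{
                Operations: []admissionregistrationv1.OperationType{admissionregistrationv1.Create},
                Rule: admissionregistrationv1.Rule{
                    APIGroups:   []string{""},
                    APIVersions: []string{"v1"},
                    Resources:   []string{"pods"},
                },
            }},
        }
    }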

Request and Questions

  • Can you assist in resolving the TLS issues with multiple replicas to ensure
    upgrades do not fail?
  • Would you be open to supporting persistent webhook configuration and namespace
    flexibility?
