
Non-leader concierge pods permanently fail to load impersonation proxy TLS certificate (impersonation-proxy-serving-cert) #2933

@sedflix

Description


What happened?

In multi-replica Pinniped Concierge deployments using the impersonation proxy (mode: auto on GKE), non-leader pods permanently fail to load the impersonation-proxy-serving-cert TLS certificate. The DynamicServingCertificateController retries every 60 seconds but never succeeds. This causes tls: internal error for clients when the LoadBalancer routes to a non-leader pod.

Important distinction: There are two separate empty-certificate errors in the logs. Only one is a bug:

| Certificate | Behavior | Bug? |
|---|---|---|
| `concierge-serving-cert` | Transient at startup (<2 seconds); resolves once the leader populates the cert | No — normal startup race |
| `impersonation-proxy-serving-cert` | Persistent; retries every 60 seconds on non-leader pods and never resolves | Yes — this is the bug |

Non-leader pod logs (persistent, never resolves):

Repeats every 60 seconds indefinitely:

  {"level":"error","message":"Unhandled Error",
   "error":"key failed with : not loading an empty serving certificate from \"impersonation-proxy-serving-cert\""}

Repeats every ~3 minutes indefinitely:

  {"level":"error","message":"Unhandled Error",
   "error":"impersonator-config-controller: { } failed with:
     [...] write attempt rejected as client is not leader,
     failed to update CredentialIssuer status: [...] write attempt rejected as client is not leader"}

Client-side error (intermittent, depends on LoadBalancer routing):

  remote error: tls: internal error

Leader pod works correctly — acquires lease, loads certs, handles TokenCredentialRequests.

What did you expect to happen?

Non-leader pods should load the existing impersonation-proxy-serving-cert from the Kubernetes Secret (created by the leader) and serve TLS successfully. The ensureTLSSecretIsCreatedAndLoaded() function in impersonator_config.go already has a read-only path for this case — when the Secret exists, it reads from the informer cache (zero writes) and calls loadTLSCertFromSecret(). This path would succeed on non-leader pods, but it is never reached.

What is the simplest way to reproduce this behavior?

  1. Deploy Pinniped Concierge on GKE with replicas: 2 and impersonation proxy mode: auto
  2. Wait for leader election to complete
  3. Check the non-leader pod logs:
    • "impersonation-proxy-serving-cert" errors every 60 seconds (persistent, never resolves)
    • "write attempt rejected as client is not leader" errors every ~3 minutes
  4. Run kubectl commands repeatedly — some succeed (hit leader), some fail with tls: internal error (hit non-leader)

In what environment did you see this bug?

  • Pinniped server version: v0.40.0 (also reproduced with v0.44.0)
  • Pinniped client version: v0.44.0
  • Pinniped container image: ghcr.io/vmware/pinniped/pinniped-server:v0.40.0 (also with v0.44.0)
  • Pinniped configuration: OIDCIdentityProvider (Dex) → Pinniped Supervisor → Concierge JWTAuthenticator. Impersonation proxy mode: auto. Service type: LoadBalancer (Internal GKE).
  • Kubernetes version: GKE 1.31.x (managed control plane)
  • Cloud provider: Google Cloud (GKE)

What else is there to know about this bug?

Root Cause Analysis

The doSync() method in internal/controller/impersonatorconfig/impersonator_config.go executes steps sequentially. Service write operations execute before the TLS cert-loading step. On non-leader pods, the leader election middleware rejects the Service write with ErrNotLeader, causing doSync() to return early — the cert-loading code is never reached.

  Step 1: ensureImpersonatorIsStarted()          ✅ read-only, works on non-leader
  Step 2: ensureLoadBalancerIsStarted()           ❌ WRITES to Service → ErrNotLeader → RETURNS EARLY
  Step 3: ensureClusterIPServiceIsStarted()       ⛔ never reached
  Step 4: ensureCAAndTLSSecrets()                 ⛔ never reached ← cert loading here
           └─ ensureTLSSecretIsCreatedAndLoaded()
                └─ loadTLSCertFromSecret()        ⛔ never reached ← read-only, would succeed!

The dynamiccert.Private provider (tlsServingCertDynamicCertProvider) is entirely passive — no informer, no file watcher, no background goroutine. The only way it gets populated is via SetCertKeyContent() inside loadTLSCertFromSecret(), which is gated behind the failing write operations.

On subsequent sync cycles, the informer re-enqueues the controller. But each retry hits the same early-return: Service update → ErrNotLeader → return. The cert-loading code is never reached, regardless of how many retries occur.

Workaround

Setting replicas: 1 on the pinniped-concierge Deployment eliminates the issue by ensuring only the leader pod exists. The startup-transient concierge-serving-cert errors still occur but resolve within seconds once the single pod acquires the lease.
