Description
What happened?
In multi-replica Pinniped Concierge deployments using the impersonation proxy (mode: auto on GKE), non-leader pods permanently fail to load the impersonation-proxy-serving-cert TLS certificate. The DynamicServingCertificateController retries every 60 seconds but never succeeds. This causes tls: internal error for clients when the LoadBalancer routes to a non-leader pod.
Important distinction: There are two separate empty-certificate errors in the logs. Only one is a bug:
| Certificate | Behavior | Bug? |
|---|---|---|
| `concierge-serving-cert` | Transient at startup (<2 seconds), resolves once leader populates cert | No — normal startup race |
| `impersonation-proxy-serving-cert` | Persistent every 60 seconds on non-leader pods, never resolves | Yes — this is the bug |
Non-leader pod logs (persistent, never resolves):

Repeats every 60 seconds indefinitely:

```
{"level":"error","message":"Unhandled Error",
"error":"key failed with : not loading an empty serving certificate from "impersonation-proxy-serving-cert""}
```

Repeats every ~3 minutes indefinitely:

```
{"level":"error","message":"Unhandled Error",
"error":"impersonator-config-controller: { } failed with:
[...] write attempt rejected as client is not leader,
failed to update CredentialIssuer status: [...] write attempt rejected as client is not leader"}
```

Client-side error (intermittent, depends on LoadBalancer routing):

```
remote error: tls: internal error
```
Leader pod works correctly — acquires lease, loads certs, handles TokenCredentialRequests.
What did you expect to happen?
Non-leader pods should load the existing impersonation-proxy-serving-cert from the Kubernetes Secret (created by the leader) and serve TLS successfully. The ensureTLSSecretIsCreatedAndLoaded() function in impersonator_config.go already has a read-only path for this case — when the Secret exists, it reads from the informer cache (zero writes) and calls loadTLSCertFromSecret(). This path would succeed on non-leader pods, but it is never reached.
What is the simplest way to reproduce this behavior?
- Deploy Pinniped Concierge on GKE with `replicas: 2` and impersonation proxy `mode: auto`
- Wait for leader election to complete
- Check the non-leader pod logs: `"impersonation-proxy-serving-cert"` errors every 60 seconds (persistent, never resolves) and `"write attempt rejected as client is not leader"` errors every ~3 minutes
- Run `kubectl` commands repeatedly — some succeed (hit the leader), some fail with `tls: internal error` (hit a non-leader)
In what environment did you see this bug?
- Pinniped server version: v0.40.0 (also reproduced with v0.44.0)
- Pinniped client version: v0.44.0
- Pinniped container image: `ghcr.io/vmware/pinniped/pinniped-server:v0.40.0` (also with v0.44.0)
- Pinniped configuration: OIDCIdentityProvider (Dex) → Pinniped Supervisor → Concierge JWTAuthenticator. Impersonation proxy mode: `auto`. Service type: `LoadBalancer` (internal GKE).
- Kubernetes version: GKE 1.31.x (managed control plane)
- Cloud provider: Google Cloud (GKE)
What else is there to know about this bug?
Root Cause Analysis
The doSync() method in internal/controller/impersonatorconfig/impersonator_config.go executes steps sequentially. Service write operations execute before the TLS cert-loading step. On non-leader pods, the leader election middleware rejects the Service write with ErrNotLeader, causing doSync() to return early — the cert-loading code is never reached.
```
Step 1: ensureImpersonatorIsStarted()     ✅ read-only, works on non-leader
Step 2: ensureLoadBalancerIsStarted()     ❌ WRITES to Service → ErrNotLeader → RETURNS EARLY
Step 3: ensureClusterIPServiceIsStarted() ⛔ never reached
Step 4: ensureCAAndTLSSecrets()           ⛔ never reached ← cert loading here
        └─ ensureTLSSecretIsCreatedAndLoaded()
           └─ loadTLSCertFromSecret()     ⛔ never reached ← read-only, would succeed!
```
The dynamiccert.Private provider (tlsServingCertDynamicCertProvider) is entirely passive — no informer, no file watcher, no background goroutine. The only way it gets populated is via SetCertKeyContent() inside loadTLSCertFromSecret(), which is gated behind the failing write operations.
On subsequent sync cycles, the informer re-enqueues the controller. But each retry hits the same early-return: Service update → ErrNotLeader → return. The cert-loading code is never reached, regardless of how many retries occur.
Workaround
Setting replicas: 1 on the pinniped-concierge Deployment eliminates the issue by ensuring only the leader pod exists. The startup-transient concierge-serving-cert errors still occur but resolve within seconds once the single pod acquires the lease.
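For reference, the workaround is a one-line change to the Deployment manifest. This excerpt assumes the default `pinniped-concierge` name and namespace; adjust to your install:

```yaml
# pinniped-concierge Deployment (excerpt).
# With a single replica, the only pod is always the leader,
# so the non-leader early-return path is never taken.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pinniped-concierge
  namespace: pinniped-concierge
spec:
  replicas: 1   # workaround until the sync-ordering bug is fixed
```

Note this trades away the availability that multiple replicas were meant to provide.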