Description
What happened?
In multi-replica Pinniped Concierge deployments using the impersonation proxy (mode: auto on GKE), non-leader pods permanently fail to load the impersonation-proxy-serving-cert TLS certificate. The DynamicServingCertificateController retries every 60 seconds but never succeeds. This causes tls: internal error for clients when the LoadBalancer routes to a non-leader pod.
Important distinction: There are two separate empty-certificate errors in the logs. Only one is a bug:
| Certificate | Behavior | Bug? |
|---|---|---|
| `concierge-serving-cert` | Transient at startup (<2 seconds), resolves once leader populates cert | No — normal startup race |
| `impersonation-proxy-serving-cert` | Persistent every 60 seconds on non-leader pods, never resolves | Yes — this is the bug |
Non-leader pod logs (persistent, never resolves):

Repeats every 60 seconds indefinitely:

```
{"level":"error","message":"Unhandled Error",
"error":"key failed with : not loading an empty serving certificate from "impersonation-proxy-serving-cert""}
```

Repeats every ~3 minutes indefinitely:

```
{"level":"error","message":"Unhandled Error",
"error":"impersonator-config-controller: { } failed with:
[...] write attempt rejected as client is not leader,
failed to update CredentialIssuer status: [...] write attempt rejected as client is not leader"}
```

Client-side error (intermittent, depends on LoadBalancer routing):

```
remote error: tls: internal error
```
Leader pod works correctly — acquires lease, loads certs, handles TokenCredentialRequests.
What did you expect to happen?
Non-leader pods should load the existing impersonation-proxy-serving-cert from the Kubernetes Secret (created by the leader) and serve TLS successfully. The ensureTLSSecretIsCreatedAndLoaded() function in impersonator_config.go already has a read-only path for this case — when the Secret exists, it reads from the informer cache (zero writes) and calls loadTLSCertFromSecret(). This path would succeed on non-leader pods, but it is never reached.
What is the simplest way to reproduce this behavior?
- Deploy Pinniped Concierge on GKE with `replicas: 2` and impersonation proxy `mode: auto`
- Wait for leader election to complete
- Check the non-leader pod logs: `"impersonation-proxy-serving-cert"` errors every 60 seconds (persistent, never resolves) and `"write attempt rejected as client is not leader"` errors every ~3 minutes
- Run `kubectl` commands repeatedly — some succeed (hit the leader), some fail with `tls: internal error` (hit a non-leader)
In what environment did you see this bug?
- Pinniped server version: v0.40.0 (also reproduced with v0.44.0)
- Pinniped client version: v0.44.0
- Pinniped container image: `ghcr.io/vmware/pinniped/pinniped-server:v0.40.0` (also with v0.44.0)
- Pinniped configuration: OIDCIdentityProvider (Dex) → Pinniped Supervisor → Concierge JWTAuthenticator. Impersonation proxy mode: `auto`. Service type: `LoadBalancer` (internal GKE).
- Kubernetes version: GKE 1.31.x (managed control plane)
- Cloud provider: Google Cloud (GKE)
What else is there to know about this bug?
Root Cause Analysis
The doSync() method in internal/controller/impersonatorconfig/impersonator_config.go executes steps sequentially. Service write operations execute before the TLS cert-loading step. On non-leader pods, the leader election middleware rejects the Service write with ErrNotLeader, causing doSync() to return early — the cert-loading code is never reached.
```
Step 1: ensureImpersonatorIsStarted()     ✅ read-only, works on non-leader
Step 2: ensureLoadBalancerIsStarted()     ❌ WRITES to Service → ErrNotLeader → RETURNS EARLY
Step 3: ensureClusterIPServiceIsStarted() ⛔ never reached
Step 4: ensureCAAndTLSSecrets()           ⛔ never reached ← cert loading here
        └─ ensureTLSSecretIsCreatedAndLoaded()
           └─ loadTLSCertFromSecret()     ⛔ never reached ← read-only, would succeed!
```
The dynamiccert.Private provider (tlsServingCertDynamicCertProvider) is entirely passive — no informer, no file watcher, no background goroutine. The only way it gets populated is via SetCertKeyContent() inside loadTLSCertFromSecret(), which is gated behind the failing write operations.
On subsequent sync cycles, the informer re-enqueues the controller. But each retry hits the same early-return: Service update → ErrNotLeader → return. The cert-loading code is never reached, regardless of how many retries occur.
Workaround
Setting replicas: 1 on the pinniped-concierge Deployment eliminates the issue by ensuring only the leader pod exists. The startup-transient concierge-serving-cert errors still occur but resolve within seconds once the single pod acquires the lease.
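For reference, the workaround is a one-line change to the Deployment manifest. This excerpt assumes the default `pinniped-concierge` name and namespace; adjust to your install:

```yaml
# pinniped-concierge Deployment (excerpt).
# With a single replica, the only pod is always the leader,
# so the non-leader early-return path is never taken.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: pinniped-concierge
  namespace: pinniped-concierge
spec:
  replicas: 1   # workaround until the sync-ordering bug is fixed
```

Note this trades away the availability that multiple replicas were meant to provide.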