Operator cold-start reconciliation strips Konnect env vars from running DataPlane when KonnectExtension is temporarily not Ready #3714
Description
Current Behavior
When the operator starts with a cold cache (upgrade, restart, crash recovery) and the KonnectExtension is not Ready at the time of first reconciliation, the DataPlane controller rebuilds the deployment spec without Konnect env vars and applies it. This triggers a rolling update to pods that have no Konnect connectivity.
When the extension later becomes Ready, the operator re-adds the Konnect env vars, causing a second rolling update. The DataPlane gets bounced twice and loses its Konnect connection for the duration of the gap.
On a warm restart (operator already running with populated caches), the same condition (extension temporarily not Ready) does NOT modify the existing deployment. The cold-start path lacks this protection.
The Konnect env vars that get stripped:
- KONG_CLUSTER_CONTROL_PLANE
- KONG_CLUSTER_MTLS
- KONG_CLUSTER_SERVER_NAME
- KONG_CLUSTER_TELEMETRY_ENDPOINT
- KONG_CLUSTER_TELEMETRY_SERVER_NAME
- KONG_INCREMENTAL_SYNC
- KONG_KONNECT_MODE
- KONG_LUA_SSL_TRUSTED_CERTIFICATE
- KONG_ROLE
- KONG_VITALS
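To check quickly whether a deployment still carries these variables, a small helper along the following lines may be useful. It is a sketch only: `count_konnect_env` and `KONNECT_ENV_VARS` are hypothetical names introduced here, and `DEPLOYMENT` and `NAMESPACE` in the usage comment are placeholders for your own values.

```shell
#!/usr/bin/env bash
# Konnect-related env var names, copied from the list in this report.
KONNECT_ENV_VARS="KONG_CLUSTER_CONTROL_PLANE KONG_CLUSTER_MTLS KONG_CLUSTER_SERVER_NAME KONG_CLUSTER_TELEMETRY_ENDPOINT KONG_CLUSTER_TELEMETRY_SERVER_NAME KONG_INCREMENTAL_SYNC KONG_KONNECT_MODE KONG_LUA_SSL_TRUSTED_CERTIFICATE KONG_ROLE KONG_VITALS"

# count_konnect_env (hypothetical helper): reads whitespace-separated env var
# names on stdin and prints how many of them appear in KONNECT_ENV_VARS.
# Names repeated across several containers are counted once per occurrence.
count_konnect_env() {
  local count=0 name known
  for name in $(cat); do
    for known in $KONNECT_ENV_VARS; do
      if [ "$name" = "$known" ]; then
        count=$((count + 1))
        break
      fi
    done
  done
  echo "$count"
}

# Usage against a live cluster (DEPLOYMENT and NAMESPACE are placeholders):
#   kubectl get deployment "$DEPLOYMENT" -n "$NAMESPACE" \
#     -o jsonpath='{.spec.template.spec.containers[*].env[*].name}' | count_konnect_env
```

A result of 0 on a DataPlane that is supposed to be Konnect-connected would indicate the stripped state described above.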
Expected Behavior
When the operator cold-starts and the KonnectExtension is not Ready, the operator should NOT strip existing Konnect env vars from a running DataPlane deployment. Instead it should either:
- Preserve the existing deployment spec and set a status condition on the DataPlane (e.g. KonnectExtensionReady=False) indicating the extension is unavailable, then reconcile once it becomes Ready again,
OR
- Return early / requeue without modifying the deployment, similar to the warm-restart behavior
A temporarily unavailable KonnectExtension should never cause a healthy, connected DataPlane to lose its Konnect connection.
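The two options above amount to a small decision table, sketched here as a shell function purely for illustration. This is not the operator's code (the operator is not written in shell); `decide_dataplane_action` and its outcome labels are hypothetical, and the `wait` branch for a DataPlane that has never had Konnect env vars is an assumption that this report does not cover.

```shell
#!/usr/bin/env bash
# decide_dataplane_action EXTENSION_READY DEPLOYMENT_HAS_KONNECT_ENV
# Hypothetical illustration of the requested behavior. Both arguments are
# "true" or "false". Prints one of:
#   apply    - extension is Ready: build and apply the full deployment spec
#   preserve - extension is not Ready but the running deployment already has
#              Konnect env vars: leave it untouched, report the condition, requeue
#   wait     - extension is not Ready and no Konnect env vars are deployed yet:
#              requeue (assumption, not stated in the report)
decide_dataplane_action() {
  local extension_ready="$1" has_konnect_env="$2"
  if [ "$extension_ready" = "true" ]; then
    echo "apply"
  elif [ "$has_konnect_env" = "true" ]; then
    echo "preserve"
  else
    echo "wait"
  fi
}
```

The case this issue is about is `decide_dataplane_action false true`: on a cold start the current behavior corresponds to applying a stripped spec, whereas the requested outcome is `preserve`.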
Steps To Reproduce
- Deploy operator on chart v1.2.1 with ENABLE_CONTROLLER_KONNECT=true
- Deploy a GatewayConfiguration (authRef only, no explicit source), GatewayClass, and Gateway. Wait for all resources to reach Ready state.
- Confirm the DataPlane deployment has Konnect env vars present (e.g. 33 total env vars, 10 Konnect-specific). Record deployment generation (gen=1).
- Corrupt the KonnectAPIAuthConfiguration secret (replace with an invalid token):
kubectl create secret generic konnect-api-auth-secret -n <namespace> \
  --from-literal=token=kpat_INVALID_TOKEN \
  --dry-run=client -o yaml | kubectl apply -f -
- Upgrade the operator to chart v1.2.2 (triggers cold-start reconciliation):
helm upgrade kong-operator kong/kong-operator --version 1.2.2 -n <namespace> \
  --set env.ENABLE_CONTROLLER_KONNECT=true --reuse-values
- Observe: KonnectExtension goes Ready=False, operator rebuilds DataPlane deployment without Konnect env vars. Deployment generation increments (1 to 2), env var count drops (33 to 23), new ReplicaSet created with pods missing Konnect connection.
- Restore the valid secret. Observe: KonnectExtension goes Ready=True, operator re-adds Konnect env vars. Deployment generation increments again (2 to 3), another rolling update occurs.
Result: two rolling updates, and the DataPlane loses Konnect connectivity for the entire duration between steps 6 and 7.
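To capture the generation and env var counts mentioned in the steps above, something like the following could be run before and after each step. It is a sketch under assumptions: `snapshot` and `count_generation_changes` are hypothetical helper names, and the deployment name and namespace passed to `snapshot` are placeholders. Note that a generation change reflects any spec change, so it is only a proxy for a rolling update.

```shell
#!/usr/bin/env bash
# snapshot DEPLOYMENT NAMESPACE (hypothetical helper): prints
# "<generation> <total env var count>" for the given deployment.
snapshot() {
  local gen names
  gen=$(kubectl get deployment "$1" -n "$2" -o jsonpath='{.metadata.generation}')
  names=$(kubectl get deployment "$1" -n "$2" \
    -o jsonpath='{.spec.template.spec.containers[*].env[*].name}')
  echo "$gen $(echo "$names" | wc -w | tr -d ' ')"
}

# count_generation_changes (hypothetical helper): reads snapshot lines on stdin
# in chronological order and prints how many times the generation changed
# between consecutive snapshots.
count_generation_changes() {
  awk 'NR > 1 && $1 != prev { n++ } { prev = $1 } END { print n + 0 }'
}

# Usage: append one snapshot per step, then summarize.
#   snapshot "$DEPLOYMENT" "$NAMESPACE" >> snapshots.txt
#   count_generation_changes < snapshots.txt
```

Fed the values from this report (`1 33`, `2 23`, `3 33`), `count_generation_changes` reports two changes, matching the two rolling updates observed.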
Operator Version
- Kong Operator app v2.1.2 (Helm chart v1.2.2)
- Also confirmed on v2.1.1 (Helm chart v1.2.1) with operator restart instead of upgrade
kubectl version
Client Version: v1.32.x
Server Version: v1.31.x (EKS)