Description:
The Envoy Gateway controller panics with a nil pointer dereference when processing BackendTrafficPolicies for HTTPRoutes that have cross-namespace backend references without a matching ReferenceGrant.
The invalid reference is correctly detected and logged as an error, but the controller then panics instead of gracefully skipping the route. This causes the gateway-api reconciliation loop to restart repeatedly (~every 5 seconds), which can delay xDS updates to Envoy proxies.
Expected behavior: When an HTTPRoute has an invalid cross-namespace reference, the controller should log the error, skip applying BackendTrafficPolicy features to that route, and continue processing other routes without panicking.
Repro steps:
- Create an HTTPRoute in namespace envoy-gateway-mtls-app that references a Service in namespace default:

  apiVersion: gateway.networking.k8s.io/v1
  kind: HTTPRoute
  metadata:
    name: reflector
    namespace: envoy-gateway-mtls-app
  spec:
    parentRefs:
    - name: gateway-mtls-reflector
      namespace: envoy-gateway-mtls-app
    rules:
    - matches:
      - path:
          type: PathPrefix
          value: "/"
      backendRefs:
      - name: reflector
        namespace: default # cross-namespace reference
        port: 80

- Do NOT create a ReferenceGrant allowing this cross-namespace reference
- Have any BackendTrafficPolicy in the cluster (it doesn't need to target this route; a minimal example follows these steps)
- Observe the controller logs - the panic occurs on every reconciliation
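For the third step, any policy should do. A minimal sketch, assuming the policy targets the Gateway named in the HTTPRoute above (the policy name and the retry setting are arbitrary placeholders):

  apiVersion: gateway.envoyproxy.io/v1alpha1
  kind: BackendTrafficPolicy
  metadata:
    name: any-policy # placeholder name
    namespace: envoy-gateway-mtls-app
  spec:
    targetRefs:
    - group: gateway.networking.k8s.io
      kind: Gateway
      name: gateway-mtls-reflector
    retry:
      numRetries: 2 # arbitrary; any policy feature should trigger the code path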
Environment:
- Envoy Gateway version: v1.6.0
- Kubernetes: GKE 1.31
- Go version (from stack trace): 1.25.3
- Single controller managing multiple Gateways across namespaces
Logs:
First the ReferenceGrant error is logged:
ERROR provider kubernetes/routes.go:269 failed to process BackendRef for HTTPRoute
{"runner": "provider",
"httpRoute": {"name":"reflector","namespace":"envoy-gateway-mtls-app"},
"backendRef": {"group":"","kind":"Service","name":"reflector","namespace":"default","port":80},
"error": "no matching ReferenceGrants found: from HTTPRoute/envoy-gateway-mtls-app to Service/default"}
Then immediately the panic:
ERROR watchable message/watchutil.go:57 observed a panic
{"runner": "gateway-api",
"error": "runtime error: invalid memory address or nil pointer dereference",
"stackTrace": "goroutine 216 [running]:
runtime/debug.Stack()
/opt/hostedtoolcache/go/1.25.3/x64/src/runtime/debug/stack.go:26 +0x5e
github.com/envoyproxy/gateway/internal/message.handleWithCrashRecovery[...].func1()
/home/runner/work/gateway/gateway/internal/message/watchutil.go:58 +0x1fe
panic({0x3552d00?, 0xb615380?})
/opt/hostedtoolcache/go/1.25.3/x64/src/runtime/panic.go:783 +0x132
github.com/envoyproxy/gateway/internal/gatewayapi.(*Translator).applyTrafficFeatureToRoute(...)
/home/runner/work/gateway/gateway/internal/gatewayapi/backendtrafficpolicy.go:765 +0x768
github.com/envoyproxy/gateway/internal/gatewayapi.(*Translator).translateBackendTrafficPolicyForRoute(...)
/home/runner/work/gateway/gateway/internal/gatewayapi/backendtrafficpolicy.go:635 +0x2ca
github.com/envoyproxy/gateway/internal/gatewayapi.(*Translator).processBackendTrafficPolicyForRoute(...)
/home/runner/work/gateway/gateway/internal/gatewayapi/backendtrafficpolicy.go:301 +0xa0b
github.com/envoyproxy/gateway/internal/gatewayapi.(*Translator).ProcessBackendTrafficPolicies(...)
/home/runner/work/gateway/gateway/internal/gatewayapi/backendtrafficpolicy.go:107 +0x197c
github.com/envoyproxy/gateway/internal/gatewayapi.(*Translator).Translate(...)
/home/runner/work/gateway/gateway/internal/gatewayapi/translator.go:284 +0x848
github.com/envoyproxy/gateway/internal/gatewayapi/runner.(*Runner).subscribeAndTranslate.func1(...)
/home/runner/work/gateway/gateway/internal/gatewayapi/runner/runner.go:176 +0x571"}
Workaround: Create a ReferenceGrant to allow the cross-namespace reference, or move the HTTPRoute and Service to the same namespace.
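A sketch of the ReferenceGrant workaround, using the namespaces and Service from the repro (the grant name is arbitrary):

  apiVersion: gateway.networking.k8s.io/v1beta1
  kind: ReferenceGrant
  metadata:
    name: allow-reflector-httproutes # placeholder name
    namespace: default # must live in the target (Service) namespace
  spec:
    from:
    - group: gateway.networking.k8s.io
      kind: HTTPRoute
      namespace: envoy-gateway-mtls-app
    to:
    - group: "" # core API group
      kind: Service
      name: reflector # optional; omit to allow references to all Services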
Observability:
The panic is observable via the watchable_panics_recovered_total metric:
rate(watchable_panics_recovered_total{runner="gateway-api", status="failure"}[5m]) > 0
Note that standard controller-runtime metrics don't capture this panic:
- controller_runtime_reconcile_panics_total stays at 0 (different code path)
- controller_runtime_reconcile_errors_total stays at 0
- Pod does not restart
The watchable_panics_recovered_total metric only increments when reconciliation is triggered (e.g., on resource changes). A cluster can be in a broken steady state with a flat counter if no changes occur.
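To page on this, the query above can be wrapped in an alert rule. A sketch assuming the prometheus-operator PrometheusRule CRD (rule, group, and namespace names are placeholders); note the caveat above still applies, since the alert only fires while reconciliations are actually being triggered:

  apiVersion: monitoring.coreos.com/v1
  kind: PrometheusRule
  metadata:
    name: envoy-gateway-panics # placeholder name
    namespace: monitoring # placeholder namespace
  spec:
    groups:
    - name: envoy-gateway
      rules:
      - alert: EnvoyGatewayTranslationPanic
        # Fires after 5 minutes of sustained panic recoveries in the
        # gateway-api runner; a flat counter (no reconciliations) won't fire.
        expr: rate(watchable_panics_recovered_total{runner="gateway-api", status="failure"}[5m]) > 0
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Envoy Gateway gateway-api runner is recovering from panics"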