### Version

4.0.1

### What Kubernetes platforms are you running on?

AKS (Azure)

### Steps to reproduce
I've spotted that if you modify an existing deployment in a certain way (more on this below), then sometimes Nginx fails to resolve the correct IP address for the destination, and the generated conf file contains something like this at the top:
```nginx
upstream example-minion-example.com-example.com-example-443 {
    zone example-minion-example.com-example.com-example-443 256k;
    random two least_conn;
    server 127.0.0.1:8181 max_fails=1 fail_timeout=10s max_conns=0;
}
# all content below this point removed from the example, but it is also completely fine
```
The problem is this bit: `server 127.0.0.1:8181`.
In my setup, and in the environment where I'm testing this, I have 2 replicas of the Nginx Ingress Controller pods.
On every occasion where I've replicated this issue, only 1 pod retains the invalid conf. During a config reload I can observe both pods initially with `server 127.0.0.1:8181`, but shortly afterwards one of the pods resolves the correct address, i.e.:
```nginx
upstream example-minion-example.com-example.com-example-443 {
    zone example-minion-example.com-example.com-example-443 256k;
    random two least_conn;
    server 10.156.67.181:5001 max_fails=1 fail_timeout=10s max_conns=0;
}
```
In order to replicate this, I deploy a K8s Deployment, Service, and Ingress resource. The Deployment spec configures one containerPort, the Service also has one port (I start with 80), and the Ingress resource is configured to route all traffic to the Service on port 80 (sketched below).
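Roughly, the initial manifests look like this (a minimal sketch; the names, image, and container port number are placeholders rather than my actual values):

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: example-workload
  namespace: example-ns
spec:
  replicas: 1
  selector:
    matchLabels:
      app: example-workload
  template:
    metadata:
      labels:
        app: example-workload
    spec:
      containers:
        - name: app
          image: example.azurecr.io/example-app:latest  # placeholder image
          ports:
            - name: http
              containerPort: 5000  # a single container port to begin with
---
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: example-ns
spec:
  selector:
    app: example-workload
  ports:
    - name: http
      port: 80
      targetPort: http
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  namespace: example-ns
spec:
  ingressClassName: nginx
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-svc
                port:
                  number: 80
```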
Everything is fine at this point.
Then, to trigger the issue, I redeploy the aforementioned resources, but the Deployment now configures two containerPorts, the Service has two ports (80 + 443, each mapping to one of the container ports), and the Ingress resource is updated to route all traffic to the Service on port 443 (again, sketched below).
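The relevant changes look roughly like this (again a sketch with placeholder names; I've used 5001 for the second container port so it lines up with the healthy upstream address shown above):

```yaml
# Deployment: the container now exposes a second port, e.g.
#   ports:
#     - name: http
#       containerPort: 5000
#     - name: https
#       containerPort: 5001
---
apiVersion: v1
kind: Service
metadata:
  name: example-svc
  namespace: example-ns
spec:
  selector:
    app: example-workload
  ports:
    - name: http
      port: 80
      targetPort: http
    - name: https
      port: 443
      targetPort: https
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: example-ingress
  namespace: example-ns
spec:
  ingressClassName: nginx
  rules:
    - host: example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: example-svc
                port:
                  number: 443
```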
One of the two Nginx pods handles this config change just fine and continues to route to the workload as expected, but the other fails and starts issuing 502 responses. When it does, the following logs are produced (with debug log level set):
{"time":"2025-09-23T10:24:20.59820435Z","level":"DEBUG","source":{"file":"verify.go","line":90},"msg":"success, version 27 ensured. took: 128.491111ms"}
2025/09/23 10:24:20 [notice] 18#18: signal 17 (SIGCHLD) received from 204
2025/09/23 10:24:20 [notice] 18#18: worker process 204 exited with code 0
2025/09/23 10:24:20 [notice] 18#18: worker process 206 exited with code 0
2025/09/23 10:24:20 [notice] 18#18: signal 29 (SIGIO) received
2025/09/23 10:24:20 [notice] 18#18: signal 17 (SIGCHLD) received from 207
2025/09/23 10:24:20 [notice] 18#18: worker process 207 exited with code 0
2025/09/23 10:24:20 [notice] 18#18: signal 29 (SIGIO) received
2025/09/23 10:24:20 [notice] 18#18: signal 17 (SIGCHLD) received from 205
2025/09/23 10:24:20 [notice] 18#18: worker process 205 exited with code 0
2025/09/23 10:24:20 [notice] 18#18: signal 29 (SIGIO) received
{"time":"2025-09-23T10:24:20.846495916Z","level":"DEBUG","source":{"file":"endpoint_slice.go","line":33},"msg":"Removing EndpointSlice: example-pod-km6xm"}
{"time":"2025-09-23T10:24:20.846534116Z","level":"DEBUG","source":{"file":"task_queue.go","line":65},"msg":"Adding an element with a key: example-ns/example-pod-km6xm"}
{"time":"2025-09-23T10:24:20.846551817Z","level":"DEBUG","source":{"file":"task_queue.go","line":98},"msg":"Syncing example-ns/example-pod-km6xm"}
{"time":"2025-09-23T10:24:20.846561717Z","level":"DEBUG","source":{"file":"task_queue.go","line":77},"msg":"The queue has 0 element(s)"}
{"time":"2025-09-23T10:24:20.846568217Z","level":"DEBUG","source":{"file":"controller.go","line":1019},"msg":"Syncing example-ns/example-pod-km6xm"}
2025/09/23 10:25:58 [error] 211#211: *403 connect() failed (111: Connection refused) while connecting to upstream, client: 10.157.116.134, server: example.com, request: "GET /api/v1/example-path HTTP/1.1", upstream: "https://127.0.0.1:8181/api/v1/example-path", host: "example.com"
10.157.116.134 - - [23/Sep/2025:10:25:58 +0000] "GET /api/v1/example-path HTTP/1.1" 502 559 "-" "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/140.0.0.0 Safari/537.36" "xxx.xxx.xxx.xxx,yyy.yyy.yyy.yyy:37220,zzz.zzz.zzz.zzz"
_Subsequent log lines then repeatedly show the 502 and connection-refused entries as more requests come in._
In order to resolve the issue, I simply need to trigger a config reload. I've demonstrated this works by restarting the affected Nginx pod, restarting the upstream workload, or redeploying the upstream workload (example commands below).
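For reference, any of the following is enough in my environment (pod, deployment, and namespace names are placeholders):

```shell
# Restart the affected Ingress Controller pod; it rebuilds its config on startup
kubectl delete pod <affected-nginx-ingress-pod> -n nginx-ingress

# ...or restart/redeploy the upstream workload, which triggers an endpoint update and a reload
kubectl rollout restart deployment example-workload -n example-ns
```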
Note that I've so far been unable to replicate this issue when applying subsequent deployments. It seems to only affect the very first deployment that adds the additional port to the Service (or maybe it's specifically the flip from port 80 to port 443 in the Ingress resource that causes it).