Open
Description
My controller, which runs with leader election enabled, lost network access to the API server. After losing the leader lease, the controller restarted once. After the restart it repeatedly tries to acquire the lease but still has no network access, so it ends up stuck in a loop of "connection refused" errors while trying to retrieve the lock.
In my case the controller was running with replicas=1, meaning the error stopped reconciliation of resources until someone checked and manually restarted the pod.
It would be useful if the controller exited with an error when it is stuck in this state, so users would get a clear signal that something is wrong.
To reproduce
- Start a controller with leader election
- Disrupt network access to the API Server:
nsenter -t $PID -n iptables -A OUTPUT -p tcp --dport 443 -j DROP
- Observe the controller restarting once and then getting stuck in a loop with logs like:
E0822 08:45:07.492385 1 leaderelection.go:436] error retrieving resource lock ....
E0822 08:45:09.903736 1 leaderelection.go:436] error retrieving resource lock ....
E0822 08:45:12.148482 1 leaderelection.go:436] error retrieving resource lock ....