Controller in leader election loop after losing network access #3298

My controller, which uses leader election, lost network access to the API server. After losing leader election, the controller restarted once. After the restart it repeatedly tries to acquire the lease but still has no network access, so it ends up stuck in a loop of "connection refused" errors while trying to retrieve the leader election lock.

In my case the controller was running with `replicas=1`, so the error stopped reconciliation of resources until someone noticed and manually restarted the pod.

It would be useful if the controller failed when it is in this state, so users get a clear signal that something is wrong.

*To reproduce*

1) Start a controller with leader election
2) Disrupt network access to the API Server: `nsenter -t $PID -n iptables -A OUTPUT -p tcp --dport 443 -j DROP`
3) Observe the controller restarting once and then getting stuck in a loop with logs like:

```
E0822 08:45:07.492385       1 leaderelection.go:436] error retrieving resource lock ....
E0822 08:45:09.903736       1 leaderelection.go:436] error retrieving resource lock ....
E0822 08:45:12.148482       1 leaderelection.go:436] error retrieving resource lock ....
```

