Commit 70c9f77

Author: Michael Gasch
Add etcd WithRequireLeader option to API watches
Watches against etcd in the API server can hang forever if the etcd cluster loses quorum, e.g. when a majority of nodes crash. This fix improves responsiveness (detection and reaction time) of API server watches against etcd in some rare (but still possible) edge cases so that watches are terminated with `"etcdserver: no leader" (ErrNoLeader)`.

Implementation behavior described by jingyih:

```
The etcd server waits until it cannot find a leader for 3 election
timeouts to cancel existing streams. 3 is currently a hard coded
constant. The election timeout defaults to 1000ms. If the cluster is
healthy, when the leader is stopped, the leadership transfer should be
smooth. (leader transfers its leadership before stopping). If leader is
hard killed, other servers will take an election timeout to realize
leader lost and start campaign.
```

For further details, discussion and validation see kubernetes#89488 (comment) and etcd-io/etcd#8980.

Closes: kubernetes#89488
Signed-off-by: Michael Gasch <[email protected]>
1 parent c1f1b1b commit 70c9f77

1 file changed, +9 -1 lines changed

staging/src/k8s.io/apiserver/pkg/storage/etcd3/watcher.go

Lines changed: 9 additions & 1 deletion
@@ -126,7 +126,15 @@ func (w *watcher) createWatchChan(ctx context.Context, key string, rev int64, re
 		// The filter doesn't filter out any object.
 		wc.internalPred = storage.Everything
 	}
-	wc.ctx, wc.cancel = context.WithCancel(ctx)
+
+	// The etcd server waits until it cannot find a leader for 3 election
+	// timeouts to cancel existing streams. 3 is currently a hard coded
+	// constant. The election timeout defaults to 1000ms. If the cluster is
+	// healthy, when the leader is stopped, the leadership transfer should be
+	// smooth. (leader transfers its leadership before stopping). If leader is
+	// hard killed, other servers will take an election timeout to realize
+	// leader lost and start campaign.
+	wc.ctx, wc.cancel = context.WithCancel(clientv3.WithRequireLeader(ctx))
 	return wc
 }
 
