pkg/start: Release leader lease on graceful shutdown
So the incoming cluster-version operator doesn't need to wait for the
outgoing operator's lease to expire, which can take a while [1]:
I0802 10:06:01.056591 1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
...
I0802 10:07:42.632719 1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version
and time out the:
Cluster did not acknowledge request to upgrade in a reasonable time
testcase [2]. Using ReleaseOnCancel has been the plan since
2b81f47 (cvo: Release our leader lease when we are gracefully
terminated, 2019-01-16, #87). I'm not clear on why it (sometimes?)
doesn't work today.
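For reference, the client-go wiring involved looks roughly like this
(the lock construction and durations here are illustrative, not our
exact configuration):

import (
	"context"
	"time"

	"k8s.io/client-go/tools/leaderelection"
	"k8s.io/client-go/tools/leaderelection/resourcelock"
)

// runWithLease blocks, holding the leader lease while run executes.
// ReleaseOnCancel clears the lock record when ctx is canceled, so the
// incoming operator can acquire the lease immediately instead of
// waiting out LeaseDuration.
func runWithLease(ctx context.Context, lock resourcelock.Interface, run func(context.Context)) {
	leaderelection.RunOrDie(ctx, leaderelection.LeaderElectionConfig{
		Lock:            lock,
		LeaseDuration:   90 * time.Second, // illustrative durations
		RenewDeadline:   45 * time.Second,
		RetryPeriod:     30 * time.Second,
		ReleaseOnCancel: true, // release the lease on graceful shutdown
		Callbacks: leaderelection.LeaderCallbacks{
			OnStartedLeading: run,
			OnStoppedLeading: func() {
				// We lost (or released) the lease; stop writing to the API.
			},
		},
	})
}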
The discrepancy between the "exit after 2s no matter what" comment and
the 5s After dates back to dbedb7a (cvo: When the CVO restarts,
perform one final sync to write status, 2019-04-27, #179), which
bumped the After from 2s to 5s, but forgot to bump the comment. I'm
removing that code here in favor of the two-minute timeout from
b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, #349).
We still exit immediately on a second TERM, for folks who get
impatient waiting for the graceful timeout.
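The double-TERM handling is roughly (function and channel names are
illustrative, not the exact ones in pkg/start):

import (
	"context"
	"os"
	"os/signal"
	"syscall"

	"k8s.io/klog"
)

// handleSignals cancels the run context on the first TERM/INT and
// exits immediately on the second, for the impatient.
func handleSignals(runCancel context.CancelFunc) {
	ch := make(chan os.Signal, 2)
	signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
	go func() {
		sig := <-ch
		klog.Infof("Shutting down due to %s; send the signal again to exit immediately.", sig)
		runCancel() // begin the graceful, time-bounded shutdown
		sig = <-ch
		klog.Infof("Received %s again; exiting immediately.", sig)
		os.Exit(1)
	}()
}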
Decouple shutdownContext from the context passed into Options.run, to
allow TestIntegrationCVO_gracefulStepDown to request a graceful
shutdown. And remove Context.Start(), inlining the logic in
Options.run so we can count and reap the goroutines it used to launch.
This also allows us to be more targeted with the context for each
goroutine (there's a rough sketch of the layering after this list):
* Informers are now launched before the lease controller, so they're
up and running by the time we acquire the lease. They remain
running until the main operator CVO.Run() exits, after which we shut
them down. Having informers running before we have a lease is
somewhat expensive in terms of API traffic, but we should rarely
have two CVO pods competing for leadership since we transitioned to
the Recreate Deployment strategy in 078686d
(install/0000_00_cluster-version-operator_03_deployment: Set
'strategy: Recreate', 2019-03-20, #140) and 5d8a527
(install/0000_00_cluster-version-operator_03_deployment: Fix
Recreate strategy, 2019-04-03, #155). I don't see a way to block on
their internal goroutines' completion, but maybe informers will grow
an API for that in the future.
* The metrics server also continues to run until CVO.Run() exits,
where previously we began gracefully shutting it down at the same
time we started shutting down CVO.Run(). This ensures we are around
and publishing any last-minute CVO.Run() changes.
* Leader election also continues to run until CVO.Run() exits. We
don't want to release the lease while we're still controlling
things.
* CVO.Run() and AutoUpdate.Run() both stop immediately when the
passed-in context is canceled or we call runCancel internally
(because of a TERM, error from a goroutine, or loss of leadership).
These are the only two goroutines that are actually writing to the
API servers, so we want to shut them down as quickly as possible.
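Rough sketch of that layering (all names here are illustrative, and
the function parameters stand in for the real informer, metrics, and
CVO entry points):

import "context"

// layeredContexts shows the intended lifetimes: postMainContext keeps
// informers, the metrics server, and leader election alive until the
// main operator goroutine has been collected; runContext is what
// CVO.Run() and AutoUpdate.Run() consume.
func layeredContexts(ctx context.Context, startInformers func(stopCh <-chan struct{}), runMetrics, runOperator func(context.Context)) {
	postMainContext, postMainCancel := context.WithCancel(ctx)
	defer postMainCancel()
	runContext, runCancel := context.WithCancel(postMainContext)
	defer runCancel()

	go startInformers(postMainContext.Done()) // up and running before we hold the lease
	go runMetrics(postMainContext)            // still publishing while CVO.Run() winds down
	runOperator(runContext)                   // the API writer; canceled as early as possible
	postMainCancel()                          // main operator collected; now stop the rest and release the lease
}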
Drop an unnecessary runCancel() from the "shutting down" branch of the
error collector. I'd added it in b30aa0e, but you can only ever
get into the "shutting down" branch if runCancel has already been
called. And fix the scoping for the shutdownTimer variable so we
don't clear it on each for-loop iteration (oops :p, bug from
b30aa0e).
Add some logging to the error collector, so it's easier to see where
we are in the collection process from the operator logs. Also start
logging collected goroutines by name, so we can figure out which may
still be outstanding.
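Condensed, the collector now looks something like this (names are
illustrative; the real code tracks more state):

import (
	"context"
	"time"

	"k8s.io/klog"
)

type asyncResult struct {
	name  string
	error error
}

// collect reaps the launched goroutines, arming a single two-minute
// timer once shutdown begins. shutdownTimer is declared outside the
// loop; re-declaring it inside each iteration was the scoping bug.
func collect(runContext context.Context, runCancel, shutdownCancel context.CancelFunc, resultChannel chan asyncResult, resultChannelCount int) error {
	var firstError error
	var shutdownTimer *time.Timer
	for resultChannelCount > 0 {
		if shutdownTimer == nil { // running
			select {
			case <-runContext.Done(): // TERM, goroutine error, or lost lease
				klog.Info("Run context completed; beginning two-minute graceful shutdown period.")
				shutdownTimer = time.NewTimer(2 * time.Minute)
			case result := <-resultChannel:
				resultChannelCount--
				if result.error == nil {
					klog.Infof("Collected %s goroutine.", result.name)
				} else {
					klog.Errorf("Collected %s goroutine: %v", result.name, result.error)
					if firstError == nil {
						firstError = result.error
					}
					runCancel() // this will cause shutdownTimer initialization in the next loop
				}
			}
		} else { // shutting down; runCancel has necessarily already been called
			select {
			case <-shutdownTimer.C: // graceful window expired
				shutdownCancel() // force the stragglers down
			case result := <-resultChannel:
				resultChannelCount--
				klog.Infof("Collected %s goroutine.", result.name)
			}
		}
	}
	return firstError
}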
Set terminationGracePeriodSeconds 130 to extend the default 30s [3],
to give the container the full two-minute graceful timeout window
before the kubelet steps in with a KILL.
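In the Deployment manifest that amounts to (fragment):

# install/0000_00_cluster-version-operator_03_deployment.yaml (fragment)
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 130  # two-minute graceful window plus 10s of slack before SIGKILL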
Push the Background() initialization all the way up to the
command-line handler, to make it more obvious that the context is
scoped to the whole 'start' invocation.
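Something like (cobra wiring shown schematically; opts stands in for
the existing start Options):

import (
	"context"

	"github.com/spf13/cobra"
	"k8s.io/klog"
)

func newStartCommand(opts *Options) *cobra.Command {
	return &cobra.Command{
		Use:   "start",
		Short: "Starts the cluster-version operator",
		Run: func(cmd *cobra.Command, args []string) {
			// The root context for the whole 'start' invocation lives
			// here in the command-line handler, not buried in run().
			ctx := context.Background()
			if err := opts.run(ctx); err != nil {
				klog.Fatalf("error: %v", err)
			}
		},
	}
}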
[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core
squash! pkg/start: Release leader lease on graceful shutdown
 	OnStartedLeading: func(_ context.Context) { // no need for this passed-through postMainContext, because goroutines we launch inside will use runContext

 				runCancel() // this will cause shutdownTimer initialization in the next loop
 			}
+			if result.name == "main operator" {
+				postMainCancel()
+			}
 		}
 	} else { // shutting down
 		select {
 		case <-shutdownTimer.C: // never triggers after the channel is stopped, although it would not matter much if it did because subsequent cancel calls do nothing.