Commit cc1921d

pkg/start: Release leader lease on graceful shutdown
So the incoming cluster-version operator doesn't need to wait for the outgoing operator's lease to expire, which can take a while [1]:

  I0802 10:06:01.056591 1 leaderelection.go:243] attempting to acquire leader lease openshift-cluster-version/version...
  ...
  I0802 10:07:42.632719 1 leaderelection.go:253] successfully acquired lease openshift-cluster-version/version

and time out the:

  Cluster did not acknowledge request to upgrade in a reasonable time

testcase [2]. Using ReleaseOnCancel has been the plan since 2b81f47 (cvo: Release our leader lease when we are gracefully terminated, 2019-01-16, #87). I'm not clear on why it (sometimes?) doesn't work today.

The discrepancy between the "exit after 2s no matter what" comment and the 5s After dates back to dbedb7a (cvo: When the CVO restarts, perform one final sync to write status, 2019-04-27, #179), which bumped the After from 2s to 5s but forgot to bump the comment. I'm removing that code here in favor of the two-minute timeout from b30aa0e (pkg/cvo/metrics: Graceful server shutdown, 2020-04-15, #349). We still exit immediately on a second TERM, for folks who get impatient waiting for the graceful timeout.

Decouple shutdownContext from the context passed into Options.run, to allow TestIntegrationCVO_gracefulStepDown to request a graceful shutdown. And remove Context.Start(), inlining the logic in Options.run so we can count and reap the goroutines it used to launch. This also allows us to be more targeted with the context for each goroutine:

* Informers are now launched before the lease controller, so they're up and running by the time we acquire the lease. They remain running until the main operator CVO.Run() exits, after which we shut them down. Having informers running before we have a lease is somewhat expensive in terms of API traffic, but we should rarely have two CVO pods competing for leadership since we transitioned to the Recreate Deployment strategy in 078686d (install/0000_00_cluster-version-operator_03_deployment: Set 'strategy: Recreate', 2019-03-20, #140) and 5d8a527 (install/0000_00_cluster-version-operator_03_deployment: Fix Recreate strategy, 2019-04-03, #155). I don't see a way to block on their internal goroutine's completion, but maybe informers will grow an API for that in the future.

* The metrics server also continues to run until CVO.Run() exits, where previously we began gracefully shutting it down at the same time we started shutting down CVO.Run(). This ensures we are still around and publishing any last-minute CVO.Run() changes.

* Leader election also continues to run until CVO.Run() exits. We don't want to release the lease while we're still controlling things.

* CVO.Run() and AutoUpdate.Run() both stop immediately when the passed-in context is canceled or we call runCancel internally (because of a TERM, an error from a goroutine, or loss of leadership). These are the only two goroutines that actually write to the API servers, so we want to shut them down as quickly as possible.

Drop an unnecessary runCancel() from the "shutting down" branch of the error collector. I'd added it in b30aa0e, but you can only ever get into the "shutting down" branch if runCancel has already been called. And fix the scoping for the shutdownTimer variable so we don't clear it on each for-loop iteration (oops :p, bug from b30aa0e).

Add some logging to the error collector, so it's easier to see where we are in the collection process from the operator logs. Also start logging collected goroutines by name, so we can figure out which may still be outstanding.

Set terminationGracePeriodSeconds 130 to extend the default 30s [3], to give the container the full two-minute graceful timeout window before the kubelet steps in with a KILL.

Push the Background() initialization all the way up to the command-line handler, to make it more obvious that the context is scoped to the whole 'start' invocation.

[1]: https://storage.googleapis.com/origin-ci-test/pr-logs/pull/25365/pull-ci-openshift-origin-master-e2e-gcp-upgrade/1289853267223777280/artifacts/e2e-gcp-upgrade/pods/openshift-cluster-version_cluster-version-operator-5b6ff896c6-57ppb_cluster-version-operator.log
[2]: https://bugzilla.redhat.com/show_bug.cgi?id=1843505#c7
[3]: https://kubernetes.io/docs/reference/generated/kubernetes-api/v1.18/#podspec-v1-core

squash! pkg/start: Release leader lease on graceful shutdown
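For orientation before the diffs, here is a minimal, standalone sketch of the shutdown orchestration described above: a run context canceled on the first TERM (or on any goroutine error), a longer-lived shutdown context capped by a two-minute timer, an immediate exit on a second TERM, and a collection loop that reaps goroutines by name. It uses only the Go standard library, and the names (asyncResult, runCtx, the "example worker") are illustrative stand-ins rather than the operator's actual code, which appears in pkg/start/start.go below.

```go
package main

import (
    "context"
    "log"
    "os"
    "os/signal"
    "syscall"
    "time"
)

// asyncResult mirrors the named-result reporting this commit adds: every
// goroutine reports back which one it was and whether it failed.
type asyncResult struct {
    name string
    err  error
}

func main() {
    // runCtx governs the workers that write to the API servers; it is the
    // first thing canceled on TERM or on an internal error.
    runCtx, runCancel := context.WithCancel(context.Background())
    defer runCancel()
    // shutdownCtx outlives runCtx so that last-minute work (final status
    // writes, metrics scrapes) can finish during the graceful window.
    shutdownCtx, shutdownCancel := context.WithCancel(context.Background())
    defer shutdownCancel()

    // First TERM starts a graceful shutdown; a second TERM exits immediately.
    sigs := make(chan os.Signal, 1)
    signal.Notify(sigs, os.Interrupt, syscall.SIGTERM)
    go func() {
        sig := <-sigs
        log.Printf("Shutting down due to %s", sig)
        runCancel()
        sig = <-sigs
        log.Fatalf("Received shutdown signal twice, exiting: %s", sig)
    }()

    results := make(chan asyncResult, 1)
    outstanding := 0

    // A stand-in worker; the real commit launches the CVO, the auto-update
    // controller, the metrics server, and the leader-election loop this way.
    outstanding++
    go func() {
        <-runCtx.Done()
        // Final cleanup is bounded by shutdownCtx rather than runCtx, so it
        // still has the graceful window available to it.
        select {
        case <-time.After(time.Second): // pretend to flush final status
        case <-shutdownCtx.Done():
        }
        results <- asyncResult{name: "example worker", err: nil}
    }()

    // Collection loop: reap every goroutine by name, switching to a
    // two-minute shutdown timer once the run context is done.
    var shutdownTimer *time.Timer
    for outstanding > 0 {
        log.Printf("Waiting on %d outstanding goroutines.", outstanding)
        if shutdownTimer == nil { // still running
            select {
            case <-runCtx.Done():
                log.Print("Run context completed; beginning graceful shutdown period.")
                shutdownTimer = time.NewTimer(2 * time.Minute)
            case result := <-results:
                outstanding--
                if result.err != nil {
                    log.Printf("Collected %s goroutine: %v", result.name, result.err)
                    runCancel() // an error anywhere begins shutdown everywhere
                } else {
                    log.Printf("Collected %s goroutine.", result.name)
                }
            }
        } else { // shutting down
            select {
            case <-shutdownTimer.C:
                shutdownCancel() // graceful window expired; force the rest down
            case result := <-results:
                outstanding--
                log.Printf("Collected %s goroutine (error: %v).", result.name, result.err)
            }
        }
    }
    log.Print("Finished collecting goroutines.")
}
```

Under this scheme the pod's termination grace period has to outlast the two-minute window, which is why the manifests below set terminationGracePeriodSeconds to 130: roughly the 120-second graceful window plus a few seconds of slack before the kubelet sends KILL.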
1 parent bcf385b · commit cc1921d

File tree: 7 files changed, +125 −110 lines


bootstrap/bootstrap-pod.yaml

Lines changed: 1 addition & 0 deletions

@@ -37,6 +37,7 @@ spec:
           fieldRef:
             fieldPath: spec.nodeName
   hostNetwork: true
+  terminationGracePeriodSeconds: 130
   volumes:
   - name: kubeconfig
     hostPath:

cmd/start.go

Lines changed: 4 additions & 1 deletion

@@ -1,6 +1,8 @@
 package main
 
 import (
+    "context"
+
     "github.com/spf13/cobra"
     "k8s.io/klog"
 
@@ -18,9 +20,10 @@ func init() {
         // To help debugging, immediately log version
         klog.Infof("%s", version.String)
 
-        if err := opts.Run(); err != nil {
+        if err := opts.Run(context.Background()); err != nil {
             klog.Fatalf("error: %v", err)
         }
+        klog.Info("Graceful shutdown complete.")
     },
 }
 
install/0000_00_cluster-version-operator_03_deployment.yaml

Lines changed: 1 addition & 0 deletions

@@ -57,6 +57,7 @@ spec:
       nodeSelector:
         node-role.kubernetes.io/master: ""
       priorityClassName: "system-cluster-critical"
+      terminationGracePeriodSeconds: 130
       tolerations:
       - key: "node-role.kubernetes.io/master"
         operator: Exists

pkg/autoupdate/autoupdate.go

Lines changed: 4 additions & 3 deletions

@@ -84,23 +84,24 @@ func New(
 }
 
 // Run runs the autoupdate controller.
-func (ctrl *Controller) Run(ctx context.Context, workers int) {
+func (ctrl *Controller) Run(ctx context.Context, workers int) error {
     defer utilruntime.HandleCrash()
     defer ctrl.queue.ShutDown()
 
     klog.Info("Starting AutoUpdateController")
     defer klog.Info("Shutting down AutoUpdateController")
 
     if !cache.WaitForCacheSync(ctx.Done(), ctrl.cacheSynced...) {
-        klog.Info("Caches never synchronized")
-        return
+        return fmt.Errorf("caches never synchronized: %w", ctx.Err())
     }
 
     for i := 0; i < workers; i++ {
+        // FIXME: actually wait until these complete if the Context is canceled.
         go wait.UntilWithContext(ctx, ctrl.worker, time.Second)
     }
 
     <-ctx.Done()
+    return nil
 }
 
 func (ctrl *Controller) eventHandler() cache.ResourceEventHandler {

pkg/cvo/cvo.go

Lines changed: 4 additions & 3 deletions

@@ -290,7 +290,7 @@ func loadConfigMapVerifierDataFromUpdate(update *payload.Update, clientBuilder s
 }
 
 // Run runs the cluster version operator until stopCh is completed. Workers is ignored for now.
-func (optr *Operator) Run(ctx context.Context, workers int) {
+func (optr *Operator) Run(ctx context.Context, workers int) error {
     defer utilruntime.HandleCrash()
     defer optr.queue.ShutDown()
     stopCh := ctx.Done()
@@ -300,8 +300,7 @@ func (optr *Operator) Run(ctx context.Context, workers int) {
     defer klog.Info("Shutting down ClusterVersionOperator")
 
     if !cache.WaitForCacheSync(stopCh, optr.cacheSynced...) {
-        klog.Info("Caches never synchronized")
-        return
+        return fmt.Errorf("caches never synchronized: %w", ctx.Err())
     }
 
     // trigger the first cluster version reconcile always
@@ -330,6 +329,8 @@ func (optr *Operator) Run(ctx context.Context, workers int) {
     // stop the queue, then wait for the worker to exit
     optr.queue.ShutDown()
     <-workerStopCh
+
+    return nil
 }
 
 func (optr *Operator) queueKey() string {

pkg/start/start.go

Lines changed: 92 additions & 99 deletions

@@ -10,7 +10,6 @@ import (
     "math/rand"
     "os"
     "os/signal"
-    "sync"
     "syscall"
     "time"
 
@@ -77,6 +76,11 @@ type Options struct {
     ResyncInterval time.Duration
 }
 
+type asyncResult struct {
+    name  string
+    error error
+}
+
 func defaultEnv(name, defaultValue string) string {
     env, ok := os.LookupEnv(name)
     if !ok {
@@ -101,7 +105,7 @@ func NewOptions() *Options {
     }
 }
 
-func (o *Options) Run() error {
+func (o *Options) Run(ctx context.Context) error {
     if o.NodeName == "" {
         return fmt.Errorf("node-name is required")
     }
@@ -137,29 +141,6 @@ func (o *Options) Run() error {
         return err
     }
 
-    // TODO: Kube 1.14 will contain a ReleaseOnCancel boolean on
-    // LeaderElectionConfig that allows us to have the lock code
-    // release the lease when this context is cancelled. At that
-    // time we can remove our changes to OnStartedLeading.
-    ctx, cancel := context.WithCancel(context.Background())
-    defer cancel()
-    ch := make(chan os.Signal, 1)
-    defer func() { signal.Stop(ch) }()
-    signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
-    go func() {
-        sig := <-ch
-        klog.Infof("Shutting down due to %s", sig)
-        cancel()
-
-        // exit after 2s no matter what
-        select {
-        case <-time.After(5 * time.Second):
-            klog.Fatalf("Exiting")
-        case <-ch:
-            klog.Fatalf("Received shutdown signal twice, exiting")
-        }
-    }()
-
     o.run(ctx, controllerCtx, lock)
     return nil
 }
@@ -186,13 +167,32 @@ func (o *Options) makeTLSConfig() (*tls.Config, error) {
     }), nil
 }
 
+// run launches a number of goroutines to handle manifest application,
+// metrics serving, etc. It continues operating until ctx.Done(),
+// and then attempts a clean shutdown limited by an internal context
+// with a two-minute cap. It returns after it successfully collects all
+// launched goroutines.
 func (o *Options) run(ctx context.Context, controllerCtx *Context, lock *resourcelock.ConfigMapLock) {
-    runContext, runCancel := context.WithCancel(ctx)
+    runContext, runCancel := context.WithCancel(ctx) // so we can cancel internally on errors or TERM
     defer runCancel()
-    shutdownContext, shutdownCancel := context.WithCancel(ctx)
+    shutdownContext, shutdownCancel := context.WithCancel(context.Background()) // extends beyond ctx
     defer shutdownCancel()
-    errorChannel := make(chan error, 1)
-    errorChannelCount := 0
+    postMainContext, postMainCancel := context.WithCancel(context.Background()) // extends beyond ctx
+    defer postMainCancel()
+
+    ch := make(chan os.Signal, 1)
+    defer func() { signal.Stop(ch) }()
+    signal.Notify(ch, os.Interrupt, syscall.SIGTERM)
+    go func() {
+        sig := <-ch
+        klog.Infof("Shutting down due to %s", sig)
+        runCancel()
+        sig = <-ch
+        klog.Fatalf("Received shutdown signal twice, exiting: %s", sig)
+    }()
+
+    resultChannel := make(chan asyncResult, 1)
+    resultChannelCount := 0
     if o.ListenAddr != "" {
         var tlsConfig *tls.Config
         if o.ServingCertFile != "" || o.ServingKeyFile != "" {
@@ -202,85 +202,92 @@ func (o *Options) run(ctx context.Context, controllerCtx *Context, lock *resourc
                 klog.Fatalf("Failed to create TLS config: %v", err)
             }
         }
-        errorChannelCount++
+        resultChannelCount++
         go func() {
-            errorChannel <- cvo.RunMetrics(runContext, shutdownContext, o.ListenAddr, tlsConfig)
+            err := cvo.RunMetrics(postMainContext, shutdownContext, o.ListenAddr, tlsConfig)
+            resultChannel <- asyncResult{name: "metrics server", error: err}
         }()
     }
 
-    exit := make(chan struct{})
-    exitClose := sync.Once{}
-
-    // TODO: when we switch to graceful lock shutdown, this can be
-    // moved back inside RunOrDie
-    // TODO: properly wire ctx here
-    go leaderelection.RunOrDie(context.TODO(), leaderelection.LeaderElectionConfig{
-        Lock:          lock,
-        LeaseDuration: leaseDuration,
-        RenewDeadline: renewDeadline,
-        RetryPeriod:   retryPeriod,
-        Callbacks: leaderelection.LeaderCallbacks{
-            OnStartedLeading: func(localCtx context.Context) {
-                controllerCtx.Start(runContext)
-                select {
-                case <-runContext.Done():
-                    // WARNING: this is not completely safe until we have Kube 1.14 and ReleaseOnCancel
-                    // and client-go ContextCancelable, which allows us to block new API requests before
-                    // we step down. However, the CVO isn't that sensitive to races and can tolerate
-                    // brief overlap.
-                    klog.Infof("Stepping down as leader")
-                    // give the controllers some time to shut down
-                    time.Sleep(100 * time.Millisecond)
-                    // if we still hold the leader lease, clear the owner identity (other lease watchers
-                    // still have to wait for expiration) like the new ReleaseOnCancel code will do.
-                    if err := lock.Update(localCtx, resourcelock.LeaderElectionRecord{}); err == nil {
-                        // if we successfully clear the owner identity, we can safely delete the record
-                        if err := lock.Client.ConfigMaps(lock.ConfigMapMeta.Namespace).Delete(localCtx, lock.ConfigMapMeta.Name, metav1.DeleteOptions{}); err != nil {
-                            klog.Warningf("Unable to step down cleanly: %v", err)
-                        }
-                    }
-                    klog.Infof("Finished shutdown")
-                    exitClose.Do(func() { close(exit) })
-                case <-localCtx.Done():
-                    // we will exit in OnStoppedLeading
-                }
-            },
-            OnStoppedLeading: func() {
-                klog.Warning("leaderelection lost")
-                exitClose.Do(func() { close(exit) })
+    informersDone := postMainContext.Done()
+    // FIXME: would be nice if there was a way to collect these.
+    controllerCtx.CVInformerFactory.Start(informersDone)
+    controllerCtx.OpenshiftConfigInformerFactory.Start(informersDone)
+    controllerCtx.OpenshiftConfigManagedInformerFactory.Start(informersDone)
+    controllerCtx.InformerFactory.Start(informersDone)
+
+    resultChannelCount++
+    go func() {
+        leaderelection.RunOrDie(postMainContext, leaderelection.LeaderElectionConfig{
+            Lock:            lock,
+            ReleaseOnCancel: true,
+            LeaseDuration:   leaseDuration,
+            RenewDeadline:   renewDeadline,
+            RetryPeriod:     retryPeriod,
+            Callbacks: leaderelection.LeaderCallbacks{
+                OnStartedLeading: func(_ context.Context) { // no need for this passed-through postMainContext, because goroutines we launch inside will use runContext
+                    resultChannelCount++
+                    go func() {
+                        err := controllerCtx.CVO.Run(runContext, 2)
+                        resultChannel <- asyncResult{name: "main operator", error: err}
+                    }()
+
+                    if controllerCtx.AutoUpdate != nil {
+                        resultChannelCount++
+                        go func() {
+                            err := controllerCtx.AutoUpdate.Run(runContext, 2)
+                            resultChannel <- asyncResult{name: "auto-update controller", error: err}
+                        }()
+                    }
+                },
+                OnStoppedLeading: func() {
+                    klog.Info("Stopped leading; shutting down.")
+                    runCancel()
+                },
             },
-        },
-    })
+        })
+        resultChannel <- asyncResult{name: "leader controller", error: nil}
+    }()
 
-    for errorChannelCount > 0 {
-        var shutdownTimer *time.Timer
+    var shutdownTimer *time.Timer
+    for resultChannelCount > 0 {
+        klog.Infof("Waiting on %d outstanding goroutines.", resultChannelCount)
         if shutdownTimer == nil { // running
            select {
            case <-runContext.Done():
+                klog.Info("Run context completed; beginning two-minute graceful shutdown period.")
                shutdownTimer = time.NewTimer(2 * time.Minute)
-            case err := <-errorChannel:
-                errorChannelCount--
-                if err != nil {
-                    klog.Error(err)
+            case result := <-resultChannel:
+                resultChannelCount--
+                if result.error == nil {
+                    klog.Infof("Collected %s goroutine.", result.name)
+                } else {
+                    klog.Errorf("Collected %s goroutine: %v", result.name, result.error)
                    runCancel() // this will cause shutdownTimer initialization in the next loop
                }
+                if result.name == "main operator" {
+                    postMainCancel()
+                }
            }
        } else { // shutting down
            select {
            case <-shutdownTimer.C: // never triggers after the channel is stopped, although it would not matter much if it did because subsequent cancel calls do nothing.
                shutdownCancel()
                shutdownTimer.Stop()
-            case err := <-errorChannel:
-                errorChannelCount--
-                if err != nil {
-                    klog.Error(err)
-                    runCancel()
+            case result := <-resultChannel:
+                resultChannelCount--
+                if result.error == nil {
+                    klog.Infof("Collected %s goroutine.", result.name)
+                } else {
+                    klog.Errorf("Collected %s goroutine: %v", result.name, result.error)
+                }
+                if result.name == "main operator" {
+                    postMainCancel()
                }
            }
        }
    }
-
-    <-exit
+    klog.Info("Finished collecting operator goroutines.")
 }
 
 // createResourceLock initializes the lock.
@@ -440,17 +447,3 @@ func (o *Options) NewControllerContext(cb *ClientBuilder) *Context {
     }
     return ctx
 }
-
-// Start launches the controllers in the provided context and any supporting
-// infrastructure. When ch is closed the controllers will be shut down.
-func (c *Context) Start(ctx context.Context) {
-    ch := ctx.Done()
-    go c.CVO.Run(ctx, 2)
-    if c.AutoUpdate != nil {
-        go c.AutoUpdate.Run(ctx, 2)
-    }
-    c.CVInformerFactory.Start(ch)
-    c.OpenshiftConfigInformerFactory.Start(ch)
-    c.OpenshiftConfigManagedInformerFactory.Start(ch)
-    c.InformerFactory.Start(ch)
-}
