- 
                Notifications
    You must be signed in to change notification settings 
- Fork 1.2k
✨ Add support for controller-manager webhook shutdown delay #2601
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changes from all commits
9501679
              bf963c8
              f9393cb
              12f9864
              5c3c96e
              4bd1de4
              384d8ce
              2b4c651
              3529110
              dca33cb
              9351414
              File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
| Original file line number | Diff line number | Diff line change | 
|---|---|---|
|  | @@ -101,6 +101,10 @@ type Options struct { | |
|  | ||
| // WebhookMux is the multiplexer that handles different webhooks. | ||
| WebhookMux *http.ServeMux | ||
|  | ||
| // ShutdownDelay delays server shutdown to wait for clients to stop opening | ||
| // new connections before closing server listeners. Defaults to 0. | ||
| ShutdownDelay time.Duration | ||
| } | ||
|  | ||
| // NewServer constructs a new webhook.Server from the provided options. | ||
|  | @@ -246,6 +250,15 @@ func (s *DefaultServer) Start(ctx context.Context) error { | |
| idleConnsClosed := make(chan struct{}) | ||
| go func() { | ||
| <-ctx.Done() | ||
|  | ||
| // Disable HTTP keep-alives to close persistent connections after next | ||
| // HTTP request. Clients may reconnect until routes are updated and | ||
| // server listeners are closed however this should start gradually | ||
| // migrating clients to server instances that are not about to shutdown | ||
| srv.SetKeepAlivesEnabled(false) | ||
| 
      Comment on lines
    
      +254
     to 
      +258
    
   There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This seems a good addition | ||
| // Wait before shutting down webhook server | ||
| time.Sleep(s.Options.ShutdownDelay) | ||
| 
      Comment on lines
    
      +259
     to 
      +260
    
   There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Why is this needed? Reading the intent of the PR, there is a context with a timeout that will be given to  There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Although  This comment speaks more about how kube-proxy iptables are involved in webhook connectivity: #2601 (comment) | ||
|  | ||
| log.Info("Shutting down webhook server with timeout of 1 minute") | ||
|  | ||
| ctx, cancel := context.WithTimeout(context.Background(), 1*time.Minute) | ||
|  | ||
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Open question:
Would it make sense to set a > 0 default, eg. 15s which is less than the 30 seconds default graceful shutdown delay?
Uh oh!
There was an error while loading. Please reload this page.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's a tricky one for sure, but because it's so specific to the Kubernetes cluster you're running on it doesn't feel right to align to any particular cluster or cloud provider (e.g. I'm not sure what value EKS and AKS use for example).
IMO we should keep it at 0 for backwards compatibility and to ensure that connection draining 'doesn't work' consistently across clusters by default. I think this is also more likely to encourage applications that care a lot about this (e.g. Gatekeeper) to expose a flag to change this to make it easier to reconfigure for the cluster you're running on (i.e. push the decision onto users of controller-runtime since an optimal default isn't clear).
No problems with changing to 15 seconds by default though if you were leaning towards that (that would be better for us anyway because we're using GKE 😄). We could alternatively align to the kube-proxy default of 1 second? But on GKE at least this won't help much.