Description
Recently, in a real-world server that has 11 Service instances running in a ServiceGroup, I had an issue where one of the services exited prematurely, which triggered the whole group's cancellation. Unfortunately, some of the services swallowed the cancellation and never returned, leaving my server in a half-working state indefinitely.
Initially, the easy fix was to restart the server, while I investigated further.
I realized that what I needed was both to fix the non-canceling services and to set:
serviceGroupConfiguration.maximumGracefulShutdownDuration = .seconds(15)
serviceGroupConfiguration.maximumCancellationDuration = .seconds(60)
So if it happens again, the server won't be half-broken indefinitely, but will tear itself down after a finite amount of time. That's great.
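For context, here is a minimal sketch of where these settings live when building a group. The httpServer and backgroundWorker services, the logger label, and the surrounding entry point are assumptions for illustration, not my actual setup; the two timeout properties are the ones quoted above.

```swift
import ServiceLifecycle
import Logging

// Sketch only: httpServer and backgroundWorker are hypothetical values
// conforming to Service, and this code is assumed to run inside an async
// entry point (e.g. a @main type's `static func main() async throws`).
let logger = Logger(label: "my-server")

var serviceGroupConfiguration = ServiceGroupConfiguration(
    services: [
        .init(service: httpServer),
        .init(service: backgroundWorker),
    ],
    gracefulShutdownSignals: [.sigterm],
    logger: logger
)

// Bound how long teardown may take once the group starts shutting down,
// so a service that swallows cancellation can't keep the process alive forever.
serviceGroupConfiguration.maximumGracefulShutdownDuration = .seconds(15)
serviceGroupConfiguration.maximumCancellationDuration = .seconds(60)

let serviceGroup = ServiceGroup(configuration: serviceGroupConfiguration)
try await serviceGroup.run()
```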
While in some cases developers might want to keep a server running at all costs, even when a subset of their Service instances have exited, I think it's pretty common, especially in supervised environments (k8s, or VMs with a supervisor process), to just let the program exit when it's clearly broken with no chance to self-recover, since a restart will bring it back up healthy again.
To that end, I think we should document these timeouts in https://swiftpackageindex.com/swift-server/swift-service-lifecycle/2.8.0/documentation/servicelifecycle/how-to-adopt-servicelifecycle-in-applications
We should even consider recommending these timeouts for new projects. I realize we can't change the defaults now, but if we were doing the 1.0 today, I'd strongly argue for generous but finite timeouts by default. So let's at least document this now: I believe leaving these timeouts unset is the wrong practice for most servers and actually leads to less resilient deployments, because a half-broken server running indefinitely is worse than a restarted server that comes back healthy.