Thoughts on failure reporting in ServiceGroup:run()

A confession from a noob: I did not use ServiceLifecycle in the intended manner.  "Mea culpa, mea culpa, mea maxima culpa", but hear me out...

My application has two services: `GoodService` and `BadService`.  The former does everything expected by ServiceLifecycle.  The latter ignores cancellation and shutdown requests.  The code is quite vanilla and follows the [Readme](https://github.com/swift-server/swift-service-lifecycle/blob/main/README.md).
```swift
let serviceGroup = ServiceGroup(services: [GoodService(), BadService()], gracefulShutdownSignals: [.sigterm], logger: logger)

do {
    logger.info("Starting Server on port \(port) with pid \(pid ?? "unspecified")")
    try await serviceGroup.run()
} catch {
    logger.error("Exception while running: \(error.localizedDescription)")
}
```

**Scenario: BadService throws an exception**: ServiceLifecycle catches the exception and cancels GoodService.  The exception is rethrown and my code catches it.  ✅

**Scenario: GoodService throws an exception**: ServiceLifecycle catches the exception and tries to cancel BadService, which ignores the cancellation.  The ServiceGroup then waits indefinitely.  No errors are logged and no exceptions are thrown.  The server continues to run.  ❎❎
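To make this scenario concrete, here is a minimal standalone sketch using only Swift's structured concurrency, not ServiceLifecycle itself.  All names (`Flag`, `StartupFailure`, `runGroupWithStubbornChild`) are hypothetical, and the "bad" child is bounded so the example terminates; a real `BadService` that never returns would keep the group waiting forever.

```swift
/// Hypothetical helper: records that the stubborn child ran to completion.
actor Flag {
    private(set) var isSet = false
    func set() { isSet = true }
}

/// Hypothetical error standing in for whatever GoodService throws.
struct StartupFailure: Error {}

/// Runs one failing child and one child that never checks for cancellation.
/// The error is rethrown only after the stubborn child has finished on its own.
func runGroupWithStubbornChild(finished: Flag) async throws {
    try await withThrowingTaskGroup(of: Void.self) { group in
        // Stand-in for GoodService: fails right away.
        group.addTask { throw StartupFailure() }

        // Stand-in for BadService: ignores cancellation and keeps working.
        group.addTask {
            for _ in 0..<1_000 { await Task.yield() }
            await finished.set()
        }

        // Throws the child's error; leaving the closure cancels the remaining
        // children, but the group still waits for them to return.
        for try await _ in group {}
    }
}
```

The point of the sketch is that cancellation is cooperative: the group cannot return until every child has returned, so a child that ignores cancellation also delays (or blocks) the error from reaching the caller.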

Obviously, the second case would benefit from using `ServiceGroupConfiguration` and setting the `maximumCancellationDuration` and `maximumGracefulShutdownDuration`.  It might be worth calling attention to the existence of the configuration in either the readme or "How to adopt ServiceLifecycle in ____" articles.  

**Scenario: With durations specified, GoodService throws an exception**: Now the server exits with "Fatal error: Cancellation took longer than allowed by the configuration."  That is better, but the original exception never reaches the application or the log, so the information most useful for debugging is lost. ❎

**Scenario: I have two good services and both throw an exception**: From my read of the code, only one exception is passed to the application. Again, potentially valuable debugging information is lost. ❎
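As an illustration of what "not losing the second exception" could look like, here is a minimal standalone sketch that runs several operations concurrently and returns every error thrown.  It is not how ServiceLifecycle works; `DemoError` and `runCollectingErrors` are hypothetical names, and unlike `ServiceGroup` this sketch does not cancel the remaining operations after the first failure.

```swift
/// Hypothetical error type, used only for this illustration.
struct DemoError: Error {
    let service: String
}

/// Runs every operation concurrently and returns all errors that were thrown,
/// instead of surfacing only the first one.
func runCollectingErrors(
    _ operations: [@Sendable () async throws -> Void]
) async -> [any Error] {
    await withTaskGroup(of: (any Error)?.self) { group in
        for operation in operations {
            group.addTask {
                do {
                    try await operation()
                    return nil      // finished without error
                } catch {
                    return error    // hand the error back as a value
                }
            }
        }

        // Gather the non-nil results in completion order.
        var errors: [any Error] = []
        for await error in group {
            if let error { errors.append(error) }
        }
        return errors
    }
}
```

Because each child converts its failure into a return value, no error is discarded when two services fail at about the same time.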

Assuming everything I've said above is correct or mostly correct, let me bravely attempt to suggest a correction.  A logger is already being passed to `ServiceGroup`.  May I suggest that
- Upon entering run(), a message be logged with the number of services (perhaps .info).  This is optional as it could be done instead by the application author.
- Upon cancellation, a message be logged indicating that cancellation has been requested (perhaps .info).  Again optional if triggered by the application author.  But if ServiceGroup is initiating the cancellation, I'd like to know why.
- Upon encountering an exception, that exception should be logged immediately with a message that the other members of the group are being cancelled (perhaps .error)
- If cancellation takes a long time (say a minute), a message be logged indicating which services are still running.  This could repeat or not.  Simply having it could be very helpful. (perhaps .notice)
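
A rough standalone sketch of the first three suggestions, to show the shape of what I mean.  It does not touch ServiceLifecycle or swift-log: `EventLog`, `NamedService`, and `runLogged` are hypothetical stand-ins, and the "still running" report from the last bullet is not covered.

```swift
/// Hypothetical stand-in for a logger; records messages so they can be inspected.
actor EventLog {
    private(set) var messages: [String] = []
    func record(_ level: String, _ message: String) {
        messages.append("[\(level)] \(message)")
    }
}

/// Hypothetical named unit of work, loosely modelled on a service.
struct NamedService: Sendable {
    let name: String
    let run: @Sendable () async throws -> Void
}

/// Runs all services, logging on entry and at the moment any service fails.
func runLogged(_ services: [NamedService], log: EventLog) async throws {
    await log.record("info", "starting \(services.count) service(s)")

    try await withThrowingTaskGroup(of: Void.self) { group in
        for service in services {
            group.addTask {
                do {
                    try await service.run()
                } catch is CancellationError {
                    // A sibling failed first and this service honoured the cancellation.
                    await log.record("info", "\(service.name) stopped after cancellation")
                    throw CancellationError()
                } catch {
                    // Log at the point of failure, so the cause is on record even
                    // if the group later hangs or the process is terminated.
                    await log.record("error", "\(service.name) failed: \(error); cancelling the others")
                    throw error
                }
            }
        }

        // Rethrows the first failure; leaving the closure cancels the rest.
        for try await _ in group {}
    }
}
```

With something along these lines, the original failure would already be in the log before any cancellation timeout has a chance to end the process.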
