Use more aggressive server shutdown and resequence termination #112
Using `Shutdown` with an already cancelled context will cause the method to return almost immediately, with only idle connections being closed. More problematically, it can't close active connections, which can remain in flight indefinitely. At the moment, these active connections (and their associated contexts) can cause `loader.load()` to block `loader.run()` from exiting, especially if a backend is misbehaving, which can cause shutdown to stall while waiting on the request. Even if a backend isn't misbehaving, an inference request can take many seconds.

The best solution would be to make `loader.load()` unblock if the context passed to `loader.run()` is cancelled, but this is fairly complicated to implement. The easier solution for now is just to use a hard server `Close()` to cancel in-flight requests (and their contexts) and then wait for scheduler shutdown. This is what we do in Docker Desktop.