Implementation of python async engine, which decouples receiving requests and sending responses in the python engine #2758
Conversation
self.loop = None
# Todo: for async mode we should maybe consider

def receive_requests(self):
I'm wondering if we should add a graceful shutdown mechanism to the receive_requests and send_responses threads, right now it is unclear how these threads would behave if the main proc is abruptly shut down
yes, definitely. We need graceful shutdown in a few places:
- Python Engine - I think if the python process crashes or is terminated, then on the front-end we should receive that signal and send a response for all incomplete outstanding requests. I'd probably put this logic in AsyncRequestManager, similar to how we have something for RollingBatch called shutdown
- The vllm_handler - in local testing, I've triggered a few unrecoverable engine errors that at least trigger a python restart, but don't clean up the memory/resources used by vLLM. It would also be great to have some sort of internal health check.
I think I can tackle 1 in this PR, but 2 may need some more thought.
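For the graceful-shutdown piece, here is a minimal sketch of what a stop signal for the receive/send worker threads could look like, assuming a queue-based worker loop. All names here are illustrative, not the actual djl-serving API:

```python
import queue
import threading

# Sentinel pushed into the queue to unblock a worker stuck in get().
_SHUTDOWN = object()


class AsyncWorker:
    """Hypothetical worker mirroring receive_requests/send_responses."""

    def __init__(self):
        self.requests = queue.Queue()
        self.stop_event = threading.Event()
        self.thread = threading.Thread(target=self._run, daemon=True)

    def start(self):
        self.thread.start()

    def _run(self):
        while not self.stop_event.is_set():
            item = self.requests.get()
            if item is _SHUTDOWN:
                break
            self._handle(item)

    def _handle(self, item):
        pass  # process a request / send a response

    def shutdown(self, timeout=5.0):
        # Signal the loop, unblock the blocking get(), wait for exit.
        self.stop_event.set()
        self.requests.put(_SHUTDOWN)
        self.thread.join(timeout)
```

The sentinel is needed because `stop_event` alone cannot wake a thread blocked inside `Queue.get()`.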
I need to do a bit more here, but for now what I've done is added a terminateInFlightRequests method that gets invoked when the python engine crashes. If there are requests being handled by the engine, and the engine crashes, then this ensures we send a response to the client.
The next biggest area I want to figure out is how we know when the python engine has crashed. Right now we're only handling cases where the actual Connection dies; if vLLM starts to hang, we don't currently have a timeout mechanism.
I'd like to figure that out in a follow-on PR, since the basic use-case here is working and the PR is already quite large. This feature is also currently guarded by a feature flag that is off by default. Thoughts @ethnzhng ?
Sounds good. Some potential ideas for when we get to it:
- we can have a health state for the service, and base this off a combo of:
  - timeouts for individual handler invocations
  - how the overall queue is moving
- then we can have a graceful shutdown / cleanup path for the vLLM service, which we could also maybe use in the fatal error case
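A rough sketch of what that health state could look like, combining per-invocation timeouts with queue-progress tracking. `HealthState` and its thresholds are assumptions for illustration, not the actual service API:

```python
import time


class HealthState:
    """Illustrative health tracker: the service is unhealthy if any
    in-flight invocation exceeds a timeout, or if there are in-flight
    requests but none has completed recently (queue is stalled)."""

    def __init__(self, invocation_timeout=30.0, queue_stall_timeout=60.0):
        self.invocation_timeout = invocation_timeout
        self.queue_stall_timeout = queue_stall_timeout
        self.in_flight = {}  # request_id -> start time
        self.last_progress = time.monotonic()

    def on_start(self, request_id):
        self.in_flight[request_id] = time.monotonic()

    def on_complete(self, request_id):
        self.in_flight.pop(request_id, None)
        self.last_progress = time.monotonic()

    def is_healthy(self):
        now = time.monotonic()
        slow = any(now - t > self.invocation_timeout
                   for t in self.in_flight.values())
        stalled = bool(self.in_flight) and \
            now - self.last_progress > self.queue_stall_timeout
        return not (slow or stalled)
```

A periodic check of `is_healthy()` could then drive the graceful shutdown / cleanup path mentioned above.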
Implementation of python async engine, which decouples receiving requests and sending responses in the python engine
Description
This PR implements an async/decoupled Python engine. This will allow for asynchronous code in the python handler, and enable integrations with LLM engines at a higher level than we do today. It's not exclusive to LLM integration, but that is the main purpose.
The goal of this implementation is to integrate more cleanly with LLM inference engines, which have largely adopted the async/await paradigm in their public APIs. It will ideally replace the RollingBatch implementation we have today. Initial tests with vLLM 0.7.1 show a 5-7% performance improvement, all of which can be attributed to overhead in our current implementation.
When the frontend receives a request, it sends it to the Python backend and does not expect a response. When the Python backend completes a response, it sends it back to the frontend. The frontend maintains a mapping that allows it to associate the response from python with the client request, so that it can send the response to the appropriate connection.
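A minimal sketch of that frontend-side correlation, assuming each request carries a unique id. The class and method names below are illustrative (loosely modeled on the AsyncRequestManager and terminateInFlightRequests mentioned in the review discussion), not the actual implementation:

```python
class AsyncRequestManager:
    """Hypothetical request/response correlation for the frontend."""

    def __init__(self):
        self.pending = {}  # request_id -> client connection

    def register(self, request_id, connection):
        # Called when a request is forwarded to the python backend.
        self.pending[request_id] = connection

    def on_response(self, request_id, payload):
        # Look up the originating connection and reply to it.
        connection = self.pending.pop(request_id, None)
        if connection is not None:
            connection.send(payload)

    def terminate_in_flight_requests(self, error_payload):
        # Invoked when the python engine crashes: answer every
        # outstanding request so no client is left hanging.
        for connection in list(self.pending.values()):
            connection.send(error_payload)
        self.pending.clear()
```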
This has been tested via a unit test, and sample vLLM handler.
This is the initial PR, and improvements will be tackled in subsequent PRs. At the moment, this supports:
The follow-up PRs will address:
Type of change
Checklist:
pytest tests.py -k "TestCorrectnessLmiDist" -m "lmi_dist"
Feature/Issue validation/testing