[SERVE] Parameterize app deletion timeout for ray serve #55542
Signed-off-by: omkar <[email protected]>
Code Review
This pull request aims to address test failures caused by a fixed application deletion timeout. It introduces a new environment variable RAY_SERVE_APP_DELETION_TIMEOUT_S to make this timeout configurable.
My review has identified a critical issue: the configurable timeout has been implemented in the deploy_apps method, which handles application deployment, instead of the delete_apps method where the timeout is still hardcoded. This seems to contradict the stated goal of the PR.
Additionally, I've suggested a minor improvement to move the environment variable handling into constants.py to align with the project's coding style and improve maintainability. Please review the detailed comments.
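As a rough illustration of the constants.py suggestion, the environment variable could be parsed in one place with the 60-second default preserved. This is only a sketch: the variable name RAY_SERVE_APP_DELETION_TIMEOUT_S comes from this PR, while the helper and constant names are hypothetical and not taken from Ray Serve's code.

```python
import os

# Hypothetical constants-style module. Only the env var name comes from the PR;
# the names below are illustrative assumptions, not Ray Serve's actual API.
DEFAULT_APP_DELETION_TIMEOUT_S = 60.0
ENV_VAR_NAME = "RAY_SERVE_APP_DELETION_TIMEOUT_S"


def read_app_deletion_timeout_s(environ=os.environ):
    """Return the deletion timeout in seconds, falling back to 60 when unset."""
    raw = environ.get(ENV_VAR_NAME)
    if raw is None or raw.strip() == "":
        return DEFAULT_APP_DELETION_TIMEOUT_S
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError(f"{ENV_VAR_NAME} must be positive, got {value}")
    return value
```

The deletion path (rather than the deployment path) would then consume this single value, so the default and the override stay in one place.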
Signed-off-by: omkar <[email protected]>
Signed-off-by: omkar <[email protected]>
Is this change related to _shared_serve_instance.delete_all_apps() failing in the serve_instance fixture used throughout our tests? I saw some tests erroring out because that method did not run successfully while serve_instance was being cleaned up at the end of the tests. Was wondering if it's related.
Signed-off-by: omkar <[email protected]>
Yes, reason is the |
Actually we can use that env variable to have more than 60 seconds timeouts which will help you get rid of the |
Why this change helps: This minimal change keeps production code behavior unchanged while allowing tests to specify appropriate timeouts for their execution mode, solving the test failures without affecting normal Ray Serve operations.
I don't understand, are you planning to set RAY_SERVE_APP_DELETION_TIMEOUT_S to a higher value? If the issue is that requests are blocked, they will block forever, so won't the test still fail after whatever timeout you set?
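The concern above can be illustrated with a toy polling loop on a simulated clock. This is a sketch under assumed behavior, not Ray Serve's deletion logic: the function names and the 90-second drain time are made up for the example. A larger timeout helps a replica that drains slowly, but not one whose request never completes.

```python
def wait_for_deletion(is_deleted, timeout_s, poll_interval_s=1.0):
    """Poll is_deleted(elapsed_s) on a simulated clock; True if it succeeds in time."""
    elapsed = 0.0
    while elapsed <= timeout_s:
        if is_deleted(elapsed):
            return True
        elapsed += poll_interval_s
    return False


def slow_drain(elapsed_s):
    # Hypothetical replica that finishes draining after 90 simulated seconds.
    return elapsed_s >= 90


def blocked_forever(elapsed_s):
    # Hypothetical replica whose in-flight request never completes.
    return False


print(wait_for_deletion(slow_drain, timeout_s=60))         # False
print(wait_for_deletion(slow_drain, timeout_s=120))        # True
print(wait_for_deletion(blocked_forever, timeout_s=120))   # False
print(wait_for_deletion(blocked_forever, timeout_s=3600))  # False
```

Whether the failing tests fall into the first or the second case is exactly what the question is asking.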
Why are these changes needed?
The problem: Tests were failing with "applications weren't deleted after 60s" because in certain deployment modes, Ray Serve cannot forcefully kill replicas - it must wait for all requests to complete and honor a mandatory draining period.
Why some modes are different: When replicas are exposed to external load balancers (AWS ALB, K8s Ingress), the system needs extra time for load balancers to detect unhealthy targets and stop routing traffic, unlike standard proxy mode where replicas can be force-killed.
The solution: I made the application deletion timeout configurable via the RAY_SERVE_APP_DELETION_TIMEOUT_S environment variable, defaulting to 60 seconds to preserve existing behavior.
Enabling different test configuration: Added this environment variable to the test configurations, where it can be set higher than 60 seconds so that tests run successfully across deployment modes and configurations that may require varying cleanup times.
Why this change helps: This minimal change keeps production code behavior unchanged while allowing tests to specify appropriate timeouts for their execution mode, solving the test failures without affecting normal Ray Serve operations.
Related issue number
Checks
git commit -s) in this PR.
scripts/format.sh to lint the changes in this PR.
method in Tune, I've added it in doc/source/tune/api/ under the corresponding .rst file.