[SERVE] Parameterize app deletion timeout for ray serve #55542

ok-scale · 2025-08-12T19:29:00Z

Why are these changes needed?

The problem: Tests were failing with "applications weren't deleted after 60s" because, in certain deployment modes, Ray Serve cannot forcefully kill replicas; it must wait for all requests to complete and honor a mandatory draining period.
Why some modes are different: When replicas are exposed to external load balancers (AWS ALB, K8s Ingress), the load balancers need extra time to detect unhealthy targets and stop routing traffic to them. In standard proxy mode, by contrast, replicas can be force-killed.
The solution: I made the application deletion timeout configurable via the RAY_SERVE_APP_DELETION_TIMEOUT_S environment variable, which defaults to 60 seconds to preserve existing behavior.
Enabling different test configurations: I added this environment variable to the test configurations, where it can be set higher than 60 seconds so that tests pass across deployment modes and configurations that need varying cleanup times.
Why this change helps: This minimal change keeps production code behavior unchanged while allowing tests to specify appropriate timeouts for their execution mode, solving the test failures without affecting normal Ray Serve operations.
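For illustration, a minimal sketch of how such an environment variable might be read. Only the variable name RAY_SERVE_APP_DELETION_TIMEOUT_S and the 60-second default come from this PR; the constant name, the helper function, and the validation rules are assumptions and are not taken from the actual diff.

```python
import os

# Hypothetical constant name; the 60-second default is the one stated in this PR.
DEFAULT_APP_DELETION_TIMEOUT_S = 60.0


def get_app_deletion_timeout_s(environ=None) -> float:
    """Return the app deletion timeout in seconds, falling back to the default.

    `environ` is injectable so the helper can be exercised without touching
    the real process environment (an assumption made for this sketch).
    """
    environ = os.environ if environ is None else environ
    raw = environ.get("RAY_SERVE_APP_DELETION_TIMEOUT_S")
    if raw is None or raw.strip() == "":
        return DEFAULT_APP_DELETION_TIMEOUT_S
    value = float(raw)  # raises ValueError on non-numeric input
    if value <= 0:
        raise ValueError("RAY_SERVE_APP_DELETION_TIMEOUT_S must be positive")
    return value
```

With the variable unset the helper returns 60.0, so existing behavior is preserved; a test configuration could export a larger value for modes that need a longer draining period.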

Related issue number

Checks

I've signed off every commit (by using the -s flag, i.e., git commit -s) in this PR.
I've run scripts/format.sh to lint the changes in this PR.
I've included any doc changes needed for https://docs.ray.io/en/master/.
- I've added any new APIs to the API Reference. For example, if I added a
  method in Tune, I've added it in doc/source/tune/api/ under the
  corresponding .rst file.
I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
Testing Strategy
- Unit tests
- Release tests
- This PR is not tested :(

Signed-off-by: omkar <[email protected]>

gemini-code-assist

Code Review

This pull request aims to address test failures caused by a fixed application deletion timeout. It introduces a new environment variable RAY_SERVE_APP_DELETION_TIMEOUT_S to make this timeout configurable.

My review has identified a critical issue: the configurable timeout has been implemented in the deploy_apps method, which handles application deployment, instead of the delete_apps method where the timeout is still hardcoded. This seems to contradict the stated goal of the PR.

Additionally, I've suggested a minor improvement to move the environment variable handling into constants.py to align with the project's coding style and improve maintainability. Please review the detailed comments.
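To make the distinction concrete, here is a hedged sketch of a deletion-side wait loop that takes the timeout as a parameter rather than hardcoding it. The function name, its arguments, and the error text are hypothetical and are not the actual code in client.py; the sketch only shows where a configurable timeout would have to be consumed for deletion to benefit from it.

```python
import time
from typing import Callable, Collection


def wait_for_apps_deleted(
    list_live_apps: Callable[[], Collection[str]],
    timeout_s: float,
    poll_interval_s: float = 0.01,
) -> None:
    """Poll until no applications remain, or raise once timeout_s has elapsed.

    `list_live_apps` is a hypothetical callback returning the names of the
    applications that still exist.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        remaining = list_live_apps()
        if not remaining:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(
                f"Some applications weren't deleted after {timeout_s}s: "
                f"{sorted(remaining)}"
            )
        time.sleep(poll_interval_s)
```

In this shape the caller would pass the value derived from RAY_SERVE_APP_DELETION_TIMEOUT_S (or a constant defined once in constants.py) as timeout_s. Note that an application that never finishes draining still raises, whatever finite timeout is chosen.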

python/ray/serve/_private/client.py

python/ray/serve/_private/client.py

landscapepainter

Is this change related to failing _shared_serve_instance.delete_all_apps() from the serve_instance fixture used throughout our tests? I saw some tests erroring out due to failing to successfully run that method when serve_instance is being cleaned after the end of the tests. Was wondering if it's related.

ok-scale · 2025-08-12T20:32:51Z

Is this change related to failing _shared_serve_instance.delete_all_apps() from the serve_instance fixture used throughout our tests? I saw some tests erroring out due to failing to successfully run that method when serve_instance is being cleaned after the end of the tests. Was wondering if it's related.

Yes, the cause is the hardcoded 60-second value, which is now configurable.

ok-scale · 2025-08-12T20:41:36Z

Is this change related to failing _shared_serve_instance.delete_all_apps() from the serve_instance fixture used throughout our tests? I saw some tests erroring out due to failing to successfully run that method when serve_instance is being cleaned after the end of the tests. Was wondering if it's related.

Actually, we can use that environment variable to set a timeout longer than 60 seconds, which will help you get rid of the delete_all_apps() errors.

zcin

Why this change helps: This minimal change keeps production code behavior unchanged while allowing tests to specify appropriate timeouts for their execution mode, solving the test failures without affecting normal Ray Serve operations.

I don't understand. Are you planning to set RAY_SERVE_APP_DELETION_TIMEOUT_S to a higher value? If the issue is that requests are blocked, they will block forever, so won't the test still fail after whatever timeout you set?

fix: updated compaction delay

20dc60c

Signed-off-by: omkar <[email protected]>

ok-scale requested a review from a team as a code owner August 12, 2025 19:29

gemini-code-assist bot reviewed Aug 12, 2025

python/ray/serve/_private/client.py (outdated, resolved)

python/ray/serve/_private/client.py (outdated, resolved)

ok-scale added 2 commits August 12, 2025 19:30

fix: updated compaction delay

1f665ce

Signed-off-by: omkar <[email protected]>

fix: updated compaction delay

967a187

Signed-off-by: omkar <[email protected]>

ok-scale requested review from abrarsheikh and zcin August 12, 2025 19:34

abrarsheikh reviewed Aug 12, 2025

python/ray/serve/_private/client.py (outdated, resolved)

landscapepainter reviewed Aug 12, 2025

fix: updated compaction delay

5da0bc0

Signed-off-by: omkar <[email protected]>

ok-scale requested review from abrarsheikh and landscapepainter August 12, 2025 20:33

ok-scale added the "go" label (add ONLY when ready to merge, run all tests) Aug 12, 2025

zcin reviewed Aug 13, 2025

ok-scale closed this Aug 14, 2025
