Skip to content

Conversation

ok-scale
Copy link
Contributor

@ok-scale ok-scale commented Aug 12, 2025

Why are these changes needed?

  1. The problem: Tests were failing with "applications weren't deleted after 60s" because in certain deployment modes, Ray Serve cannot forcefully kill replicas - it must wait for all requests to complete and honor a mandatory draining period.

  2. Why some modes are different: When replicas are exposed to external load balancers (AWS ALB, K8s Ingress), the system needs extra time for load balancers to detect unhealthy targets and stop routing traffic, unlike standard proxy mode where replicas can be force-killed.

  3. The solution: I made the application deletion timeout configurable via the RAY_SERVE_APP_DELETION_TIMEOUT_S environment variable, defaulting to 60 seconds to preserve existing behavior.

  4. Enabling different test configuration: Added this environment variable to the test configurations which can be set higher than 60 seconds allowing tests to run successfully across different deployment modes and configurations that may require varying cleanup times.

  5. Why this change helps: This minimal change keeps production code behavior unchanged while allowing tests to specify appropriate timeouts for their execution mode, solving the test failures without affecting normal Ray Serve operations.

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e., git commit -s) in this PR.
  • I've run scripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed for https://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it in doc/source/tune/api/ under the
      corresponding .rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures at https://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@ok-scale ok-scale requested a review from a team as a code owner August 12, 2025 19:29
Copy link
Contributor

@gemini-code-assist gemini-code-assist bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code Review

This pull request aims to address test failures caused by a fixed application deletion timeout. It introduces a new environment variable RAY_SERVE_APP_DELETION_TIMEOUT_S to make this timeout configurable.

My review has identified a critical issue: the configurable timeout has been implemented in the deploy_apps method, which handles application deployment, instead of the delete_apps method where the timeout is still hardcoded. This seems to contradict the stated goal of the PR.

Additionally, I've suggested a minor improvement to move the environment variable handling into constants.py to align with the project's coding style and improve maintainability. Please review the detailed comments.

@ok-scale ok-scale requested review from abrarsheikh and zcin August 12, 2025 19:34
Copy link
Contributor

@landscapepainter landscapepainter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this change related to failing _shared_serve_instance.delete_all_apps() from the serve_instance fixture used throughout our tests? I saw some tests erroring out due to failing to successfully run that method when serve_instance is being cleaned after the end of the tests. Was wondering if it's related.

@ok-scale
Copy link
Contributor Author

Is this change related to failing _shared_serve_instance.delete_all_apps() from the serve_instance fixture used throughout our tests? I saw some tests erroring out due to failing to successfully run that method when serve_instance is being cleaned after the end of the tests. Was wondering if it's related.

Yes, reason is the 60 seconds hardcoded value, which is now updated.

@ok-scale
Copy link
Contributor Author

Is this change related to failing _shared_serve_instance.delete_all_apps() from the serve_instance fixture used throughout our tests? I saw some tests erroring out due to failing to successfully run that method when serve_instance is being cleaned after the end of the tests. Was wondering if it's related.

Actually we can use that env variable to have more than 60 seconds timeouts which will help you get rid of the delete_all_apps() errors.

@ok-scale ok-scale added the go add ONLY when ready to merge, run all tests label Aug 12, 2025
Copy link
Contributor

@zcin zcin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why this change helps: This minimal change keeps production code behavior unchanged while allowing tests to specify appropriate timeouts for their execution mode, solving the test failures without affecting normal Ray Serve operations.

I don't understand, are you planning to set RAY_SERVE_APP_DELETION_TIMEOUT_S to a higher value? if the issue is that requests are blocked, they will block forever, then after whatever timeout you set, will still fail the test?

@ok-scale ok-scale closed this Aug 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
go add ONLY when ready to merge, run all tests
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants