🎨 No more long running http requests while stopping services #8531
Codecov Report
❌ Patch coverage is
@@ Coverage Diff @@
## master #8531 +/- ##
==========================================
- Coverage 87.52% 87.32% -0.21%
==========================================
Files 2009 1578 -431
Lines 78515 65660 -12855
Branches 1344 682 -662
==========================================
- Hits 68721 57338 -11383
+ Misses 9392 8082 -1310
+ Partials 402 240 -162
Pull Request Overview
This PR refactors the dynamic service stop mechanism to use an asynchronous polling approach instead of long-running HTTP connections. The stop operation now returns immediately and monitors the service status until it reaches an idle state. For legacy services that may take up to 1 hour to stop, the dynamic-scheduler creates fire-and-forget tasks to handle the stopping process without blocking.
Key changes:
- Webserver now polls service status after initiating stop instead of waiting on a long HTTP connection
- Dynamic-scheduler uses fire-and-forget tasks for legacy service stops that may take extended time
- Removed timeout parameter from RPC interface as stops are now non-blocking
Reviewed Changes
Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.
Show a summary per file
File | Description |
---|---|
services/web/server/src/simcore_service_webserver/dynamic_scheduler/api.py | Implements polling mechanism to wait for service to become idle after stop request |
services/dynamic-scheduler/tests/unit/api_rpc/test_api_rpc__services.py | Removes timeout_s parameter from stop_dynamic_service test calls |
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/fire_and_froget.py | Adds new FireAndForgetCollection to manage background tasks |
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/director_v2/_public_client.py | Adds cleanup of public client from app state |
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/common_interface.py | Distinguishes legacy vs new-style services and uses fire-and-forget for legacy stops |
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/catalog/_public_client.py | Renames get_services_labels to get_service_labels |
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/core/events.py | Registers fire_and_forget_lifespan |
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/api/frontend/routes_external_scheduler/_service.py | Removes unused timeout_s parameter from service_stop |
services/director-v2/src/simcore_service_director_v2/core/dynamic_services_settings/scheduler.py | Updates comment to clarify timeout applies to LEGACY services |
services/director-v2/src/simcore_service_director_v2/api/routes/dynamic_services.py | Removes polling logic from stop endpoint as it's now handled by caller |
packages/service-library/src/servicelib/rabbitmq/rpc_interfaces/dynamic_scheduler/services.py | Removes timeout_s parameter from stop_dynamic_service RPC interface |
Pull Request Overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.
from .catalog._public_client import CatalogPublicClient
from .director_v2 import DirectorV2Client
from .service_tracker import set_request_as_running, set_request_as_stopped
from .fire_and_froget import FireAndForgetCollection
Copilot AI, Oct 21, 2025
Import path contains typo: 'froget' should be 'forget'.
Suggested change:
- from .fire_and_froget import FireAndForgetCollection
+ from .fire_and_forget import FireAndForgetCollection
this is still wrong
from ..services.deferred_manager import deferred_manager_lifespan
from ..services.director_v0 import director_v0_lifespan
from ..services.director_v2 import director_v2_lifespan
from ..services.fire_and_froget import fire_and_forget_lifespan
Copilot AI, Oct 21, 2025
Import path contains typo: 'froget' should be 'forget'.
Suggested change:
- from ..services.fire_and_froget import fire_and_forget_lifespan
+ from ..services.fire_and_forget import fire_and_forget_lifespan
this is still misspelled (the first part)
🐸
Thanks, I left some comments.
I recommend cleaning up the naming typos and checking the suggestions. I would also make sure this has good test coverage.
from .catalog._public_client import CatalogPublicClient
from .director_v2 import DirectorV2Client
from .service_tracker import set_request_as_running, set_request_as_stopped
from .fire_and_froget import FireAndForgetCollection
this is still wrong
The name of this file is misspelled: "froget".
def _wait_for_idle_retry_error(node_id: NodeID, retry_state: RetryCallState):
    _logger.info(
TIP: you are hooking logs to before_sleep and retry_error_callback.
Note that tenacity has built-in helpers for exactly that: before_log, after_log and before_sleep_log.
https://tenacity.readthedocs.io/en/latest/#before-and-after-retry-and-logging
thanks
tracked_service = await get_tracked_service(app, dynamic_service_stop.node_id)

if tracked_service and tracked_service.dynamic_service_start:
    service_labels = await CatalogPublicClient.get_from_app_state(
if the service is there, why do you need to ask the catalog about these labels?
settings.DYNAMIC_SCHEDULER_STOP_SERVICE_TIMEOUT.total_seconds()
),
)
before_sleep=partial(
hmm you are maybe looking for before_sleep_log?
What do these changes do?
Changes how the stop procedure works. It no longer involves long-running HTTP connections.
Stop returns immediately, and the status of the service is monitored until it becomes idle, marking the removal of the service. This approach works for both legacy and new-style dynamic services.
The dynamic-scheduler now creates a fire_and_forget task to deal with stopping legacy services, since it is still required to wait up to 1 hour for these to finish stopping.
Related issue/s
How to test
Dev-ops