Skip to content

Conversation

GitHK
Copy link
Contributor

@GitHK GitHK commented Oct 20, 2025

What do these changes do?

Changes how the stop procedure works. It no longer involves long running http connections.
Stop returns immediately and the status of the service is monitor until it becomes idle, marking the removal of the service.
This approach works for both legacy and new style dynamic services.

The dynamic-scheduler now creates a fire_and_forget task to deal with stopping legacy services, since it is still required to wait to up to 1 hour for these to finish stopping.

Related issue/s

How to test

Dev-ops

@GitHK GitHK self-assigned this Oct 20, 2025
@GitHK GitHK added this to the Imparable milestone Oct 20, 2025
Copy link

codecov bot commented Oct 20, 2025

Codecov Report

❌ Patch coverage is 68.62745% with 16 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.32%. Comparing base (a7cae35) to head (c86c138).

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8531      +/-   ##
==========================================
- Coverage   87.52%   87.32%   -0.21%     
==========================================
  Files        2009     1578     -431     
  Lines       78515    65660   -12855     
  Branches     1344      682     -662     
==========================================
- Hits        68721    57338   -11383     
+ Misses       9392     8082    -1310     
+ Partials      402      240     -162     
Flag Coverage Δ
integrationtests 63.90% <30.76%> (-0.05%) ⬇️
unittests 85.78% <68.62%> (-0.46%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 70.94% <ø> (ø)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 84.95% <ø> (ø)
agent 93.10% <ø> (ø)
api_server 91.62% <ø> (ø)
autoscaling 95.00% <ø> (ø)
catalog 92.06% <ø> (ø)
clusters_keeper 99.14% <ø> (ø)
dask_sidecar 92.38% <ø> (ø)
datcore_adapter 97.95% <ø> (ø)
director 75.72% <ø> (ø)
director_v2 90.88% <ø> (-0.02%) ⬇️
dynamic_scheduler 96.54% <81.57%> (-0.28%) ⬇️
dynamic_sidecar 90.44% <ø> (ø)
efs_guardian 89.83% <ø> (ø)
invitations 90.90% <ø> (ø)
payments 92.80% <ø> (ø)
resource_usage_tracker 91.79% <ø> (-0.43%) ⬇️
storage 86.97% <ø> (+0.08%) ⬆️
webclient ∅ <ø> (∅)
webserver 87.06% <30.76%> (-0.02%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a7cae35...c86c138. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Contributor

mergify bot commented Oct 20, 2025

🧪 CI Insights

Here's what we observed from your CI run for c86c138.

✅ Passed Jobs With Interesting Signals

Pipeline Job Signal Health on master Retries 🔍 CI Insights 📄 Logs
CI system-tests Base branch is broken, but the job passed. Looks like this might be a real fix 💪 Broken 0 View View

@GitHK GitHK added the t:maintenance Some planned maintenance work label Oct 20, 2025
@GitHK GitHK added the bug buggy, it does not work as expected label Oct 21, 2025
@GitHK GitHK requested a review from Copilot October 21, 2025 08:54
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR refactors the dynamic service stop mechanism to use an asynchronous polling approach instead of long-running HTTP connections. The stop operation now returns immediately and monitors the service status until it reaches an idle state. For legacy services that may take up to 1 hour to stop, the dynamic-scheduler creates fire-and-forget tasks to handle the stopping process without blocking.

Key changes:

  • Webserver now polls service status after initiating stop instead of waiting on a long HTTP connection
  • Dynamic-scheduler uses fire-and-forget tasks for legacy service stops that may take extended time
  • Removed timeout parameter from RPC interface as stops are now non-blocking

Reviewed Changes

Copilot reviewed 11 out of 11 changed files in this pull request and generated 4 comments.

Show a summary per file
File Description
services/web/server/src/simcore_service_webserver/dynamic_scheduler/api.py Implements polling mechanism to wait for service to become idle after stop request
services/dynamic-scheduler/tests/unit/api_rpc/test_api_rpc__services.py Removes timeout_s parameter from stop_dynamic_service test calls
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/fire_and_froget.py Adds new FireAndForgetCollection to manage background tasks
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/director_v2/_public_client.py Adds cleanup of public client from app state
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/common_interface.py Distinguishes legacy vs new-style services and uses fire-and-forget for legacy stops
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/services/catalog/_public_client.py Renames get_services_labels to get_service_labels
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/core/events.py Registers fire_and_forget_lifespan
services/dynamic-scheduler/src/simcore_service_dynamic_scheduler/api/frontend/routes_external_scheduler/_service.py Removes unused timeout_s parameter from service_stop
services/director-v2/src/simcore_service_director_v2/core/dynamic_services_settings/scheduler.py Updates comment to clarify timeout applies to LEGACY services
services/director-v2/src/simcore_service_director_v2/api/routes/dynamic_services.py Removes polling logic from stop endpoint as it's now handled by caller
packages/service-library/src/servicelib/rabbitmq/rpc_interfaces/dynamic_scheduler/services.py Removes timeout_s parameter from stop_dynamic_service RPC interface

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

Andrei Neagu added 2 commits October 21, 2025 10:58
Copy link

@GitHK GitHK requested a review from Copilot October 21, 2025 09:05
Copy link
Contributor

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 3 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

from .catalog._public_client import CatalogPublicClient
from .director_v2 import DirectorV2Client
from .service_tracker import set_request_as_running, set_request_as_stopped
from .fire_and_froget import FireAndForgetCollection
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import path contains typo: 'froget' should be 'forget'.

Suggested change
from .fire_and_froget import FireAndForgetCollection
from .fire_and_forget import FireAndForgetCollection

Copilot uses AI. Check for mistakes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still wrong

from ..services.deferred_manager import deferred_manager_lifespan
from ..services.director_v0 import director_v0_lifespan
from ..services.director_v2 import director_v2_lifespan
from ..services.fire_and_froget import fire_and_forget_lifespan
Copy link

Copilot AI Oct 21, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Import path contains typo: 'froget' should be 'forget'.

Suggested change
from ..services.fire_and_froget import fire_and_forget_lifespan
from ..services.fire_and_forget import fire_and_forget_lifespan

Copilot uses AI. Check for mistakes.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still mispelled (the first part)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🐸

@GitHK GitHK changed the title 🎨 Changed how service stop works 🎨 No more long running http requests while stopping services Oct 21, 2025
@GitHK GitHK marked this pull request as ready for review October 21, 2025 09:36
@GitHK GitHK requested a review from bisgaard-itis October 21, 2025 09:37
@GitHK GitHK requested a review from giancarloromeo October 21, 2025 09:37
Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thx. Left some comments.

I recommend you to cleanup naming typos, scheck suggestions and I would also make sure has a good test coverage

from ..services.deferred_manager import deferred_manager_lifespan
from ..services.director_v0 import director_v0_lifespan
from ..services.director_v2 import director_v2_lifespan
from ..services.fire_and_froget import fire_and_forget_lifespan
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still mispelled (the first part)

from .catalog._public_client import CatalogPublicClient
from .director_v2 import DirectorV2Client
from .service_tracker import set_request_as_running, set_request_as_stopped
from .fire_and_froget import FireAndForgetCollection
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is still wrong

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

name of this file "froget"



def _wait_for_idle_retry_error(node_id: NodeID, retry_state: RetryCallState):
_logger.info(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TIP: you are hooking logs to before_sleep and retry_error_callback

Note that tenacity has paramters for exaclyt that called: before_log, after_log and before_sleep_log

https://tenacity.readthedocs.io/en/latest/#before-and-after-retry-and-logging

Copy link
Member

@sanderegg sanderegg left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks

tracked_service = await get_tracked_service(app, dynamic_service_stop.node_id)

if tracked_service and tracked_service.dynamic_service_start:
service_labels = await CatalogPublicClient.get_from_app_state(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if the service is there, why do you need to ask the catalog about these labels?

settings.DYNAMIC_SCHEDULER_STOP_SERVICE_TIMEOUT.total_seconds()
),
)
before_sleep=partial(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

hmm you are maybe looking for before_sleep_log?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug buggy, it does not work as expected t:maintenance Some planned maintenance work

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants