@sanderegg
Member

@sanderegg sanderegg commented May 15, 2025

What do these changes do?

This pull request introduces a task lifecycle tracking system for Dask. It adds plugins that track task states and updates the Dask worker and scheduler configurations used in tests.

This should fix the issue described in #7629 by tracking the state of each Dask task through Dask's plugin facilities. Instead of relying solely on events that might be missed, the task states are kept in the dask-scheduler, where the director-v2 can retrieve them at any time. That information remains available until the dask-scheduler is shut down.
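
The difference between pushed events and stored state can be illustrated with a minimal sketch. Everything in it is hypothetical: the `TaskStateStore` class and its method names are not taken from this PR, and plain strings stand in for the task states. It only shows why keeping the latest state in one place lets a consumer catch up after missing intermediate events.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class TaskStateStore:
    """Hypothetical stand-in for state kept on the scheduler side."""

    _latest: dict[str, str] = field(default_factory=dict)

    def record(self, task_key: str, state: str) -> None:
        # Only the most recent state per task is kept.
        self._latest[task_key] = state

    def get_states(self, task_keys: list[str]) -> dict[str, str | None]:
        # Unknown tasks map to None instead of raising.
        return {key: self._latest.get(key) for key in task_keys}


store = TaskStateStore()
store.record("task-1", "waiting")
store.record("task-1", "processing")  # a consumer that was offline here loses nothing
store.record("task-1", "memory")

# A later query still sees the final state of task-1.
states = store.get_states(["task-1", "task-2"])
```

A consumer that polls `get_states` does not need to have observed every transition, which is the property the description relies on.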

Task Lifecycle Tracking

  • Added TaskLifecycleSchedulerPlugin and TaskLifecycleWorkerPlugin to track task state transitions on the scheduler and workers, respectively. These plugins log lifecycle events using the TaskLifeCycleState model.
  • Introduced mappings between Dask task states and RunningState to standardize state representation.
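
As a rough sketch of the two bullets above, the snippet below maps Dask scheduler task states to a running-state enum inside a transition hook. The `RunningState` members shown are a hypothetical subset, the mapping itself is an assumption rather than the one in this PR, and `LifecycleRecorder` is a duck-typed stand-in that only mirrors the `transition(key, start, finish, ...)` hook shape of Dask's plugin classes so that it runs without a cluster.

```python
from enum import Enum


class RunningState(str, Enum):
    # Hypothetical subset; the real model lives in the project's libraries.
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"
    UNKNOWN = "UNKNOWN"


# Assumed mapping from Dask scheduler task states to RunningState.
DASK_TO_RUNNING_STATE = {
    "waiting": RunningState.PENDING,
    "queued": RunningState.PENDING,
    "no-worker": RunningState.PENDING,
    "processing": RunningState.STARTED,
    "memory": RunningState.SUCCESS,
    "erred": RunningState.FAILED,
}


class LifecycleRecorder:
    """Duck-typed sketch of a plugin exposing a transition hook."""

    def __init__(self) -> None:
        self.events: list[tuple[str, RunningState]] = []

    def transition(self, key, start, finish, *args, **kwargs) -> None:
        # Called once per state change; unmapped states fall back to UNKNOWN.
        state = DASK_TO_RUNNING_STATE.get(finish, RunningState.UNKNOWN)
        self.events.append((key, state))


recorder = LifecycleRecorder()
for start, finish in [
    ("released", "waiting"),
    ("waiting", "processing"),
    ("processing", "memory"),
]:
    recorder.transition("task-1", start, finish)
```

In this sketch the events are appended to a list; a real plugin would presumably hand them to the scheduler or worker instead.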

Dask Configuration Updates

  • Updated Dask worker configurations to preload the TaskLifecycleWorkerPlugin for all worker types in tests.
  • Updated the Dask scheduler configuration to preload the TaskLifecycleSchedulerPlugin in tests.
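
Preloading can be sketched as follows. Dask preload modules may define a `dask_setup` function that receives the scheduler (or worker) at startup; the sketch registers a plugin from that hook. `FakeScheduler` and `TaskLifecyclePluginSketch` are hypothetical stand-ins so the snippet runs without a cluster, and the plugin name is made up.

```python
class TaskLifecyclePluginSketch:
    """Placeholder plugin; the real plugins are the ones added by this PR."""

    name = "task-lifecycle-sketch"


class FakeScheduler:
    """Stand-in exposing only the add_plugin method used below."""

    def __init__(self) -> None:
        self.plugins: dict[str, object] = {}

    def add_plugin(self, plugin, name=None) -> None:
        # Store the plugin under an explicit name or its own name attribute.
        self.plugins[name or plugin.name] = plugin


def dask_setup(scheduler) -> None:
    # Dask calls this hook when the module is listed as a preload.
    scheduler.add_plugin(TaskLifecyclePluginSketch())


scheduler = FakeScheduler()
dask_setup(scheduler)
```

With a real scheduler, the module containing `dask_setup` would be passed through a preload setting such as the `--preload` command-line option.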

Test and Dependency Updates

  • Updated tests to reflect the new plugin structure and lifecycle tracking functionality.
  • Added simcore-dask-task-models-library as a dependency in ci.txt and dev.txt for the clusters-keeper service for testing purposes.

Miscellaneous Improvements

  • Enhanced the _needs_manual_intervention function of the autoscaling monitoring script to include a time-based condition for container handling.

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Bazinga! milestone May 15, 2025
@sanderegg sanderegg self-assigned this May 15, 2025
@sanderegg sanderegg added a:director-v2 issue related with the director-v2 service a:dask-service Any of the dask services: dask-scheduler/sidecar or worker a:computational clusters labels May 15, 2025
@sanderegg sanderegg requested a review from Copilot May 15, 2025 16:15
@sanderegg sanderegg force-pushed the dask-sidecar/task-lifecycle-events branch from 6857668 to a954c4a Compare May 15, 2025 16:18
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces Dask plugins to publish task lifecycle events and integrates them into the worker and scheduler setups, along with corresponding test scaffolding and minor scripting adjustments.

  • Adds TaskLifecycleWorkerPlugin and TaskLifecycleSchedulerPlugin to emit lifecycle events for each task.
  • Integrates the new plugins into dask_setup for both worker and scheduler, updating error handling and teardown logs.
  • Updates tests to cover the new plugins (though some tests are still stubs) and adjusts the LocalCluster fixture to preload the scheduler plugin.
  • Tweaks a manual-intervention check in an autoscaling SSH script to include a time-based condition.

Reviewed Changes

Copilot reviewed 8 out of 8 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
services/dask-sidecar/tests/unit/test_scheduler.py Adds basic tests for task lifecycle event emission (with debug prints)
services/dask-sidecar/tests/unit/test_rabbitmq_plugin.py Adds a stub test and a failure scenario for RabbitMQ plugin init
services/dask-sidecar/tests/unit/conftest.py Configures LocalCluster to preload the scheduler plugin
services/dask-sidecar/src/simcore_service_dask_sidecar/worker.py Registers and handles errors for TaskLifecycleWorkerPlugin
services/dask-sidecar/src/simcore_service_dask_sidecar/scheduler.py Registers TaskLifecycleSchedulerPlugin and updates teardown log
services/dask-sidecar/src/simcore_service_dask_sidecar/task_life_cycle_worker_plugin.py New worker plugin emitting task transitions
services/dask-sidecar/src/simcore_service_dask_sidecar/task_life_cycle_scheduler_plugin.py New scheduler plugin emitting task transitions
scripts/maintenance/computational-clusters/autoscaled_monitor/ssh.py Extends manual-intervention logic with a 2-minute age check
Comments suppressed due to low confidence (4)

services/dask-sidecar/tests/unit/test_rabbitmq_plugin.py:21

  • This test is currently just a stub (ellipsis) and has no assertions; implement its body or remove it to avoid misleading coverage results.
def test_rabbitmq_plugin_initializes(dask_client: distributed.Client): ...

services/dask-sidecar/tests/unit/test_rabbitmq_plugin.py:33

  • The test only sleeps and does not assert any outcome; add assertions to verify that the worker actually closes or raises the expected exception.
async def test_dask_worker_closes_if_plugin_fails_on_start(

scripts/maintenance/computational-clusters/autoscaled_monitor/ssh.py:269

  • The time comparison containers[0].created_at - now > 2min is inverted. To check if a container is older than 2 minutes, use arrow.utcnow().datetime - containers[0].created_at > timedelta(...).
needs_manual_intervention=_needs_manual_intervention(containers) and ((containers[0].created_at - arrow.utcnow().datetime) > datetime.timedelta(minutes=2))
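
The operand order matters because subtracting the current time from a past creation time yields a negative duration, which can never exceed a positive threshold. A small sketch using only the standard library (the helper name `is_older_than` and the sample timestamps are made up for illustration):

```python
import datetime

MIN_AGE = datetime.timedelta(minutes=2)


def is_older_than(created_at, now, min_age=MIN_AGE):
    # Age is "now minus creation time"; the reverse order is negative
    # for any timestamp in the past.
    return (now - created_at) > min_age


now = datetime.datetime(2025, 5, 15, 16, 0, tzinfo=datetime.timezone.utc)
old_container = now - datetime.timedelta(minutes=5)
new_container = now - datetime.timedelta(seconds=30)

# The ordering flagged in the comment above, applied to the old container.
inverted_old = (old_container - now) > MIN_AGE
```

With the inverted ordering, even a five-minute-old container fails the check, so the combined condition would not trigger for containers created in the past.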

services/dask-sidecar/src/simcore_service_dask_sidecar/worker.py:83

  • The bare raise is outside the except block due to mis-indentation; this will cause a syntax error. It should be indented under the except or replaced with a specific exception.
raise
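
For context, how a misplaced bare `raise` fails depends on the indentation: inconsistent indentation is rejected at parse time, while a validly indented bare `raise` outside any handler only fails when executed. The sketch below shows the second case with made-up function names; it is not the code from worker.py.

```python
def reraise_inside_handler():
    try:
        raise ValueError("plugin failed to start")
    except ValueError:
        # Inside the handler, a bare raise re-raises the active exception.
        raise


def raise_outside_handler():
    try:
        pass
    except ValueError:
        pass
    # No exception is active here, so this fails at runtime.
    raise


def outcome(func):
    # Return the name of the exception type raised by func, if any.
    try:
        func()
    except Exception as exc:
        return type(exc).__name__
    return None


inside = outcome(reraise_inside_handler)
outside = outcome(raise_outside_handler)
```

Here `inside` reports the original `ValueError`, whereas `outside` reports a `RuntimeError`, which hides the real cause of the failure.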

@codecov

codecov bot commented May 15, 2025

Codecov Report

Attention: Patch coverage is 57.34266% with 61 lines in your changes missing coverage. Please review.

Project coverage is 87.95%. Comparing base (01eeeaf) to head (1567bdd).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7686      +/-   ##
==========================================
+ Coverage   87.56%   87.95%   +0.38%     
==========================================
  Files        1814     1753      -61     
  Lines       70414    67136    -3278     
  Branches     1144     1144              
==========================================
- Hits        61659    59047    -2612     
+ Misses       8443     7777     -666     
  Partials      312      312              
Flag Coverage Δ
integrationtests 64.40% <83.63%> (+0.01%) ⬆️
unittests 86.45% <52.44%> (-0.34%) ⬇️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library 93.92% <ø> (ø)
pkg_dask_task_models_library 79.47% <26.76%> (-19.01%) ⬇️
pkg_models_library 93.07% <ø> (ø)
pkg_notifications_library 85.26% <ø> (ø)
pkg_postgres_database 88.58% <ø> (ø)
pkg_service_integration 69.92% <ø> (ø)
pkg_service_library 72.36% <ø> (ø)
pkg_settings_library 90.90% <ø> (ø)
pkg_simcore_sdk 85.07% <ø> (+0.17%) ⬆️
agent 96.46% <ø> (ø)
api_server 91.68% <ø> (ø)
autoscaling ∅ <ø> (∅)
catalog 92.70% <ø> (ø)
clusters_keeper 99.25% <ø> (ø)
dask_sidecar 91.67% <76.47%> (+1.81%) ⬆️
datcore_adapter 98.12% <ø> (ø)
director 76.78% <ø> (ø)
director_v2 91.04% <90.90%> (-0.17%) ⬇️
dynamic_scheduler 96.76% <ø> (ø)
dynamic_sidecar 90.17% <ø> (-0.02%) ⬇️
efs_guardian 89.79% <ø> (ø)
invitations 93.28% <ø> (ø)
payments 92.63% <ø> (ø)
resource_usage_tracker 89.13% <ø> (+0.10%) ⬆️
storage 87.56% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 88.00% <ø> (+2.18%) ⬆️

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update 01eeeaf...1567bdd.

@sanderegg sanderegg force-pushed the dask-sidecar/task-lifecycle-events branch 6 times, most recently from bf97c8f to 3764bf7 Compare May 19, 2025 14:38
@sanderegg sanderegg requested a review from Copilot May 19, 2025 14:52
@sanderegg sanderegg marked this pull request as ready for review May 19, 2025 14:52
@sanderegg sanderegg requested review from GitHK and pcrespov as code owners May 19, 2025 14:52
Contributor

Copilot AI left a comment

Pull Request Overview

This pull request introduces a task lifecycle tracking system for Dask by adding worker and scheduler plugins, updating configurations to preload these plugins, and refactoring RabbitMQ integration to clarify its semantics.

  • Introduces TaskLifecycleWorkerPlugin and TaskLifecycleSchedulerPlugin for tracking task state transitions.
  • Updates Dask worker and scheduler configurations to integrate lifecycle plugins.
  • Refactors RabbitMQ plugin naming and adjusts test setups to reflect new plugin structure.

Reviewed Changes

Copilot reviewed 20 out of 24 changed files in this pull request and generated no comments.

Show a summary per file
File Description
services/director-v2/modules/comp_scheduler/_scheduler_base.py Modifies task state triage and introduces a tuple return value for changed and executing tasks.
services/director-v2/models/dask_subsystem.py Cleanup: removed outdated Dask client task state enum.
services/dask-sidecar/* Updated plugin imports and improved error messaging; added lifecycle event publishing support.
scripts/maintenance/computational-clusters/autoscaled_monitor/ssh.py Enhanced manual intervention check with a time-based condition.
packages/dask-task-models-library/* Added new plugins for worker and scheduler with updated mappings for task states.
Files not reviewed (4)
  • services/clusters-keeper/requirements/ci.txt: Language not supported
  • services/clusters-keeper/requirements/dev.txt: Language not supported
  • services/director-v2/requirements/_test.in: Language not supported
  • services/director-v2/requirements/_test.txt: Language not supported
Comments suppressed due to low confidence (3)

services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_base.py:332

  • The _get_changed_tasks_from_backend function now returns a tuple instead of a single list. Please ensure that any consumers of this API are updated and that the change is clearly documented.
return ( [ ... ], [ ... ] ),
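
One way to make such a two-list return harder to misuse is a named tuple, which still unpacks like a plain tuple. The sketch below is hypothetical: `TriagedTasks`, `triage`, the string states, and the rule for what counts as "changed" or "executing" are assumptions, not the logic of `_get_changed_tasks_from_backend`.

```python
from typing import NamedTuple


class TriagedTasks(NamedTuple):
    changed: list[str]
    executing: list[str]


def triage(previous: dict[str, str], current: dict[str, str]) -> TriagedTasks:
    # Assumed rules: "changed" means the state differs from the previous
    # snapshot, "executing" means the current state is STARTED.
    changed = [key for key, state in current.items() if previous.get(key) != state]
    executing = [key for key, state in current.items() if state == "STARTED"]
    return TriagedTasks(changed=changed, executing=executing)


previous = {"a": "PENDING", "b": "STARTED", "c": "PENDING"}
current = {"a": "STARTED", "b": "STARTED", "c": "PENDING"}

# Callers that previously received one list now unpack two.
changed, executing = triage(previous, current)
```

A caller that still treats the result as a single list would iterate over the two inner lists instead of over tasks, which is the kind of silent breakage the comment warns about.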

scripts/maintenance/computational-clusters/autoscaled_monitor/ssh.py:268

  • [nitpick] The additional time-based condition may be too strict if container creation times or timezone handling are inconsistent; please verify that the datetime comparison handles timezones correctly.
needs_manual_intervention=_needs_manual_intervention(containers) and ( (containers[0].created_at - arrow.utcnow().datetime) > datetime.timedelta(minutes=2) ),
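
The timezone concern can be checked in isolation: Python refuses to subtract a naive datetime from an aware one, so both operands need to be normalized first. The sketch below uses only the standard library; the helper `as_utc` is hypothetical, and treating naive timestamps as UTC is an assumption that may not hold for the container timestamps in question.

```python
import datetime

UTC = datetime.timezone.utc


def as_utc(dt):
    if dt.tzinfo is None:
        # Assumption: naive timestamps are already expressed in UTC.
        return dt.replace(tzinfo=UTC)
    return dt.astimezone(UTC)


aware_now = datetime.datetime(2025, 5, 19, 15, 0, tzinfo=UTC)
naive_created = datetime.datetime(2025, 5, 19, 14, 55)

# Mixing aware and naive datetimes raises TypeError.
try:
    aware_now - naive_created
    mixed_ok = True
except TypeError:
    mixed_ok = False

# After normalization the subtraction is well defined.
age = aware_now - as_utc(naive_created)
```

If both values are already timezone-aware, the subtraction is safe regardless of their offsets, since aware datetimes are compared as absolute instants.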

services/dask-sidecar/src/simcore_service_dask_sidecar/rabbitmq_worker_plugin.py:101

  • [nitpick] Consider revising the warning message for clarity and professionalism by removing informal language such as 'Beware!' and clarifying any pytest-specific behavior.
_logger.warning("RabbitMQ client plugin setup is not the main thread! Beware! if in pytest it's ok.")

@sanderegg sanderegg force-pushed the dask-sidecar/task-lifecycle-events branch from cfe7113 to 81ec393 Compare May 19, 2025 15:28
Contributor

@GitHK GitHK left a comment

thanks

Collaborator

@matusdrobuliak66 matusdrobuliak66 left a comment

👍

@sanderegg sanderegg force-pushed the dask-sidecar/task-lifecycle-events branch from 367f0cc to 1567bdd Compare May 20, 2025 07:55
Member

@pcrespov pcrespov left a comment

thx!

@sanderegg sanderegg merged commit c4a6124 into ITISFoundation:master May 20, 2025
90 of 95 checks passed
@sanderegg sanderegg deleted the dask-sidecar/task-lifecycle-events branch May 20, 2025 08:24
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Jun 6, 2025
92 tasks
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Aug 5, 2025
88 tasks

Labels

a:computational clusters a:dask-service Any of the dask services: dask-scheduler/sidecar or worker a:director-v2 issue related with the director-v2 service

Projects

None yet

Development

Successfully merging this pull request may close these issues.

After a long time, computational run starting time point is not transmitted

4 participants