
@sanderegg
Member

@sanderegg sanderegg commented May 1, 2025

What do these changes do?

Dask deprecated its Pub/Sub mechanism and recently removed it from its code. We use this mechanism to transfer logs and progress events from the dask-sidecar to the director-v2. In addition, the director-v2 only forwards these logs to RabbitMQ, which loads the director-v2 for no added value.

Next step: removing the progress Pub/Sub mechanism

This PR aims to replace this mechanism by:

  • Using RabbitMQ to directly send the logs from the dask-sidecar
  • This means that replicating the director-v2 no longer replicates the log messages of the computational services
  • The director-v2 no longer needs to forward the logs which should reduce its resource usage by a large amount
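
As a rough illustration of the first bullet (not the code in this PR), the sketch below attaches a `logging.Handler` that hands every record straight to a publish callable, so no intermediate service has to relay it. `PublishingLogHandler`, `fake_publish` and the `logs-channel` name are hypothetical; a real setup would pass the RabbitMQ client's publish function instead of the in-memory list used here.

```python
import logging
from typing import Callable, List, Tuple


class PublishingLogHandler(logging.Handler):
    """Forward each formatted log record to a publish callable (hypothetical helper)."""

    def __init__(self, publish: Callable[[str, str], None], channel_name: str) -> None:
        super().__init__()
        self._publish = publish
        self._channel_name = channel_name

    def emit(self, record: logging.LogRecord) -> None:
        try:
            # The worker publishes the log itself instead of relaying it
            # through another service.
            self._publish(self._channel_name, self.format(record))
        except Exception:
            # Standard logging behaviour: never let a publish error crash the task.
            self.handleError(record)


# In-memory stand-in for the broker, used only for this illustration.
published: List[Tuple[str, str]] = []


def fake_publish(channel_name: str, message: str) -> None:
    published.append((channel_name, message))


task_logger = logging.getLogger("sketch.task")
task_logger.setLevel(logging.INFO)
task_logger.propagate = False  # keep the example output out of the root logger
task_logger.addHandler(PublishingLogHandler(fake_publish, "logs-channel"))
task_logger.info("solver started")
```

With this shape, the number of director-v2 replicas has no influence on how many copies of a log line reach the broker, since each line is published once by the worker that produced it.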

Details

  • creation of a RabbitMQPlugin for dask-workers that connects to 🐰 (will close the dask-worker if it can't so that docker restarts it)
  • refactored logging setup code for clarity
  • dask-sidecar publisher now sends logs straight to RabbitMQ (no more dask Pub which is deprecated and broken)
  • director-v2 does not get any of the dask-sidecar logs anymore
  • removes TaskLogEvent from dask-task-models-library, dask-sidecar and director-v2
  • pydantic V2 json_schema_extra callable fixes to remove mypy ignores
  • refactors dask-sidecar modules to go in a utils subfolder
  • private clusters now require RabbitMQ settings in order to send logs (this requires #7485, "✨Computational clusters: connect autoscaling to RabbitMQ ⚠️", and osparc-ops-environments#1030, "Expose RabbitMQ for Computational clusters"); done via clusters-keeper
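
The connect-or-close behaviour described in the first bullet can be sketched roughly as follows. This is a minimal, self-contained approximation and not the `RabbitMQPlugin` from this PR: the class only mirrors the `setup`/`teardown` hook names of a dask worker plugin, and `RabbitMQPluginSketch`, `FakeWorker` and the `connect` coroutines are hypothetical stand-ins so the example runs without dask or a broker.

```python
import asyncio
from typing import Awaitable, Callable, Optional


class RabbitMQPluginSketch:
    """Duck-typed stand-in exposing the setup/teardown hooks of a worker plugin."""

    def __init__(self, connect: Callable[[], Awaitable[object]]) -> None:
        # `connect` is a hypothetical coroutine factory that opens the broker connection.
        self._connect = connect
        self.client: Optional[object] = None

    async def setup(self, worker) -> None:
        try:
            self.client = await self._connect()
        except ConnectionError:
            # Without a broker the worker cannot ship logs, so it shuts down
            # and leaves the restart to the container orchestrator.
            await worker.close()

    async def teardown(self, worker) -> None:
        self.client = None


class FakeWorker:
    """Minimal worker double that records whether close() was awaited."""

    def __init__(self) -> None:
        self.closed = False

    async def close(self) -> None:
        self.closed = True


async def broken_connect() -> object:
    raise ConnectionError("broker unreachable")


async def working_connect() -> object:
    return "connected-client"


def run_setup(connect: Callable[[], Awaitable[object]]):
    """Run the plugin setup against a fresh fake worker and return both."""
    worker = FakeWorker()
    plugin = RabbitMQPluginSketch(connect)
    asyncio.run(plugin.setup(worker))
    return plugin, worker
```

Closing the worker rather than retrying forever keeps the failure visible: the container exits, the restart policy brings it back, and a worker never accepts tasks whose logs it could not deliver.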

Bonus Fixes:

  • comp_runs started column not filled correctly (now filled when job starts instead of when the computation is created)
  • fixes autoscaling-monitor cli access to available_space when empty @YuryHrytsuk
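
The `started` fix in the first bullet amounts to separating two timestamps. A minimal sketch, assuming a simplified in-memory record rather than the actual `comp_runs` table: `CompRunSketch` and `mark_started` are hypothetical names, and the "set only once" rule is an assumption of this example.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class CompRunSketch:
    """Hypothetical, simplified view of one computation run."""

    created: datetime
    started: Optional[datetime] = None  # stays empty until the job really starts

    def mark_started(self, now: datetime) -> None:
        # Assumption of this sketch: the first start time wins.
        if self.started is None:
            self.started = now


run = CompRunSketch(created=datetime(2025, 5, 1, 12, 0, tzinfo=timezone.utc))
run.mark_started(datetime(2025, 5, 1, 12, 5, tzinfo=timezone.utc))
```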

Related issue/s

How to test

Dev-ops

  • dask-sidecar now uses the usual RABBIT_* environment variables to access the 🐰
  • rabbit service is now also part of the computational_services_subnet docker network such that dask-sidecar can access it
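
For deployment checks, the sketch below shows one way such variables could be turned into a broker URL. The variable names (`RABBIT_HOST`, `RABBIT_PORT`, `RABBIT_USER`, `RABBIT_PASSWORD`, `RABBIT_SECURE`), the default port and the `RabbitEnvSketch` class are assumptions made for illustration; the authoritative names and defaults are those of the platform's RabbitMQ settings.

```python
from dataclasses import dataclass
from typing import Mapping
from urllib.parse import quote


@dataclass(frozen=True)
class RabbitEnvSketch:
    """Hypothetical reader for RABBIT_* variables (names are assumptions)."""

    host: str
    port: int
    user: str
    password: str
    secure: bool

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "RabbitEnvSketch":
        # Missing mandatory variables raise KeyError, so a misconfigured
        # container fails at start-up instead of silently dropping logs.
        return cls(
            host=env["RABBIT_HOST"],
            port=int(env.get("RABBIT_PORT", "5672")),
            user=env["RABBIT_USER"],
            password=env["RABBIT_PASSWORD"],
            secure=env.get("RABBIT_SECURE", "0").lower() in {"1", "true"},
        )

    @property
    def dsn(self) -> str:
        scheme = "amqps" if self.secure else "amqp"
        # Percent-encode credentials so characters such as "@" stay unambiguous.
        user = quote(self.user, safe="")
        password = quote(self.password, safe="")
        return f"{scheme}://{user}:{password}@{self.host}:{self.port}"
```

In a container this would typically be called as `RabbitEnvSketch.from_env(os.environ)`; the host value only resolves if the sidecar shares a docker network with the rabbit service, which is what the second bullet provides.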

@sanderegg sanderegg added the bug (buggy, it does not work as expected), a:director-v2 (issue related with the director-v2 service) and a:computational clusters labels May 1, 2025
@sanderegg sanderegg added this to the Pauwel Kwak milestone May 1, 2025
@sanderegg sanderegg self-assigned this May 1, 2025
@codecov

codecov bot commented May 1, 2025

Codecov Report

Attention: Patch coverage is 84.31373% with 32 lines in your changes missing coverage. Please review.

Project coverage is 88.43%. Comparing base (6b53689) to head (6bec045).
Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7621      +/-   ##
==========================================
+ Coverage   84.00%   88.43%   +4.42%     
==========================================
  Files        1788     1766      -22     
  Lines       68985    67869    -1116     
  Branches     1134     1134              
==========================================
+ Hits        57951    60020    +2069     
+ Misses      10722     7538    -3184     
+ Partials      312      311       -1     
Flag Coverage Δ
integrationtests 64.96% <83.33%> (+<0.01%) ⬆️
unittests 86.96% <84.31%> (+4.03%) ⬆️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library 93.92% <ø> (ø)
pkg_dask_task_models_library 98.48% <100.00%> (+1.10%) ⬆️
pkg_models_library 92.80% <ø> (ø)
pkg_notifications_library 85.26% <ø> (ø)
pkg_postgres_database 88.41% <ø> (ø)
pkg_service_integration 69.92% <ø> (ø)
pkg_service_library 73.02% <ø> (ø)
pkg_settings_library 90.90% <ø> (ø)
pkg_simcore_sdk 85.72% <ø> (+0.05%) ⬆️
agent 96.46% <ø> (ø)
api_server 92.50% <ø> (ø)
autoscaling 96.08% <ø> (ø)
catalog 92.64% <ø> (ø)
clusters_keeper 99.25% <ø> (ø)
dask_sidecar 89.88% <82.08%> (-1.49%) ⬇️
datcore_adapter 98.12% <ø> (ø)
director 76.80% <ø> (ø)
director_v2 91.14% <83.33%> (+<0.01%) ⬆️
dynamic_scheduler 96.76% <ø> (ø)
dynamic_sidecar 90.14% <ø> (-0.02%) ⬇️
efs_guardian 89.79% <ø> (ø)
invitations 93.28% <ø> (ø)
payments 92.63% <ø> (ø)
resource_usage_tracker 89.12% <ø> (-0.11%) ⬇️
storage 87.56% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 88.30% <ø> (+14.74%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 6b53689...6bec045. Read the comment docs.


@sanderegg sanderegg changed the title from "🐛Computational billing incorect" to "🐛Dask-Sidecar: add RabbitMQ dependency and remove usage of deprecated Pub/Sub" May 2, 2025
@sanderegg sanderegg changed the title from "🐛Dask-Sidecar: add RabbitMQ dependency and remove usage of deprecated Pub/Sub" to "♻️✨🐛Dask-Sidecar: add RabbitMQ dependency and remove usage of deprecated Pub/Sub" May 2, 2025
@sanderegg sanderegg modified the milestones: Pauwel Kwak, Bazinga! May 6, 2025
@sanderegg sanderegg force-pushed the bugfix/billing branch 2 times, most recently from 63ff44a to 68d3802 May 6, 2025 20:57
@sanderegg sanderegg changed the title from "♻️✨🐛Dask-Sidecar: add RabbitMQ dependency and remove usage of deprecated Pub/Sub" to "♻️✨🐛Dask-Sidecar: add RabbitMQ dependency and remove usage of deprecated Pub/Sub for logs" May 7, 2025
@sanderegg sanderegg changed the title from "♻️✨🐛Dask-Sidecar: add RabbitMQ dependency and remove usage of deprecated Pub/Sub for logs" to "♻️✨🐛Dask-Sidecar: add RabbitMQ dependency and remove usage of deprecated Pub/Sub for logs 🚨🚨🚨" May 8, 2025
@sanderegg sanderegg marked this pull request as ready for review May 8, 2025 11:40
Contributor

Copilot AI left a comment


Pull Request Overview

This PR replaces the deprecated Dask Pub/Sub logging mechanism with a RabbitMQ-based implementation and refactors related settings, logging, and error handling. Key changes include updating TaskPublisher to use asynchronous RabbitMQ messaging for logs, revising settings import and initialization, and modifying test and deployment configurations accordingly.

Reviewed Changes

Copilot reviewed 45 out of 46 changed files in this pull request and generated no comments.

Summary per file

  • services/dask-sidecar/src/simcore_service_dask_sidecar/utils/dask.py: Updated TaskPublisher to send logs directly via RabbitMQ instead of using TaskLogEvent.
  • services/dask-sidecar/src/simcore_service_dask_sidecar/settings.py: Refactored settings to use ApplicationSettings and added RabbitMQ settings.
  • services/dask-sidecar/src/simcore_service_dask_sidecar/scheduler.py: Adjusted scheduler logging and settings retrieval.
  • services/dask-sidecar/src/simcore_service_dask_sidecar/rabbitmq_plugin.py: Introduced RabbitMQPlugin for managing RabbitMQ connectivity in a dask-worker.
  • Other files: Minor updates for asynchronous usage and configuration consistency across the codebase.

Files not reviewed (1)
  • services/dask-sidecar/docker/boot.sh: Language not supported
Comments suppressed due to low confidence (1)

services/dask-sidecar/src/simcore_service_dask_sidecar/utils/dask.py:110

  • When publishing parent logs, the wrong message object is passed; it should use 'parent_message' instead of 'base_message'. Consider replacing 'base_message' with 'parent_message' in the call.
await rabbitmq_client.publish_message_from_any_thread(parent_message.channel_name, base_message)

Collaborator

@matusdrobuliak66 matusdrobuliak66 left a comment


Thanks 👍
Q:

  • Is the worker the one who talks directly to Rabbit? (Is there a use case where the manager also needs to communicate?)
  • What happens if this is an external cluster? Will we still have access to Rabbit? Will it only work if Rabbit is exposed? So if we run Rabbit as part of our Simcore stack, it won’t work?

@sanderegg
Member Author

sanderegg commented May 8, 2025

Thanks 👍 Q:

  • Is the worker the one who talks directly to Rabbit? (Is there a use case where the manager also needs to communicate?)
  • What happens if this is an external cluster? Will we still have access to Rabbit? Will it only work if Rabbit is exposed? So if we run Rabbit as part of our Simcore stack, it won’t work?

@matusdrobuliak66 as stated in the description:

Member

@mrnicegyu11 mrnicegyu11 left a comment


Read the description, thanks a lot. Already approving to unblock, ping me in case you'd like a thorough code review no problem:)

Contributor

@GitHK GitHK left a comment


👍

@sonarqubecloud

sonarqubecloud bot commented May 8, 2025

@sanderegg sanderegg enabled auto-merge (squash) May 8, 2025 14:59
@sanderegg sanderegg disabled auto-merge May 8, 2025 15:09
@sanderegg sanderegg merged commit 25ce7b5 into ITISFoundation:master May 8, 2025
95 checks passed
@sanderegg sanderegg deleted the bugfix/billing branch May 8, 2025 15:45
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Jun 6, 2025
92 tasks
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Aug 5, 2025
88 tasks

Labels

a:computational clusters, a:director-v2 (issue related with the director-v2 service), bug (buggy, it does not work as expected)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Dask sidecar: connect directly to RabbitMQ for logs

5 participants