Skip to content

Conversation

@sanderegg
Copy link
Member

@sanderegg sanderegg commented Sep 17, 2025

What do these changes do?

  • The autoscaling service is asking the dask-backend to "retire workers" if possible
  • the dask-backend provides this API but it was observed that some workers do not have the time to access their jobs before being forcefully retired
  • --> this creates a situation where workers are added and removed continuously
  • --> this makes large computations sometimes less efficient as workers cannot take jobs since they are retired right away

This PR only will call that API when the autoscaling estimates that the current workers are not all needed to cover the jobs needs, making it less eager to remove workers.

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 17, 2025
@sanderegg sanderegg self-assigned this Sep 17, 2025
@sanderegg sanderegg added a:autoscaling autoscaling service in simcore's stack a:computational clusters labels Sep 17, 2025
@sanderegg sanderegg requested a review from Copilot September 17, 2025 06:49
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR improves the autoscaling performance by making the system less eager to retire dask workers. The change only calls the dask retirement API when workers are actually not needed, preventing a situation where workers are continuously added and removed without having time to process jobs.

Key Changes

  • Modified _scale_down_unused_cluster_instances to conditionally call worker retirement based on task assignment status
  • Added comprehensive test coverage for the new retirement behavior

Reviewed Changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

File Description
_auto_scaling_core.py Added conditional logic to only retire nodes when no active nodes have assigned tasks
test_modules_cluster_scaling_computational.py Added spy fixture and assertions to verify retirement calls happen only when expected

Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@sonarqubecloud
Copy link

@codecov
Copy link

codecov bot commented Sep 17, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 67.71%. Comparing base (dd684b7) to head (dc1cb3b).
⚠️ Report is 1 commits behind head on master.

❗ There is a different number of reports uploaded between BASE (dd684b7) and HEAD (dc1cb3b). Click for more details.

HEAD has 31 uploads less than BASE
Flag BASE (dd684b7) HEAD (dc1cb3b)
unittests 32 1
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #8374       +/-   ##
===========================================
- Coverage   87.89%   67.71%   -20.19%     
===========================================
  Files        1950      798     -1152     
  Lines       75907    37218    -38689     
  Branches     1336      175     -1161     
===========================================
- Hits        66718    25201    -41517     
- Misses       8791    11960     +3169     
+ Partials      398       57      -341     
Flag Coverage Δ
integrationtests 63.95% <ø> (+0.05%) ⬆️
unittests 95.78% <100.00%> (+9.21%) ⬆️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 76.63% <ø> (-8.36%) ⬇️
agent ∅ <ø> (∅)
api_server ∅ <ø> (∅)
autoscaling 95.78% <100.00%> (+<0.01%) ⬆️
catalog ∅ <ø> (∅)
clusters_keeper ∅ <ø> (∅)
dask_sidecar ∅ <ø> (∅)
datcore_adapter ∅ <ø> (∅)
director ∅ <ø> (∅)
director_v2 78.18% <ø> (-12.75%) ⬇️
dynamic_scheduler ∅ <ø> (∅)
dynamic_sidecar 81.87% <ø> (-8.59%) ⬇️
efs_guardian ∅ <ø> (∅)
invitations ∅ <ø> (∅)
payments ∅ <ø> (∅)
resource_usage_tracker ∅ <ø> (∅)
storage ∅ <ø> (∅)
webclient ∅ <ø> (∅)
webserver 58.77% <ø> (-29.19%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update dd684b7...dc1cb3b. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice! thx

@mergify
Copy link
Contributor

mergify bot commented Sep 17, 2025

🧪 CI Insights

Here's what we observed from your CI run for dc1cb3b.

✅ Passed Jobs With Interesting Signals

Pipeline Job Signal Health on master Retries 🔍 CI Insights 📄 Logs
CI integration-tests Base branch is healthy, but retries were needed. Could be early signs of flakiness 👀 Healthy 1 View View
system-tests Base branch is healthy, but retries were needed. Could be early signs of flakiness 👀 Healthy 1 View View
unit-tests Base branch is broken, but the job passed. Looks like this might be a real fix 💪 Broken 0 View View

@sanderegg sanderegg merged commit fea215e into ITISFoundation:master Sep 17, 2025
140 of 148 checks passed
@sanderegg sanderegg deleted the computational-backend/performance-improvements-step2 branch September 17, 2025 07:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a:autoscaling autoscaling service in simcore's stack a:computational clusters

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants