
Conversation

@sanderegg
Member

@sanderegg sanderegg commented Sep 5, 2025

What do these changes do?

When the dask-scheduler does not respond in time (timeouts), the director-v2 would set the task as FAILED even when it had actually succeeded. This triggers error messages like the one in this screenshot:
[image: screenshot of the error message]

This PR changes that behavior by retrying to get the task results several times instead. It also introduces a new constant that defines how long to keep retrying; that time is currently hard-coded (i.e. no new ENV).
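
The retry idea can be sketched roughly as follows. This is only an illustration under stated assumptions, not the code from this PR: `get_results_with_retry`, `SchedulerTimeoutError`, `MAX_WAIT_FOR_RESULTS_S`, `RETRY_INTERVAL_S` and their values are hypothetical names picked for the example, and `fetch` stands in for whatever call asks the dask-scheduler for the results.

```python
import asyncio

# Hypothetical values: the PR only states that the wait time is a
# hard-coded constant, it gives neither its name nor its value.
MAX_WAIT_FOR_RESULTS_S = 2.0
RETRY_INTERVAL_S = 0.05


class SchedulerTimeoutError(Exception):
    """Stand-in for a timeout while talking to the dask-scheduler."""


async def get_results_with_retry(
    fetch,
    *,
    max_wait_s: float = MAX_WAIT_FOR_RESULTS_S,
    interval_s: float = RETRY_INTERVAL_S,
):
    """Call `fetch()` until it succeeds or `max_wait_s` has elapsed.

    Only once the deadline has passed is the timeout re-raised, at which
    point the caller may decide to mark the task as FAILED.
    """
    loop = asyncio.get_running_loop()
    deadline = loop.time() + max_wait_s  # monotonic clock of the event loop
    while True:
        try:
            return await fetch()
        except SchedulerTimeoutError:
            if loop.time() >= deadline:
                raise
            await asyncio.sleep(interval_s)
```

With this shape, a scheduler that is briefly unresponsive only delays the result instead of turning a successful task into a FAILED one.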

This PR is the first step in a while towards improving the stability of the computational backend. Follow-up PRs will tackle:

  • DB access is currently too wild
  • in the use case where many separate projects are created (metamodeling or other kinds of for-loops), too much traffic is sent to the dask-scheduler; this needs a few optimisations
  • the distribution of jobs on workers is suboptimal

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 5, 2025
@sanderegg sanderegg self-assigned this Sep 5, 2025
@sanderegg sanderegg added a:director-v2 issue related with the director-v2 service a:computational clusters labels Sep 5, 2025
@codecov

codecov bot commented Sep 5, 2025

Codecov Report

❌ Patch coverage is 84.37500% with 15 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.83%. Comparing base (a98b2fb) to head (fc6f56a).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8323      +/-   ##
==========================================
- Coverage   87.83%   87.83%   -0.01%     
==========================================
  Files        1945     1945              
  Lines       75397    75430      +33     
  Branches     1311     1311              
==========================================
+ Hits        66226    66254      +28     
- Misses       8776     8781       +5     
  Partials      395      395              
Flag Coverage Δ
integrationtests 64.09% <58.94%> (+0.05%) ⬆️
unittests 86.49% <84.37%> (-0.01%) ⬇️
Components Coverage Δ
pkg_aws_library 93.59% <ø> (ø)
pkg_celery_library 87.37% <ø> (ø)
pkg_dask_task_models_library 79.33% <0.00%> (-0.30%) ⬇️
pkg_models_library 93.15% <ø> (ø)
pkg_notifications_library 85.20% <ø> (ø)
pkg_postgres_database 88.02% <ø> (ø)
pkg_service_integration 70.19% <ø> (ø)
pkg_service_library 71.08% <ø> (ø)
pkg_settings_library 90.19% <ø> (ø)
pkg_simcore_sdk 85.03% <ø> (+0.05%) ⬆️
agent 93.53% <ø> (ø)
api_server 91.91% <ø> (ø)
autoscaling 95.77% <ø> (ø)
catalog 92.34% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar 91.81% <ø> (-0.57%) ⬇️
datcore_adapter 97.94% <ø> (ø)
director 75.81% <ø> (ø)
director_v2 90.94% <85.26%> (-0.03%) ⬇️
dynamic_scheduler 96.27% <ø> (ø)
dynamic_sidecar 90.46% <ø> (ø)
efs_guardian 89.62% <ø> (ø)
invitations 91.44% <ø> (ø)
payments 92.61% <ø> (ø)
resource_usage_tracker 92.24% <ø> (+0.10%) ⬆️
storage 86.49% <ø> (ø)
webclient ∅ <ø> (∅)
webserver 88.00% <ø> (+0.01%) ⬆️


Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a98b2fb...fc6f56a. Read the comment docs.


@mergify
Contributor

mergify bot commented Sep 5, 2025

🧪 CI Insights

Here's what we observed from your CI run for fc6f56a.

✅ Passed Jobs With Interesting Signals

Pipeline | Job | Signal | Health on master | Retries
CI | unit-tests | Base branch is broken, but the job passed. Looks like this might be a real fix 💪 | Broken | 0

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse branch from aa2ee3e to 5fa7dac Compare September 5, 2025 15:20
@sanderegg sanderegg changed the title from "🐛🎨⚗️Computational backend: Improvements on different matters" to "🐛🎨⚗️Computational backend: Stability (Step 1)" Sep 8, 2025
@sanderegg sanderegg requested a review from Copilot September 8, 2025 07:43

This comment was marked as outdated.

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse branch 3 times, most recently from d9c1e1b to 8a082ac Compare September 8, 2025 13:44
@sanderegg sanderegg requested a review from Copilot September 8, 2025 14:20
Contributor

Copilot AI left a comment

Pull Request Overview

This PR improves the stability of the computational backend by addressing timeout issues when retrieving task results from the dask-scheduler. Previously, when task result retrieval timed out, tasks would be immediately marked as FAILED even if they were actually successful. The new implementation introduces a retry mechanism that waits for a configurable period before marking tasks as failed.

Key changes:

  • Implements retry logic for task result retrieval with timeout handling
  • Adds new configuration setting for maximum wait time when retrieving results
  • Refactors task result processing to handle different error types appropriately
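
One possible reading of "handle different error types appropriately" is a small decision function evaluated each time the scheduler looks at a completed task. The sketch below is an assumption-laden illustration, not the logic in `_scheduler_dask.py`: `Decision`, `decide`, `MAX_WAIT` and its value are hypothetical, and the built-in `TimeoutError` stands in for whatever exception signals an unresponsive dask-scheduler.

```python
from datetime import datetime, timedelta, timezone
from enum import Enum
from typing import Optional


class Decision(Enum):
    SUCCESS = "SUCCESS"
    RETRY_LATER = "RETRY_LATER"
    FAILED = "FAILED"


# Hypothetical maximum time to keep trying to retrieve the results.
MAX_WAIT = timedelta(minutes=10)


def decide(
    error: Optional[Exception],
    completed_at: datetime,
    now: datetime,
    max_wait: timedelta = MAX_WAIT,
) -> Decision:
    """Classify the outcome of one attempt to retrieve a task's results."""
    if error is None:
        # Results were retrieved: the task really succeeded.
        return Decision.SUCCESS
    if isinstance(error, TimeoutError):
        # Scheduler did not answer in time: keep the task and try again
        # later, unless the waiting period is already used up.
        if now - completed_at < max_wait:
            return Decision.RETRY_LATER
        return Decision.FAILED
    # Any other error is treated as a genuine failure in this sketch.
    return Decision.FAILED
```

For example, `decide(TimeoutError(), completed_at, now)` yields `Decision.RETRY_LATER` as long as `now - completed_at` is below `max_wait`, and `Decision.FAILED` afterwards.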

Reviewed Changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 3 comments.

File Description
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py Major refactoring of task result processing logic with new error handling and retry mechanisms
services/director-v2/src/simcore_service_director_v2/core/settings.py Adds new configuration setting for maximum wait time when retrieving results
services/director-v2/src/simcore_service_director_v2/modules/dask_client.py Updates logging and removes get_cluster_details method
services/director-v2/tests/unit/with_dbs/comp_scheduler/test_scheduler_dask.py Adds comprehensive test for the new retry behavior
services/director-v2/tests/unit/with_dbs/comp_scheduler/conftest.py Adds fixture for testing the new timeout configuration
services/director-v2/tests/unit/test_modules_dask_client.py Removes test for deleted get_cluster_details functionality
packages/dask-task-models-library/src/dask_task_models_library/plugins/task_life_cycle_worker_plugin.py Adds type assertion for worker instance

@sanderegg sanderegg marked this pull request as ready for review September 8, 2025 14:44
Member

@pcrespov pcrespov left a comment

thx so much!

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse branch 2 times, most recently from 4af263a to 45e2c7d Compare September 9, 2025 05:34
Contributor

@bisgaard-itis bisgaard-itis left a comment

Cool, thanks a lot for the effort. I think this PR should close/resolve #7975 and ITISFoundation/osparc-issues#1952.

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse branch 2 times, most recently from 0cbd6d7 to 366ac4b Compare September 9, 2025 05:49
Contributor

@wvangeit wvangeit left a comment

Looks good. Thanks

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse branch from a11cf98 to 19a833b Compare September 9, 2025 07:01
@sonarqubecloud

sonarqubecloud bot commented Sep 9, 2025

@sanderegg sanderegg merged commit 59febda into ITISFoundation:master Sep 9, 2025
94 of 95 checks passed
@sanderegg sanderegg deleted the computational-backend/improvements-diverse branch September 9, 2025 08:42
@sanderegg sanderegg changed the title from "🐛🎨⚗️Computational backend: Stability (Step 1)" to "🐛🎨⚗️Computational backend stability: improvements step 1" Sep 10, 2025
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Sep 19, 2025
65 tasks

Labels

a:computational clusters a:director-v2 issue related with the director-v2 service

5 participants