🐛🎨 Do not fail a pipeline when the clusters-keeper or the computational backend in general is not reachable for a short time 🚨 #8286
Conversation
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #8286       +/-   ##
===========================================
- Coverage   86.62%   68.76%   -17.87%
===========================================
  Files        1940      755     -1185
  Lines       75306    34787    -40519
  Branches     1311      175     -1136
===========================================
- Hits        65237    23921    -41316
- Misses       9674    10809     +1135
+ Partials      395       57      -338
```
Continue to review full report in Codecov by Sentry.
🧪 CI Insights
Here's what we observed from your CI run for 6e01185. 🟢 All jobs passed! But CI Insights is watching 👀
Thanks!
Pull Request Overview
This PR changes the computational scheduler's behavior when backend services (clusters-keeper, dask scheduler, or rabbitmq) are unreachable. Instead of immediately setting pipelines to FAILED state, they are now set to WAITING_FOR_CLUSTER state, preventing premature failures during service restarts or deployments.
Key changes:
- Modified exception handling to use WAITING_FOR_CLUSTER state instead of FAILED state for backend connectivity issues
- Added a timeout mechanism to eventually fail pipelines that wait too long for cluster availability (see the sketch after this list)
- Updated test cases to reflect the new behavior and verify the timeout functionality
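To make the intended behavior concrete, here is a minimal, self-contained sketch of the decision logic; the exception type, state enum, and function names are illustrative assumptions and do not reflect the actual director-v2 code:

```python
# Minimal sketch of the scheduling decision described above. All names
# (BackendNotReachableError, PipelineState, decide_pipeline_state, schedule_once)
# are illustrative assumptions, not the real director-v2 API.
import datetime
import enum


class PipelineState(enum.Enum):
    STARTED = "STARTED"
    WAITING_FOR_CLUSTER = "WAITING_FOR_CLUSTER"
    FAILED = "FAILED"


class BackendNotReachableError(Exception):
    """Raised when clusters-keeper, the dask scheduler or rabbitmq cannot be reached."""


def decide_pipeline_state(
    *,
    waiting_since: datetime.datetime,
    max_wait: datetime.timedelta,
    now: datetime.datetime,
) -> PipelineState:
    """Keep waiting while the outage is shorter than max_wait, fail afterwards."""
    if now - waiting_since >= max_wait:
        return PipelineState.FAILED
    return PipelineState.WAITING_FOR_CLUSTER


def schedule_once(
    run_pipeline,  # callable that talks to the computational backend
    *,
    waiting_since: datetime.datetime | None,
    max_wait: datetime.timedelta,
) -> tuple[PipelineState, datetime.datetime | None]:
    """One scheduling iteration: backend errors no longer fail the run immediately."""
    now = datetime.datetime.now(datetime.timezone.utc)
    try:
        run_pipeline()
        return PipelineState.STARTED, None  # backend reachable, reset the wait clock
    except BackendNotReachableError:
        started_waiting = waiting_since or now
        state = decide_pipeline_state(
            waiting_since=started_waiting, max_wait=max_wait, now=now
        )
        return state, started_waiting
```

Remembering when the pipeline first started waiting is what lets the scheduler eventually give up after the configured timeout instead of waiting forever.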
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test_scheduler_dask.py | Updated test to verify new WAITING_FOR_CLUSTER behavior and timeout functionality |
| conftest.py | Added fixture for configuring short cluster wait timeout in tests |
| dask.py | Added return type annotations and removed redundant error logging |
| _scheduler_base.py | Implemented core logic changes for handling backend connectivity issues with WAITING_FOR_CLUSTER state |
| settings.py | Added configuration setting for maximum cluster wait timeout |
Looks good, thanks.
thx
@mergify queue

🛑 Configuration not compatible with a branch protection setting
The branch protection setting

What do these changes do?
This PR changes the behavior of the dv-2 computational scheduler:
BEFORE: if the clusters-keeper, the dask scheduler, or rabbitmq was unreachable (e.g. during a deployment or service restart), the pipeline was immediately set to FAILED.
AFTER: in that case the pipeline is set to WAITING_FOR_CLUSTER and scheduling is retried; it is only set to FAILED once it has waited too long for the cluster.
A new setting COMPUTATIONAL_BACKEND_MAX_WAITING_FOR_CLUSTER_TIMEOUT can be used in case we should change that value (it currently defaults to 10 minutes).
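Since the diff is not shown here, this is a hedged sketch of what such a setting could look like with pydantic-settings; only the field name and the 10-minute default come from this PR, the class name and field metadata are assumptions:

```python
# Sketch only: the class name and description text are assumptions; the field
# name and the 10-minute default are taken from the PR description.
import datetime

from pydantic import Field
from pydantic_settings import BaseSettings


class ComputationalBackendSettings(BaseSettings):
    COMPUTATIONAL_BACKEND_MAX_WAITING_FOR_CLUSTER_TIMEOUT: datetime.timedelta = Field(
        default=datetime.timedelta(minutes=10),
        description="Maximum time a pipeline may stay in WAITING_FOR_CLUSTER "
        "before it is finally set to FAILED",
    )
```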
Tests driving changes:
- test_scheduler_dask.py (a hedged sketch of the verified behavior follows)
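This is not the real test_scheduler_dask.py but a hedged pytest sketch of the behavior being verified; it drives the illustrative helpers from the earlier sketch, assumed importable here as scheduler_sketch:

```python
# Hedged pytest sketch of the behavior the updated tests verify; it exercises
# the illustrative helpers from the earlier sketch, not the real dask scheduler.
import datetime

from scheduler_sketch import BackendNotReachableError, PipelineState, schedule_once


def _unreachable_backend() -> None:
    raise BackendNotReachableError


def test_pipeline_waits_then_fails_after_timeout() -> None:
    max_wait = datetime.timedelta(seconds=0.1)

    # first iteration: the backend is down, the pipeline waits instead of failing
    state, waiting_since = schedule_once(
        _unreachable_backend, waiting_since=None, max_wait=max_wait
    )
    assert state is PipelineState.WAITING_FOR_CLUSTER

    # pretend the pipeline has already waited longer than the configured timeout
    long_ago = waiting_since - datetime.timedelta(seconds=1)
    state, _ = schedule_once(
        _unreachable_backend, waiting_since=long_ago, max_wait=max_wait
    )
    assert state is PipelineState.FAILED
```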
Related issue/s

How to test
Dev-ops