🐛🎨 Do not fail a pipeline when the clusters-keeper or the computational backend in general is not reachable for a short time 🚨 #8286
Conversation
Codecov Report
❌ Patch coverage is

Additional details and impacted files

```diff
@@            Coverage Diff             @@
##           master    #8286       +/-   ##
===========================================
- Coverage   86.62%   68.76%   -17.87%
===========================================
  Files        1940      755     -1185
  Lines       75306    34787    -40519
  Branches     1311      175     -1136
===========================================
- Hits        65237    23921    -41316
- Misses       9674    10809     +1135
+ Partials      395       57      -338
```
Continue to review full report in Codecov by Sentry.
🧪 CI Insights
Here's what we observed from your CI run for 6e01185. 🟢 All jobs passed! But CI Insights is watching 👀
Thanks!
Pull Request Overview
This PR changes the computational scheduler's behavior when backend services (clusters-keeper, dask scheduler, or rabbitmq) are unreachable. Instead of immediately setting pipelines to FAILED state, they are now set to WAITING_FOR_CLUSTER state, preventing premature failures during service restarts or deployments.
Key changes:
- Modified exception handling to use WAITING_FOR_CLUSTER state instead of FAILED state for backend connectivity issues
- Added a timeout mechanism to eventually fail pipelines that wait too long for cluster availability (see the sketch after this list)
- Updated test cases to reflect the new behavior and verify the timeout functionality
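To make the intended behavior concrete, here is a minimal, self-contained sketch of the decision logic; the exception type, state enum, and function names are illustrative assumptions and do not reflect the actual director-v2 code:

```python
# Minimal sketch of the scheduling decision described above. All names
# (BackendNotReachableError, PipelineState, decide_pipeline_state, schedule_once)
# are illustrative assumptions, not the real director-v2 API.
import datetime
import enum


class PipelineState(enum.Enum):
    STARTED = "STARTED"
    WAITING_FOR_CLUSTER = "WAITING_FOR_CLUSTER"
    FAILED = "FAILED"


class BackendNotReachableError(Exception):
    """Raised when clusters-keeper, the dask scheduler or rabbitmq cannot be reached."""


def decide_pipeline_state(
    *,
    waiting_since: datetime.datetime,
    max_wait: datetime.timedelta,
    now: datetime.datetime,
) -> PipelineState:
    """Keep waiting while the outage is shorter than max_wait, fail afterwards."""
    if now - waiting_since >= max_wait:
        return PipelineState.FAILED
    return PipelineState.WAITING_FOR_CLUSTER


def schedule_once(
    run_pipeline,  # callable that talks to the computational backend
    *,
    waiting_since: datetime.datetime | None,
    max_wait: datetime.timedelta,
) -> tuple[PipelineState, datetime.datetime | None]:
    """One scheduling iteration: backend errors no longer fail the run immediately."""
    now = datetime.datetime.now(datetime.timezone.utc)
    try:
        run_pipeline()
        return PipelineState.STARTED, None  # backend reachable, reset the wait clock
    except BackendNotReachableError:
        started_waiting = waiting_since or now
        state = decide_pipeline_state(
            waiting_since=started_waiting, max_wait=max_wait, now=now
        )
        return state, started_waiting
```

Remembering when the pipeline first started waiting is what lets the scheduler eventually give up after the configured timeout instead of waiting forever.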
Reviewed Changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
| test_scheduler_dask.py | Updated test to verify new WAITING_FOR_CLUSTER behavior and timeout functionality |
| conftest.py | Added fixture for configuring short cluster wait timeout in tests |
| dask.py | Added return type annotations and removed redundant error logging |
| _scheduler_base.py | Implemented core logic changes for handling backend connectivity issues with WAITING_FOR_CLUSTER state |
| settings.py | Added configuration setting for maximum cluster wait timeout |
Looks good, thanks.
thx
@mergify queue

🛑 Configuration not compatible with a branch protection setting
The branch protection setting

What do these changes do?
This PR changes the behavior of the dv-2 computational scheduler:
BEFORE: if the clusters-keeper, the dask scheduler, or rabbitmq was unreachable (e.g. during a deployment or service restart), the pipeline was immediately set to FAILED.
AFTER: in that case the pipeline is set to WAITING_FOR_CLUSTER and scheduling is retried; it is only set to FAILED once it has waited too long for the cluster.
A new setting COMPUTATIONAL_BACKEND_MAX_WAITING_FOR_CLUSTER_TIMEOUT can be used in case we should change that value (it currently defaults to 10 minutes).
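Since the diff is not shown here, this is a hedged sketch of what such a setting could look like with pydantic-settings; only the field name and the 10-minute default come from this PR, the class name and field metadata are assumptions:

```python
# Sketch only: the class name and description text are assumptions; the field
# name and the 10-minute default are taken from the PR description.
import datetime

from pydantic import Field
from pydantic_settings import BaseSettings


class ComputationalBackendSettings(BaseSettings):
    COMPUTATIONAL_BACKEND_MAX_WAITING_FOR_CLUSTER_TIMEOUT: datetime.timedelta = Field(
        default=datetime.timedelta(minutes=10),
        description="Maximum time a pipeline may stay in WAITING_FOR_CLUSTER "
        "before it is finally set to FAILED",
    )
```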
Tests driving changes:
- test_scheduler_dask.py (a hedged sketch of the verified behavior follows)
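This is not the real test_scheduler_dask.py but a hedged pytest sketch of the behavior being verified; it drives the illustrative helpers from the earlier sketch, assumed importable here as scheduler_sketch:

```python
# Hedged pytest sketch of the behavior the updated tests verify; it exercises
# the illustrative helpers from the earlier sketch, not the real dask scheduler.
import datetime

from scheduler_sketch import BackendNotReachableError, PipelineState, schedule_once


def _unreachable_backend() -> None:
    raise BackendNotReachableError


def test_pipeline_waits_then_fails_after_timeout() -> None:
    max_wait = datetime.timedelta(seconds=0.1)

    # first iteration: the backend is down, the pipeline waits instead of failing
    state, waiting_since = schedule_once(
        _unreachable_backend, waiting_since=None, max_wait=max_wait
    )
    assert state is PipelineState.WAITING_FOR_CLUSTER

    # pretend the pipeline has already waited longer than the configured timeout
    long_ago = waiting_since - datetime.timedelta(seconds=1)
    state, _ = schedule_once(
        _unreachable_backend, waiting_since=long_ago, max_wait=max_wait
    )
    assert state is PipelineState.FAILED
```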
Related issue/s

How to test
Dev-ops