🐛🎨Computational backend stability: improvements step 2 #8341
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## master #8341 +/- ##
===========================================
- Coverage 87.82% 68.71% -19.11%
===========================================
Files 1945 760 -1185
Lines 75526 34985 -40541
Branches 1312 175 -1137
===========================================
- Hits 66330 24041 -42289
- Misses 8801 10887 +2086
+ Partials 395 57 -338
Pull Request Overview
This PR improves computational backend stability by fixing issues identified in a new set of experiments, which ended at 1 FAILED for 499 SUCCESS. The changes focus on timeout adjustments and on refactoring task state tracking to improve reliability.
Key changes:
- Reduced Dask timeout from 35 to 10 seconds for faster failure detection
- Introduced `TaskStateTracker` model to better track task state transitions between previous and current states
- Refactored task processing logic to use the new state tracking model for improved consistency
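The state-tracking idea in the list above can be sketched as a small dataclass that keeps a task's previous and current state side by side. This is only an illustration: the field names, the `RunningState` values, and the helper methods are assumptions, not the definitions from `_models.py`.

```python
from dataclasses import dataclass
from enum import Enum


class RunningState(str, Enum):
    # Hypothetical subset of states, for illustration only.
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


# States from which a task does not move on (assumption for this sketch).
_COMPLETED_STATES = frozenset({RunningState.SUCCESS, RunningState.FAILED})


@dataclass(frozen=True)
class TaskStateTracker:
    # Assumed fields: the PR only states that previous and current
    # states are tracked together.
    task_id: str
    previous: RunningState
    current: RunningState

    def has_changed(self) -> bool:
        """True when the task moved to a different state."""
        return self.previous is not self.current

    def has_just_completed(self) -> bool:
        """True only on the transition into a completed state."""
        return (
            self.current in _COMPLETED_STATES
            and self.previous not in _COMPLETED_STATES
        )
```

Keeping both states in one object lets the scheduler decide on the transition itself (for example STARTED to SUCCESS) instead of comparing two separately fetched values.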
Reviewed Changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| `dask_client.py` | Reduced default Dask timeout from 35s to 10s |
| `_scheduler_dask.py` | Updated task processing to use TaskStateTracker, improved state handling |
| `_scheduler_base.py` | Refactored task triaging and processing to use TaskStateTracker model |
| `_models.py` | Added TaskStateTracker dataclass to track task state transitions |
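The effect of the shorter timeout in `dask_client.py` can be illustrated with plain `asyncio`: a call that hangs is abandoned after the timeout, so a lower value surfaces the failure sooner. This is a generic sketch, not the PR's code; the constant name `DASK_DEFAULT_TIMEOUT_S` and the helper `call_with_timeout` are hypothetical.

```python
import asyncio

# Hypothetical constant name; the value 10 is the new default from the PR.
DASK_DEFAULT_TIMEOUT_S = 10


async def call_with_timeout(awaitable, timeout_s: float = DASK_DEFAULT_TIMEOUT_S):
    """Return (True, result) on success, (False, None) if the call timed out."""
    try:
        result = await asyncio.wait_for(awaitable, timeout=timeout_s)
    except asyncio.TimeoutError:
        # The awaited call was cancelled; the caller can retry or mark it lost.
        return False, None
    return True, result


async def demo():
    # A slow call is given up on quickly, a fast one completes normally.
    slow = await call_with_timeout(asyncio.sleep(0.5, result="late"), timeout_s=0.01)
    fast = await call_with_timeout(asyncio.sleep(0, result="ok"), timeout_s=1)
    return slow, fast
```

The trade-off is that a timeout set too low can flag a merely slow backend as failed, so the value has to sit above normal response times.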
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_models.py
Looks good, thanks again.
Thanks!
e194334 to 85dc2c4



What do these changes do?
After a new set of experiments, the SUCCESS rate increased dramatically, to 1 FAILED/499 SUCCESS.
Nevertheless, a few issues were identified and are fixed in this PR.
Related issue/s
How to test
Dev-ops