@sanderegg sanderegg commented Sep 9, 2025

What do these changes do?

After a new set of experiments, the success rate increased dramatically: 1 FAILED against 499 SUCCESS.
Nevertheless, a few issues were identified, and they are fixed in this PR.

  • reduce the timeout when getting Dask data
  • use limit_gather instead of gather to limit the number of calls going down to the Dask backend
  • process_completed now resets the task to a later state instead of back to STARTED, because resetting to STARTED created a case where a task that never started (basically one that went straight to SUCCESS) would fail when creating heartbeats
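
For readers unfamiliar with the pattern, the first two points can be sketched roughly as below. This is only an illustration built on plain asyncio: `bounded_gather`, `limit` and `timeout_s` are hypothetical names, and the actual limit_gather helper used in this PR may have a different signature and behavior.

```python
import asyncio
from collections.abc import Awaitable, Iterable


async def bounded_gather(
    awaitables: Iterable[Awaitable], *, limit: int, timeout_s: float
) -> list:
    """Hypothetical sketch: gather with bounded concurrency and a per-call timeout."""
    # At most `limit` awaitables are in flight at any time, so the backend
    # is not flooded with simultaneous calls.
    semaphore = asyncio.Semaphore(limit)

    async def _run(aw: Awaitable):
        async with semaphore:
            # A short timeout makes a stuck call fail fast instead of
            # blocking the whole batch.
            return await asyncio.wait_for(aw, timeout=timeout_s)

    # return_exceptions=True keeps one failing call from cancelling the rest;
    # failures come back as exception instances, in input order.
    return await asyncio.gather(
        *(_run(aw) for aw in awaitables), return_exceptions=True
    )
```

With `return_exceptions=True` the caller has to check each result for exception instances, which suits a scheduler loop that wants to keep processing the remaining tasks.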

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 9, 2025
@sanderegg sanderegg self-assigned this Sep 9, 2025
@sanderegg sanderegg added a:director-v2 issue related with the director-v2 service a:computational clusters labels Sep 9, 2025

codecov bot commented Sep 9, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.71%. Comparing base (9ebb830) to head (85dc2c4).
⚠️ Report is 1 commit behind head on master.

❗ There is a different number of reports uploaded between BASE (9ebb830) and HEAD (85dc2c4).

HEAD has 31 uploads less than BASE

Flag       BASE (9ebb830)  HEAD (85dc2c4)
unittests  32              1
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #8341       +/-   ##
===========================================
- Coverage   87.82%   68.71%   -19.11%     
===========================================
  Files        1945      760     -1185     
  Lines       75526    34985    -40541     
  Branches     1312      175     -1137     
===========================================
- Hits        66330    24041    -42289     
- Misses       8801    10887     +2086     
+ Partials      395       57      -338     
Flag Coverage Δ
integrationtests 64.03% <100.00%> (+0.05%) ⬆️
unittests 84.54% <96.66%> (-1.96%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 76.95% <ø> (-8.08%) ⬇️
agent ∅ <ø> (∅)
api_server ∅ <ø> (∅)
autoscaling ∅ <ø> (∅)
catalog ∅ <ø> (∅)
clusters_keeper ∅ <ø> (∅)
dask_sidecar ∅ <ø> (∅)
datcore_adapter ∅ <ø> (∅)
director ∅ <ø> (∅)
director_v2 90.92% <100.00%> (+0.01%) ⬆️
dynamic_scheduler ∅ <ø> (∅)
dynamic_sidecar 81.87% <ø> (-8.59%) ⬇️
efs_guardian ∅ <ø> (∅)
invitations ∅ <ø> (∅)
payments ∅ <ø> (∅)
resource_usage_tracker ∅ <ø> (∅)
storage ∅ <ø> (∅)
webclient ∅ <ø> (∅)
webserver 58.88% <ø> (-29.09%) ⬇️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

mergify bot commented Sep 9, 2025

🧪 CI Insights

Here's what we observed from your CI run for 85dc2c4.

✅ Passed Jobs With Interesting Signals

Pipeline: CI
Job: unit-tests
Signal: Base branch is broken, but the job passed. Looks like this might be a real fix 💪
Health on master: Broken
Retries: 0

@sanderegg sanderegg requested a review from Copilot September 9, 2025 17:47

Copilot AI left a comment

Pull Request Overview

This PR implements computational backend stability improvements by addressing issues identified through experiments in which 1 run failed against 499 successes. The changes focus on timeout adjustments and on refactoring task state tracking to improve reliability.

Key changes:

  • Reduced Dask timeout from 35 to 10 seconds for faster failure detection
  • Introduced TaskStateTracker model to better track task state transitions between previous and current states
  • Refactored task processing logic to use the new state tracking model for improved consistency
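
To make the state-tracking idea concrete, here is a minimal sketch of what such a model could look like. It is not the TaskStateTracker added in _models.py: the class, enum, field and property names below are all hypothetical, chosen only to show how keeping the previous and the current state together makes a "went straight to SUCCESS" task easy to detect.

```python
from dataclasses import dataclass
from enum import Enum


class RunningStateSketch(str, Enum):
    # Hypothetical subset of task states, for illustration only.
    PENDING = "PENDING"
    STARTED = "STARTED"
    SUCCESS = "SUCCESS"
    FAILED = "FAILED"


@dataclass(frozen=True)
class TaskStateTrackerSketch:
    """Hypothetical model pairing a task's previous and current state."""

    task_id: str
    previous: RunningStateSketch
    current: RunningStateSketch

    @property
    def changed(self) -> bool:
        # True when the task moved to a different state since the last check.
        return self.previous is not self.current

    @property
    def skipped_started(self) -> bool:
        # True for a task observed as SUCCESS without ever being seen as
        # STARTED, i.e. the case where no heartbeat should be expected.
        return (
            self.current is RunningStateSketch.SUCCESS
            and self.previous
            not in (RunningStateSketch.STARTED, RunningStateSketch.SUCCESS)
        )
```

For example, `TaskStateTrackerSketch("task-1", RunningStateSketch.PENDING, RunningStateSketch.SUCCESS).skipped_started` evaluates to `True`, whereas the same check for a task that went from STARTED to SUCCESS is `False`.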

Reviewed Changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 2 comments.

File                Description
dask_client.py      Reduced default Dask timeout from 35s to 10s
_scheduler_dask.py  Updated task processing to use TaskStateTracker, improved state handling
_scheduler_base.py  Refactored task triaging and processing to use TaskStateTracker model
_models.py          Added TaskStateTracker dataclass to track task state transitions

@sanderegg sanderegg marked this pull request as ready for review September 9, 2025 17:57

@wvangeit wvangeit left a comment

Looks good, thanks again.

@bisgaard-itis bisgaard-itis left a comment

Thanks!

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse-step2 branch from e194334 to 85dc2c4 Compare September 10, 2025 08:29
@sanderegg sanderegg merged commit 823c7e6 into ITISFoundation:master Sep 10, 2025
41 checks passed
@sanderegg sanderegg deleted the computational-backend/improvements-diverse-step2 branch September 10, 2025 08:30