
Conversation

@sanderegg sanderegg (Member) commented Jun 26, 2025

Context

While testing, the following issues were identified:

  • in production, multiple computational pipelines of the same project were found (this should never happen),
  • this most likely caused the computational scheduler manager to stop scheduling,
  • cancelling computational pipelines revealed an issue where changing some database rows took longer than expected (possibly due to the asyncpg migration), as well as a design issue: a DB row was changed after requesting scheduling via RabbitMQ, so the change was effectively committed to the DB only after the scheduling was done, thus generating an incorrect DB row. The order was reversed, which should make cancelling a pipeline much more reactive.
  • the pipeline state surfaced to the frontend was generated from the comp_tasks table, due to the historical way we transmitted events from that table to the frontend. This led to inconsistencies where the frontend would already show the start button as enabled even though the computational scheduler had not finished scheduling, which allowed the same pipeline to be scheduled multiple times and caused inconsistencies and some undefined behavior. This PR fixes this by always returning the pipeline state from the comp_runs table instead, so that it becomes the ground truth.
  • To this end, a new RabbitMQ event was introduced so that the frontend is notified when the computational scheduler changes the run state.

What do these changes do?

This pull request introduces significant changes to computational pipeline management and its integration with RabbitMQ messaging. Key updates include a new RabbitMQ message type for pipeline status, refactored pipeline state handling, and improved notification mechanisms for computational pipeline updates.

RabbitMQ Integration Enhancements:

  • Added ComputationalPipelineStatusMessage class in models_library/rabbitmq_messages.py to represent pipeline status messages, including a routing key based on project_id.
  • Introduced publish_pipeline_scheduling_state function in utils/rabbitmq.py to send pipeline status updates via RabbitMQ.
  • Integrated a new RabbitMQ message parser _computational_pipeline_status_message_parser in notifications/_rabbitmq_exclusive_queue_consumers.py to handle pipeline status updates and notify users (a rough sketch of the message and publisher follows this list).
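
For illustration, here is a minimal sketch of what such a message and its publisher could look like; the field names, the base class and the rabbitmq_client.publish interface are assumptions for this example, not the exact code of the PR:

    from uuid import UUID

    from pydantic import BaseModel


    class ComputationalPipelineStatusMessage(BaseModel):
        """Tells listeners that the scheduler changed the run state of a pipeline."""

        user_id: int
        project_id: UUID
        run_result: str  # e.g. "STARTED", "SUCCESS", "FAILED", "ABORTED" (assumed values)

        def routing_key(self) -> str:
            # routing by project_id lets consumers filter updates per project
            return f"{self.project_id}"


    async def publish_pipeline_scheduling_state(
        rabbitmq_client, *, user_id: int, project_id: UUID, run_result: str
    ) -> None:
        # rabbitmq_client.publish(exchange_name, message) is an assumed client interface
        message = ComputationalPipelineStatusMessage(
            user_id=user_id, project_id=project_id, run_result=run_result
        )
        await rabbitmq_client.publish(type(message).__name__, message)

On the webserver side, the new _computational_pipeline_status_message_parser would then decode such a message and forward the state change to the clients connected to that project.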

Pipeline State Refactoring:

  • Replaced task-based pipeline state computation with CompRunsRepository for determining pipeline state across multiple endpoints in computations.py.
  • Simplified pipeline state retrieval logic by directly accessing CompRunsRepository in various computation-related functions (see the sketch after this list).
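
As an illustration of the pattern, a minimal sketch of such a lookup is shown below; it mirrors the snippet quoted in the review discussion further down, while the ComputationalRunNotFoundError placeholder and the "NOT_STARTED" fallback value are assumptions for this example:

    import contextlib


    class ComputationalRunNotFoundError(Exception):
        """Placeholder for the repository error raised when a project was never run."""


    async def get_pipeline_state(comp_runs_repo, *, user_id: int, project_id) -> str:
        # default state when no comp_runs row exists yet for this project
        pipeline_state = "NOT_STARTED"
        with contextlib.suppress(ComputationalRunNotFoundError):
            last_run = await comp_runs_repo.get(user_id=user_id, project_id=project_id)
            # comp_runs.result is now the single source of truth for the frontend
            pipeline_state = last_run.result
        return pipeline_state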

Scheduler Improvements:

  • Enhanced _set_run_result in modules/comp_scheduler/_scheduler_base.py to publish pipeline completion logs and scheduling state updates (sketched after this list).
  • Refactored _schedule_tasks_to_stop to handle tasks that can be instantly marked as aborted and return updated task states.
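
A hedged sketch of that completion path is given below; the three steps are injected as callables because the actual repository and publisher helpers are not reproduced here, so their names and signatures should be treated as assumptions:

    from collections.abc import Awaitable, Callable


    async def _set_run_result(
        persist_run_result: Callable[[str], Awaitable[None]],
        publish_pipeline_log: Callable[[str], Awaitable[None]],
        publish_scheduling_state: Callable[[str], Awaitable[None]],
        *,
        run_result: str,
    ) -> None:
        # 1. persist the final state in comp_runs first, so it is the ground truth
        await persist_run_result(run_result)
        # 2. surface a human-readable completion message in the pipeline logs
        await publish_pipeline_log(f"Pipeline finished with state {run_result}")
        # 3. new in this PR: notify the frontend that the run state changed
        await publish_scheduling_state(run_result)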

Code Cleanup:

  • Simplified _get_pipeline_at_db in modules/comp_scheduler/_manager.py by removing redundant variable assignments.
  • Removed outdated comments and redundant code in request_pipeline_scheduling in modules/comp_scheduler/_publisher.py.

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Engage milestone Jun 26, 2025
@sanderegg sanderegg self-assigned this Jun 26, 2025
@sanderegg sanderegg added the a:webserver and a:director-v2 labels Jun 26, 2025
codecov bot commented Jun 26, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 89.38%. Comparing base (6b1b81e) to head (54551df).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7996      +/-   ##
==========================================
+ Coverage   87.39%   89.38%   +1.98%     
==========================================
  Files        1578     1379     -199     
  Lines       62317    57741    -4576     
  Branches     1011      477     -534     
==========================================
- Hits        54464    51610    -2854     
+ Misses       7541     6010    -1531     
+ Partials      312      121     -191     
Flag Coverage Δ
integrationtests 64.25% <74.54%> (+0.09%) ⬆️
unittests 87.68% <66.66%> (+1.87%) ⬆️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library 93.27% <80.00%> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration 69.92% <ø> (ø)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 84.99% <ø> (-0.06%) ⬇️
agent 96.29% <ø> (ø)
api_server 92.64% <ø> (ø)
autoscaling 96.03% <ø> (ø)
catalog 92.29% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar ∅ <ø> (∅)
datcore_adapter 97.94% <ø> (ø)
director 76.73% <ø> (ø)
director_v2 91.04% <86.00%> (-0.02%) ⬇️
dynamic_scheduler ∅ <ø> (∅)
dynamic_sidecar 88.33% <ø> (-1.77%) ⬇️
efs_guardian 89.65% <ø> (∅)
invitations 93.60% <ø> (ø)
payments 92.57% <ø> (ø)
resource_usage_tracker 89.05% <ø> (-0.11%) ⬇️
storage 86.35% <ø> (∅)
webclient ∅ <ø> (∅)
webserver 87.65% <20.00%> (+<0.01%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sanderegg sanderegg marked this pull request as ready for review June 26, 2025 14:16
@wvangeit wvangeit (Contributor) left a comment

Looks good, thanks for the effort @sanderegg.
Just a small remark. Isn't there a way to test some of this new functionality?

@matusdrobuliak66 matusdrobuliak66 (Collaborator) left a comment

Maybe I missed it, but I don't see any backend protection. I know this shouldn't happen from the frontend anymore, but I would still add a check on the backend when a pipeline is about to be scheduled - to ensure there's not already one scheduled or running in the system. If there is, it should return an error.

Ah now I see it, this is it right?

    with contextlib.suppress(ComputationalRunNotFoundError):
        last_run = await comp_runs_repo.get(
            user_id=computation.user_id, project_id=computation.project_id
        )
        pipeline_state = last_run.result

        if utils.is_pipeline_running(pipeline_state):
            raise HTTPException(
                status_code=status.HTTP_409_CONFLICT,
                detail=f"Project {computation.project_id} already started, current state is {pipeline_state}",
            )

@matusdrobuliak66 matusdrobuliak66 (Collaborator) left a comment

Thanks, looks good! 👍 Maybe in the next iterations you can start to use the new comp_runs_snapshot_tasks.

@odeimaiz odeimaiz (Member) left a comment

Thanks!

Let's check tomorrow how these states are mapped to the Start/Stop buttons.

@pcrespov pcrespov (Member) left a comment

Thx. Good analysis!

@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 93ed455 to 27eda3e Compare June 27, 2025 06:24
@sanderegg sanderegg (Member, Author) commented Jun 27, 2025

Looks good, thanks for the effort @sanderegg. Just a small remark. Isn't there a way to test some of this new functionality?

@wvangeit
There is actually no new functionality here; it is more about improving old code that grew a bit inconsistent with the chosen technologies over multiple edits. Tests already exist; I will add some for the RabbitMQ messages though.

@sanderegg sanderegg (Member, Author) commented

Maybe I missed it, but I don't see any backend protection. I know this shouldn't happen from the frontend anymore, but I would still add a check on the backend when a pipeline is about to be scheduled - to ensure there's not already one scheduled or running in the system. If there is, it should return an error.

Ah now I see it, this is it right?

    with contextlib.suppress(ComputationalRunNotFoundError):
        last_run = await comp_runs_repo.get(
            user_id=computation.user_id, project_id=computation.project_id
        )
        pipeline_state = last_run.result

        if utils.is_pipeline_running(pipeline_state):
            raise HTTPException(
                status_code=status.HTTP_409_CONFLICT,
                detail=f"Project {computation.project_id} already started, current state is {pipeline_state}",
            )

@matusdrobuliak66 there were actually already protections in place. The problem originates from the fact that comp_runs came much later than comp_tasks/comp_pipeline, and that the webserver observes the comp_tasks table, which is locked in place at least until all the legacy dynamic services are gone. The comp_runs result column was not completely in sync with it, and that is basically the problem.

@bisgaard-itis bisgaard-itis (Contributor) left a comment

Thanks for the fix

@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 27eda3e to 38f0943 Compare June 27, 2025 08:01
@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 38f0943 to 55a9b2d Compare June 27, 2025 12:05
@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 9fddcb5 to 54551df Compare June 27, 2025 13:07
@sanderegg sanderegg (Member, Author) commented

@mergify queue

@mergify mergify bot (Contributor) commented Jun 27, 2025

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 2f1b484

@mergify mergify bot merged commit 2f1b484 into ITISFoundation:master Jun 27, 2025
147 of 152 checks passed
@sanderegg sanderegg added the release label Jun 30, 2025
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Aug 5, 2025

Labels

  • 🤖-automerge: marks PR as ready to be merged for Mergify
  • a:director-v2: issue related with the director-v2 service
  • a:webserver: webserver's codebase. Assigning the area is particularly useful for bugs
  • release: Preparation for pre-release/release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Job state returns UNKNOWN
  • Computational services are in WAITING_FOR_CLUSTER forever

7 participants