
Conversation

@sanderegg sanderegg (Member) commented Jun 26, 2025

Context

While testing, the following issues were identified:

  • in production, multiple computational pipelines of the same project were found (this should never happen),
  • this most likely caused the computational scheduler manager to stop scheduling,
  • cancelling computational pipelines revealed an issue where changing some database rows took longer than expected (possibly due to the asyncpg migration), as well as a design issue: a DB row was changed after requesting scheduling via RabbitMQ, so the change was effectively committed to the DB only after the scheduling was done, thus generating an incorrect DB row. The order was reversed, which should make cancelling a pipeline much more reactive.
  • the pipeline state surfaced to the frontend was generated from the comp_tasks table, due to the historical way we transmitted events from that table to the frontend. This led to inconsistencies where the frontend would already show the start button as enabled even though the computational scheduler had not finished scheduling, which allowed the same pipeline to be scheduled multiple times and caused inconsistencies and some undefined behavior. This PR fixes this by always returning the pipeline state from the comp_runs table instead, so that it becomes the ground truth.
  • To this end, a new RabbitMQ event was introduced so that the frontend is notified when the computational scheduler changes the run state.

What do these changes do?

This pull request introduces significant changes to computational pipeline management and its integration with RabbitMQ messaging. Key updates include a new RabbitMQ message type for pipeline status, refactored pipeline state handling, and improved notification mechanisms for computational pipeline updates.

RabbitMQ Integration Enhancements:

  • Added ComputationalPipelineStatusMessage class in models_library/rabbitmq_messages.py to represent pipeline status messages, including a routing key based on project_id.
  • Introduced publish_pipeline_scheduling_state function in utils/rabbitmq.py to send pipeline status updates via RabbitMQ.
  • Integrated a new RabbitMQ message parser _computational_pipeline_status_message_parser in notifications/_rabbitmq_exclusive_queue_consumers.py to handle pipeline status updates and notify users (a rough sketch of the message and publisher follows this list).
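
For illustration, here is a minimal sketch of what such a message and its publisher could look like; the field names, the base class and the rabbitmq_client.publish interface are assumptions for this example, not the exact code of the PR:

    from uuid import UUID

    from pydantic import BaseModel


    class ComputationalPipelineStatusMessage(BaseModel):
        """Tells listeners that the scheduler changed the run state of a pipeline."""

        user_id: int
        project_id: UUID
        run_result: str  # e.g. "STARTED", "SUCCESS", "FAILED", "ABORTED" (assumed values)

        def routing_key(self) -> str:
            # routing by project_id lets consumers filter updates per project
            return f"{self.project_id}"


    async def publish_pipeline_scheduling_state(
        rabbitmq_client, *, user_id: int, project_id: UUID, run_result: str
    ) -> None:
        # rabbitmq_client.publish(exchange_name, message) is an assumed client interface
        message = ComputationalPipelineStatusMessage(
            user_id=user_id, project_id=project_id, run_result=run_result
        )
        await rabbitmq_client.publish(type(message).__name__, message)

On the webserver side, the new _computational_pipeline_status_message_parser would then decode such a message and forward the state change to the clients connected to that project.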

Pipeline State Refactoring:

  • Replaced task-based pipeline state computation with CompRunsRepository for determining pipeline state across multiple endpoints in computations.py.
  • Simplified pipeline state retrieval logic by directly accessing CompRunsRepository in various computation-related functions (see the sketch after this list).
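
As an illustration of the pattern, a minimal sketch of such a lookup is shown below; it mirrors the snippet quoted in the review discussion further down, while the ComputationalRunNotFoundError placeholder and the "NOT_STARTED" fallback value are assumptions for this example:

    import contextlib


    class ComputationalRunNotFoundError(Exception):
        """Placeholder for the repository error raised when a project was never run."""


    async def get_pipeline_state(comp_runs_repo, *, user_id: int, project_id) -> str:
        # default state when no comp_runs row exists yet for this project
        pipeline_state = "NOT_STARTED"
        with contextlib.suppress(ComputationalRunNotFoundError):
            last_run = await comp_runs_repo.get(user_id=user_id, project_id=project_id)
            # comp_runs.result is now the single source of truth for the frontend
            pipeline_state = last_run.result
        return pipeline_state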

Scheduler Improvements:

  • Enhanced _set_run_result in modules/comp_scheduler/_scheduler_base.py to publish pipeline completion logs and scheduling state updates (sketched after this list).
  • Refactored _schedule_tasks_to_stop to handle tasks that can be instantly marked as aborted and return updated task states.
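
A hedged sketch of that completion path is given below; the three steps are injected as callables because the actual repository and publisher helpers are not reproduced here, so their names and signatures should be treated as assumptions:

    from collections.abc import Awaitable, Callable


    async def _set_run_result(
        persist_run_result: Callable[[str], Awaitable[None]],
        publish_pipeline_log: Callable[[str], Awaitable[None]],
        publish_scheduling_state: Callable[[str], Awaitable[None]],
        *,
        run_result: str,
    ) -> None:
        # 1. persist the final state in comp_runs first, so it is the ground truth
        await persist_run_result(run_result)
        # 2. surface a human-readable completion message in the pipeline logs
        await publish_pipeline_log(f"Pipeline finished with state {run_result}")
        # 3. new in this PR: notify the frontend that the run state changed
        await publish_scheduling_state(run_result)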

Code Cleanup:

  • Simplified _get_pipeline_at_db in modules/comp_scheduler/_manager.py by removing redundant variable assignments.
  • Removed outdated comments and redundant code in request_pipeline_scheduling in modules/comp_scheduler/_publisher.py.

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Engage milestone Jun 26, 2025
@sanderegg sanderegg self-assigned this Jun 26, 2025
@sanderegg sanderegg added the a:webserver and a:director-v2 labels Jun 26, 2025
codecov bot commented Jun 26, 2025

Codecov Report

Attention: Patch coverage is 80.00000% with 12 lines in your changes missing coverage. Please review.

Project coverage is 89.38%. Comparing base (6b1b81e) to head (54551df).
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #7996      +/-   ##
==========================================
+ Coverage   87.39%   89.38%   +1.98%     
==========================================
  Files        1578     1379     -199     
  Lines       62317    57741    -4576     
  Branches     1011      477     -534     
==========================================
- Hits        54464    51610    -2854     
+ Misses       7541     6010    -1531     
+ Partials      312      121     -191     
Flag Coverage Δ
integrationtests 64.25% <74.54%> (+0.09%) ⬆️
unittests 87.68% <66.66%> (+1.87%) ⬆️
Components Coverage Δ
api ∅ <ø> (∅)
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library 93.27% <80.00%> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration 69.92% <ø> (ø)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 84.99% <ø> (-0.06%) ⬇️
agent 96.29% <ø> (ø)
api_server 92.64% <ø> (ø)
autoscaling 96.03% <ø> (ø)
catalog 92.29% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar ∅ <ø> (∅)
datcore_adapter 97.94% <ø> (ø)
director 76.73% <ø> (ø)
director_v2 91.04% <86.00%> (-0.02%) ⬇️
dynamic_scheduler ∅ <ø> (∅)
dynamic_sidecar 88.33% <ø> (-1.77%) ⬇️
efs_guardian 89.65% <ø> (∅)
invitations 93.60% <ø> (ø)
payments 92.57% <ø> (ø)
resource_usage_tracker 89.05% <ø> (-0.11%) ⬇️
storage 86.35% <ø> (∅)
webclient ∅ <ø> (∅)
webserver 87.65% <20.00%> (+<0.01%) ⬆️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@sanderegg sanderegg marked this pull request as ready for review June 26, 2025 14:16
@wvangeit wvangeit (Contributor) left a comment

Looks good, thanks for the effort @sanderegg.
Just a small remark. Isn't there a way to test some of this new functionality?

@matusdrobuliak66 matusdrobuliak66 (Collaborator) left a comment

Maybe I missed it, but I don't see any backend protection. I know this shouldn't happen from the frontend anymore, but I would still add a check on the backend when a pipeline is about to be scheduled - to ensure there's not already one scheduled or running in the system. If there is, it should return an error.

Ah now I see it, this is it right?

    with contextlib.suppress(ComputationalRunNotFoundError):
        last_run = await comp_runs_repo.get(
            user_id=computation.user_id, project_id=computation.project_id
        )
        pipeline_state = last_run.result

        if utils.is_pipeline_running(pipeline_state):
            raise HTTPException(
                status_code=status.HTTP_409_CONFLICT,
                detail=f"Project {computation.project_id} already started, current state is {pipeline_state}",
            )

@matusdrobuliak66 matusdrobuliak66 (Collaborator) left a comment

Thanks, looks good! 👍 Maybe in the next iterations you can start to use the new comp_runs_snapshot_tasks.

@odeimaiz odeimaiz (Member) left a comment

Thanks!

Let's check tomorrow how these states are mapped to the Start/Stop buttons.

@pcrespov pcrespov (Member) left a comment

Thx. Good analysis!

@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 93ed455 to 27eda3e Compare June 27, 2025 06:24
@sanderegg sanderegg (Member, Author) commented Jun 27, 2025

Looks good, thanks for the effort @sanderegg. Just a small remark. Isn't there a way to test some of this new functionality?

@wvangeit
There is actually no new functionality here; it is more about improving old code that grew a bit inconsistent with the chosen technologies over multiple edits. Tests already exist; I will add some for the RabbitMQ messages though.

@sanderegg sanderegg (Member, Author) commented

Maybe I missed it, but I don't see any backend protection. I know this shouldn't happen from the frontend anymore, but I would still add a check on the backend when a pipeline is about to be scheduled - to ensure there's not already one scheduled or running in the system. If there is, it should return an error.

Ah now I see it, this is it right?

    with contextlib.suppress(ComputationalRunNotFoundError):
        last_run = await comp_runs_repo.get(
            user_id=computation.user_id, project_id=computation.project_id
        )
        pipeline_state = last_run.result

        if utils.is_pipeline_running(pipeline_state):
            raise HTTPException(
                status_code=status.HTTP_409_CONFLICT,
                detail=f"Project {computation.project_id} already started, current state is {pipeline_state}",
            )

@matusdrobuliak66 there were actually already protections in place. The problem originates from the fact that comp_runs came much later than comp_tasks/comp_pipeline, and that the webserver observes the comp_tasks table, which is locked in place at least until all the legacy dynamic services are gone. The comp_runs result column was not completely in sync with it, and that is basically the problem.

@bisgaard-itis bisgaard-itis (Contributor) left a comment

Thanks for the fix

@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 27eda3e to 38f0943 Compare June 27, 2025 08:01
@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 38f0943 to 55a9b2d Compare June 27, 2025 12:05
@sanderegg sanderegg force-pushed the bugfix/director-v2-stops-scheduling branch from 9fddcb5 to 54551df Compare June 27, 2025 13:07
@sanderegg sanderegg (Member, Author) commented

@mergify queue

@mergify mergify bot (Contributor) commented Jun 27, 2025

queue

✅ The pull request has been merged automatically

The pull request has been merged automatically at 2f1b484

@mergify mergify bot merged commit 2f1b484 into ITISFoundation:master Jun 27, 2025
147 of 152 checks passed
@sanderegg sanderegg added the release label Jun 30, 2025
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Aug 5, 2025

Labels

  • 🤖-automerge: marks PR as ready to be merged for Mergify
  • a:director-v2: issue related with the director-v2 service
  • a:webserver: webserver's codebase. Assigning the area is particularly useful for bugs
  • release: Preparation for pre-release/release

Projects

None yet

Development

Successfully merging this pull request may close these issues.

  • Job state returns UNKNOWN
  • Computational services are in WAITING_FOR_CLUSTER forever

7 participants