Skip to content

Conversation

@sanderegg
Copy link
Member

@sanderegg sanderegg commented Sep 10, 2025

This PR introduces a distributed semaphore support to the service library, enabling concurrency limits for async tasks using Redis.
It adds a new DistributedSemaphore class, associated error types, and Lua scripts to ensure atomic semaphore operations.

Distributed Semaphore Feature

  • Added DistributedSemaphore class in servicelib/redis/_semaphore.py to provide a Redis-backed distributed semaphore for limiting concurrent operations across processes/instances. This includes context manager support and atomic acquire/release/renew/count operations via Lua scripts.
  • Introduced new semaphore-related error classes: SemaphoreAcquisitionError, SemaphoreNotAcquiredError, and SemaphoreLostError in servicelib/redis/_errors.py.
  • Adds the new with_limited_concurrency decorator and with_limited_concurrency_cm decorator for usage with coroutines/async context managers.

Director-v2

  • Use the DistributedSemaphore to limit the access to each user/wallet combination of a dask-scheduler. This shall improve the performance of the director-v2 computational scheduler by limiting its wild access to the underlying dask-scheduler. The latter would not be able to cope with hundreds of concurrent calls to it. This will now ensure multiple replicas of the director-v2 are not accessing each dask-scheduler by more than 20 at a time.

How the Semaphore Works

Semaphore Representation: Each semaphore is a Redis sorted set (ZSET), where instance IDs are members and their acquisition timestamps are scores. This allows for time-based expiration and ordering of holders.

Acquisition (acquire_semaphore.lua): On acquire, expired holders are cleaned, capacity is checked, and if available, the instance is inserted into the ZSET and a separate holder key (with TTL) is set. Success/failure, current counts, and cleanup information are returned atomically.

Counting (count_semaphore.lua): Expired holders are removed, and the ZSET cardinality is counted for the active holders.

Release (release_semaphore.lua): Checks if the instance is a current holder, cleans expired, then deletes ZSET entry and holder key. Handles already released and expired situations gracefully.

Renewal (renew_semaphore.lua): Validates current holding, checks if the holder key has expired, and if not, renews both the ZSET score and holder key TTL.

References:

What do these changes do?

Related issue/s

How to test

make devenv
source .venv/bin/activate
cd packages/service-library
make "install-dev[all]"
clear && pytest tests/redis/test_semaphore*

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 10, 2025
@sanderegg sanderegg self-assigned this Sep 10, 2025
@sanderegg sanderegg added a:director-v2 issue related with the director-v2 service a:computational clusters labels Sep 10, 2025
@codecov
Copy link

codecov bot commented Sep 10, 2025

Codecov Report

❌ Patch coverage is 99.17012% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.79%. Comparing base (4600d77) to head (701c798).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8349      +/-   ##
==========================================
- Coverage   87.86%   87.79%   -0.08%     
==========================================
  Files        1947     1521     -426     
  Lines       75673    63198   -12475     
  Branches     1322      679     -643     
==========================================
- Hits        66487    55482   -11005     
+ Misses       8789     7482    -1307     
+ Partials      397      234     -163     
Flag Coverage Δ
integrationtests 63.92% <92.85%> (+0.01%) ⬆️
unittests 86.20% <99.17%> (-0.33%) ⬇️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library 72.59% <99.55%> (+0.84%) ⬆️
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 84.99% <ø> (+0.05%) ⬆️
agent 93.53% <ø> (ø)
api_server 91.93% <ø> (ø)
autoscaling 95.77% <ø> (ø)
catalog 92.36% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar 92.37% <ø> (ø)
datcore_adapter 97.94% <ø> (ø)
director 75.81% <ø> (ø)
director_v2 90.94% <92.85%> (-0.02%) ⬇️
dynamic_scheduler 96.27% <ø> (ø)
dynamic_sidecar 90.39% <ø> (-0.07%) ⬇️
efs_guardian 89.62% <ø> (ø)
invitations 91.44% <ø> (ø)
payments 92.61% <ø> (ø)
resource_usage_tracker 92.18% <ø> (ø)
storage 86.74% <ø> (+0.29%) ⬆️
webclient ∅ <ø> (∅)
webserver 87.96% <ø> (+<0.01%) ⬆️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 4600d77...701c798. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mergify
Copy link
Contributor

mergify bot commented Sep 10, 2025

🧪 CI Insights

Here's what we observed from your CI run for 701c798.

✅ Passed Jobs With Interesting Signals

Pipeline Job Signal Health on master Retries 🔍 CI Insights 📄 Logs
CI unit-tests Base branch is broken, but retries were needed. Could be early signs of flakiness 👀 Broken 1 View View

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse-step3 branch 3 times, most recently from 69434bf to 4d177b9 Compare September 11, 2025 21:12
@sanderegg sanderegg requested a review from Copilot September 11, 2025 21:13
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR introduces performance improvements to the computational backend by implementing a Redis-based distributed semaphore system to limit concurrent access and adding a new docker image source for redis-commander.

  • Implements a comprehensive distributed semaphore system using Redis with Lua scripts for atomic operations
  • Adds a decorator with_limited_concurrency to control concurrent function execution across processes
  • Updates docker-compose configurations to use a different redis-commander image source

Reviewed Changes

Copilot reviewed 23 out of 26 changed files in this pull request and generated 3 comments.

Show a summary per file
File Description
services/web/server/tests/unit/with_dbs/docker-compose-devel.yml Updates redis-commander image source and adds user configuration
services/docker-compose-ops.yml Updates redis-commander image source and adds user configuration
services/director-v2/src/simcore_service_director_v2/modules/comp_scheduler/_scheduler_dask.py Integrates distributed semaphore to limit concurrent cluster client acquisitions
packages/service-library/tests/redis/test_semaphore_decorator.py Comprehensive test suite for the semaphore decorator functionality
packages/service-library/tests/redis/test_semaphore.py Test suite for the core distributed semaphore implementation
packages/service-library/tests/redis/test_decorators.py Minor spelling correction in comment
packages/service-library/tests/redis/conftest.py Adds test fixtures for semaphore testing
packages/service-library/src/servicelib/redis/lua/*.lua Atomic Lua scripts for semaphore operations (acquire, release, renew, count)
packages/service-library/src/servicelib/redis/_*.py Core semaphore implementation with decorator, errors, and utilities
.github/instructions/*.md Restructures GitHub Copilot instructions into separate files

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse-step3 branch from 3ace35b to 03b3d4b Compare September 12, 2025 09:10
@sanderegg sanderegg requested a review from Copilot September 12, 2025 09:43
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 26 out of 29 changed files in this pull request and generated 2 comments.

@sanderegg sanderegg requested a review from Copilot September 12, 2025 09:59
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 28 out of 30 changed files in this pull request and generated 4 comments.

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse-step3 branch 3 times, most recently from b674c21 to 399af1a Compare September 15, 2025 13:50
@sanderegg sanderegg requested a review from Copilot September 15, 2025 16:29
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

Copilot reviewed 19 out of 19 changed files in this pull request and generated 4 comments.


Tip: Customize your code reviews with copilot-instructions.md. Create the file or learn how to get started.

@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse-step3 branch 2 times, most recently from f89e4a0 to 5b96c97 Compare September 16, 2025 10:20
@sanderegg sanderegg marked this pull request as ready for review September 16, 2025 12:41
@sanderegg sanderegg force-pushed the computational-backend/improvements-diverse-step3 branch from b60cec1 to 701c798 Compare September 16, 2025 19:39
@sonarqubecloud
Copy link

@sanderegg sanderegg requested a review from GitHK September 17, 2025 05:26
@sanderegg sanderegg merged commit 662c3b5 into ITISFoundation:master Sep 17, 2025
145 of 148 checks passed
@sanderegg sanderegg deleted the computational-backend/improvements-diverse-step3 branch September 17, 2025 05:27
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a:computational clusters a:director-v2 issue related with the director-v2 service

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants