Skip to content

Conversation

@sanderegg
Copy link
Member

@sanderegg sanderegg commented Sep 22, 2025

What do these changes do?

  • implement Fair queuing in semaphores to prevent starvation when running large amount of computational jobs
  • the current distributed semaphore exposes a flaw when the number of clients is large, since it is not following fair principle (e.g. User X might come before User A because he's luckier), any client asking for the semaphore may acquire it before another that is already waiting, leading to so-called starvation, sometimes an unlucky client may have to wait a very long time or even forever,
  • this creates some cascading timeout effects in the computational scheduler,
  • this PR changes/enhance the distributed semaphore by implementing queuing mechanism in Redis,
  • In redis the following entries may be seen now denoting a semaphore:
    image

What is fair queuing for a distributed semaphore

  • perfect fair queuing means that FIFO is implemented, if user A tries to access the semaphore and user B comes a few bits later, user A shall get it first

What this new implementation of the distributed semaphore does

  • our client to Redis is done via redis-py library which defines a so-called socket timeout of 30seconds, This was removed as it is unnecesary, and generates weird unwanted issues, also not the default socket timeout in redis-py is not set, probably for this very reason,
  • any request to the redis backend via this library is affected by it,
  • a semaphore is a set of permits that can be acquired by a defined number of clients, the next clients have to wait,
  • to ensure fair queuing the BRPOP Redis functionality is used. This is a blocking call that return the next element in one or multiple redis Key(s),
  • each client tries to acquire a permit/token from the semaphore, then adds it to a holders_set key and to a TTL-enabled holder key,
  • if there are no more tokens, the BRPOP function blocks
  • when a client is done with the semaphore, it LPUSH it back in the tokens list, Redis takes care of waking the next client waiting in line,
  • NOTE: In case the socket timeout is set it would timeout the waiting, therefore the client will lose its position in the waiting queue. That means is socket timeout becomes necessary it would reduce the fairness of that queue to the socket timeout window. Therefore it is currently removed to be perfectly fair.

BEFORE vs AFTER:

  • each client constantly retries vs each client waits for the redis server to say something (no constant communication, but no-ops)
  • not fair vs full fairness is guaranteed

Next steps

  • automatic cleanup and recovery of permits in case of a director-v2 crashing while it has permits

Related issue/s

How to test

make devenv
cd package/service-library
make "install-dev[all]"
make test-dev

especially check
test_semaphore.py and test_semaphore_decorator.py

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 22, 2025
@sanderegg sanderegg self-assigned this Sep 22, 2025
@sanderegg sanderegg added a:services-library issues on packages/service-libs a:director-v2 issue related with the director-v2 service a:computational clusters labels Sep 22, 2025
@codecov
Copy link

codecov bot commented Sep 22, 2025

Codecov Report

❌ Patch coverage is 96.21622% with 7 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.90%. Comparing base (d522bc4) to head (4544179).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8401      +/-   ##
==========================================
- Coverage   87.90%   87.90%   -0.01%     
==========================================
  Files        1951     1947       -4     
  Lines       75976    75966      -10     
  Branches     1336     1341       +5     
==========================================
- Hits        66790    66775      -15     
- Misses       8787     8790       +3     
- Partials      399      401       +2     
Flag Coverage Δ
integrationtests 63.96% <ø> (-0.03%) ⬇️
unittests 86.58% <96.21%> (+<0.01%) ⬆️
Components Coverage Δ
pkg_aws_library 93.59% <ø> (ø)
pkg_celery_library 83.58% <ø> (ø)
pkg_dask_task_models_library 79.33% <ø> (ø)
pkg_models_library 93.09% <ø> (ø)
pkg_notifications_library 85.20% <ø> (ø)
pkg_postgres_database 87.95% <ø> (ø)
pkg_service_integration 70.19% <ø> (ø)
pkg_service_library 72.54% <96.21%> (+0.10%) ⬆️
pkg_settings_library 90.19% <ø> (ø)
pkg_simcore_sdk 84.99% <ø> (ø)
agent 93.53% <ø> (ø)
api_server 91.96% <ø> (ø)
autoscaling 95.74% <ø> (ø)
catalog 92.36% <ø> (ø)
clusters_keeper 99.13% <ø> (ø)
dask_sidecar 91.82% <ø> (ø)
datcore_adapter 97.94% <ø> (ø)
director 75.81% <ø> (ø)
director_v2 90.99% <ø> (+0.01%) ⬆️
dynamic_scheduler 96.27% <ø> (ø)
dynamic_sidecar 90.36% <ø> (-0.10%) ⬇️
efs_guardian 89.62% <ø> (ø)
invitations 91.44% <ø> (ø)
payments 92.62% <ø> (ø)
resource_usage_tracker 92.24% <ø> (+0.53%) ⬆️
storage 86.66% <ø> (-0.17%) ⬇️
webclient ∅ <ø> (∅)
webserver 87.97% <ø> (-0.04%) ⬇️

Continue to review full report in Codecov by Sentry.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update d522bc4...4544179. Read the comment docs.

🚀 New features to boost your workflow:
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@mergify
Copy link
Contributor

mergify bot commented Sep 22, 2025

🧪 CI Insights

Here's what we observed from your CI run for 4544179.

✅ Passed Jobs With Interesting Signals

Pipeline Job Signal Health on master Retries 🔍 CI Insights 📄 Logs
CI integration-tests Base branch is broken, but the job passed. Looks like this might be a real fix 💪 Broken 0 View View

@sanderegg sanderegg force-pushed the computational-backend/stability-improvements-step8 branch from a676152 to cd07876 Compare September 22, 2025 14:44
@sanderegg sanderegg marked this pull request as ready for review September 23, 2025 14:47
Copy link
Contributor

@wvangeit wvangeit left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks

@sanderegg sanderegg changed the title 🎨Computational backend: stability improvements step8 🎨Computational backend: stability improvements step8 🚨 Sep 23, 2025
Copy link
Contributor

@GitHK GitHK left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks nice. Thanks!

@sanderegg sanderegg force-pushed the computational-backend/stability-improvements-step8 branch from 70d806e to ba2fd03 Compare September 23, 2025 15:16
@sanderegg sanderegg force-pushed the computational-backend/stability-improvements-step8 branch from d2db947 to c5d5b3d Compare September 23, 2025 16:33
Copy link
Member

@pcrespov pcrespov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I undesrtand the idea but not totally the implementation. Pehrpas on friday you can spare a few minutes to clarify some doubts :-)
thx

@sonarqubecloud
Copy link

@sanderegg sanderegg merged commit a210b2d into ITISFoundation:master Sep 23, 2025
95 checks passed
@sanderegg sanderegg deleted the computational-backend/stability-improvements-step8 branch September 23, 2025 18:53
Copy link

@JavierGOrdonnez JavierGOrdonnez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great job!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

a:computational clusters a:director-v2 issue related with the director-v2 service a:services-library issues on packages/service-libs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

6 participants