Conversation

Member

@sanderegg sanderegg commented Sep 17, 2025

What do these changes do?

Until now, autoscaling for dynamic services did the following:

  • create a defined number of warm buffers (stopped EC2 instances, for which only the EBS storage is billed) and pre-pull a set of images on them
  • to cope with the currently pending services, either:
    1. start a warm buffer EC2 instance if one is available, connect it to the docker swarm and make it available to the swarm,
    2. or use a "hot" buffer EC2 instance if one is available and make it available to the docker swarm,
    3. or launch a "cold" EC2 instance, connect it to the swarm and make it available to the docker swarm
  • ⚡ on a cold start, the same pre-pulling is done as for warm buffers
    • 👎 with many large images this can keep the dynamic-sidecar from starting quickly, because docker has to pull images it does not need right away
    • 👎 this shows up as a long 0% progress when starting s4l, for instance, since that progress relies on the dynamic-sidecar being up
    • 👎 we might pull docker images that are never used
    • 👎 this needlessly increases traffic and costs
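
The selection order described above (warm buffer first, then hot buffer, then cold start) can be sketched roughly as follows. This is only an illustration, not the autoscaling service's code: `StartKind`, `BufferPool` and `plan_instances` are hypothetical names, and the real service works with EC2 and docker swarm state rather than simple counters.

```python
from dataclasses import dataclass
from enum import Enum


class StartKind(Enum):
    WARM_BUFFER = "warm_buffer"  # stopped EC2 instance with images already pulled
    HOT_BUFFER = "hot_buffer"  # running EC2 instance held in reserve
    COLD = "cold"  # newly launched EC2 instance


@dataclass
class BufferPool:
    """Hypothetical counters of the buffer instances currently available."""

    warm: int = 0
    hot: int = 0


def plan_instances(pool: BufferPool, needed: int) -> list[StartKind]:
    """Pick one start kind per needed instance: warm first, then hot, then cold."""
    plan: list[StartKind] = []
    for _ in range(needed):
        if pool.warm > 0:
            pool.warm -= 1
            plan.append(StartKind.WARM_BUFFER)
        elif pool.hot > 0:
            pool.hot -= 1
            plan.append(StartKind.HOT_BUFFER)
        else:
            # No buffer left: a cold start is the only remaining option.
            plan.append(StartKind.COLD)
    return plan
```

With one warm and one hot buffer available and three instances needed, the third instance falls through to a cold start, which is the case this PR targets.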

This PR brings:

  • 👍 EC2_INSTANCES_COLD_START_DOCKER_IMAGES_PRE_PULLING and WORKERS_EC2_INSTANCES_COLD_START_DOCKER_IMAGES_PRE_PULLING: lists of images to pre-pull on cold starts (typically the dynamic-sidecar and some OPS-related images that always need to be around)
  • 👍 cold-started instances now pre-pull only what EC2_INSTANCES_COLD_START_DOCKER_IMAGES_PRE_PULLING contains
  • 👍 the same applies to the computational backend through WORKERS_EC2_INSTANCES_COLD_START_DOCKER_IMAGES_PRE_PULLING
  • 👍 the full pre-pull list is now used only for warm-buffer instances
  • 👋 the crontab option is gone, as it was unused and also bad for performance
  • changing the pre-pulling list now updates currently running hot buffer instances
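
As a rough sketch of the new behaviour, the snippet below picks the image list depending on whether an instance is a cold start. It rests on several assumptions that the PR text does not confirm: that the setting is provided as a JSON list of image names, and that the helper names `parse_image_list` and `images_to_pre_pull` and the example image names are purely hypothetical.

```python
import json
from typing import Optional

# Setting name taken from the PR description; its JSON-list format is an assumption.
COLD_START_ENV = "EC2_INSTANCES_COLD_START_DOCKER_IMAGES_PRE_PULLING"


def parse_image_list(raw: Optional[str]) -> list[str]:
    """Parse a JSON list of image names; an unset or empty value means no images."""
    if not raw:
        return []
    images = json.loads(raw)
    if not isinstance(images, list) or not all(isinstance(i, str) for i in images):
        raise ValueError(f"{COLD_START_ENV} must be a JSON list of strings")
    return images


def images_to_pre_pull(
    *, cold_start: bool, full_list: list[str], cold_start_list: list[str]
) -> list[str]:
    """Cold starts pull only the short list; warm buffers keep the full list."""
    return list(cold_start_list) if cold_start else list(full_list)


if __name__ == "__main__":
    # Hypothetical image names, for illustration only.
    cold = parse_image_list('["registry.example/dynamic-sidecar:latest"]')
    full = cold + ["registry.example/large-service:1.0.0"]
    print(images_to_pre_pull(cold_start=True, full_list=full, cold_start_list=cold))
    print(images_to_pre_pull(cold_start=False, full_list=full, cold_start_list=cold))
```

Keeping the cold-start list short means a cold instance only waits for the images it needs immediately, while warm buffers can still take their time pulling the full list while idle.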

Related issue/s

How to test

Dev-ops

@sanderegg sanderegg added this to the Cheops milestone Sep 17, 2025
@sanderegg sanderegg self-assigned this Sep 17, 2025
@sanderegg sanderegg added the a:autoscaling autoscaling service in simcore's stack label Sep 17, 2025
@sanderegg sanderegg changed the title from "Autoscaling: add subset of prepulled images for cold starts" to "Autoscaling for dynamic services: Performance improvements for cold starts" Sep 17, 2025

codecov bot commented Sep 17, 2025

Codecov Report

❌ Patch coverage is 81.89655% with 21 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.02%. Comparing base (9b3fc23) to head (f1a411e).
⚠️ Report is 1 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8375      +/-   ##
==========================================
- Coverage   87.55%   87.02%   -0.54%     
==========================================
  Files        2007     2003       -4     
  Lines       78313    78335      +22     
  Branches     1343     1343              
==========================================
- Hits        68569    68170     -399     
- Misses       9342     9763     +421     
  Partials      402      402              
Flag Coverage Δ
integrationtests 60.45% <ø> (-3.52%) ⬇️
unittests 86.25% <81.89%> (-0.03%) ⬇️
Components Coverage Δ
pkg_aws_library 93.61% <80.00%> (+0.02%) ⬆️
pkg_celery_library 85.83% <ø> (ø)
pkg_dask_task_models_library 79.33% <ø> (ø)
pkg_models_library 92.90% <ø> (ø)
pkg_notifications_library 85.20% <ø> (ø)
pkg_postgres_database 87.95% <ø> (ø)
pkg_service_integration 70.17% <ø> (ø)
pkg_service_library 70.96% <ø> (ø)
pkg_settings_library 90.20% <ø> (ø)
pkg_simcore_sdk 84.95% <ø> (ø)
agent 93.10% <ø> (ø)
api_server 91.76% <ø> (ø)
autoscaling 95.00% <81.65%> (-0.73%) ⬇️
catalog 92.06% <ø> (ø)
clusters_keeper 99.14% <100.00%> (+<0.01%) ⬆️
dask_sidecar 92.38% <ø> (ø)
datcore_adapter 97.95% <ø> (ø)
director 75.72% <ø> (ø)
director_v2 85.39% <ø> (-5.52%) ⬇️
dynamic_scheduler 96.82% <ø> (ø)
dynamic_sidecar 90.37% <ø> (-0.07%) ⬇️
efs_guardian 89.83% <ø> (ø)
invitations 90.90% <ø> (ø)
payments 92.80% <ø> (ø)
resource_usage_tracker 92.16% <ø> (-0.06%) ⬇️
storage 86.50% <ø> (-0.09%) ⬇️
webclient ∅ <ø> (∅)
webserver 87.09% <ø> (+<0.01%) ⬆️

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

Contributor

mergify bot commented Sep 17, 2025

🧪 CI Insights

Here's what we observed from your CI run for f1a411e.

❌ Job Failures

Pipeline | Job | Health on master | Retries | 🔍 CI Insights | 📄 Logs
CI | integration-tests | Healthy | 0 | View | View

@sanderegg sanderegg force-pushed the autoscaling/only-pre-pull-if-idle branch from 08a673e to 7d31f74 Compare September 17, 2025 08:40
@sanderegg sanderegg changed the title from "Autoscaling for dynamic services: Performance improvements for cold starts" to "Autoscaling for dynamic services: Performance improvements for cold starts (⚠️ devops)" Sep 17, 2025
@sanderegg sanderegg force-pushed the autoscaling/only-pre-pull-if-idle branch from 206065c to 7b4a0df Compare September 18, 2025 05:56
@sanderegg sanderegg marked this pull request as ready for review September 18, 2025 05:57
@sanderegg sanderegg changed the title from "Autoscaling for dynamic services: Performance improvements for cold starts (⚠️ devops)" to "Autoscaling for dynamic services: Performance improvements for cold starts (⚠️ devops) 🚨🚨🚨" Sep 18, 2025
@sanderegg sanderegg marked this pull request as draft September 18, 2025 05:59
Member Author

This still needs a change in the hot buffer handling. Sorry for the premature ready-for-review assignment.

@sanderegg sanderegg force-pushed the autoscaling/only-pre-pull-if-idle branch 2 times, most recently from c51773d to 93255ca Compare September 19, 2025 09:31
@sanderegg sanderegg force-pushed the autoscaling/only-pre-pull-if-idle branch from f3b7191 to f1a411e Compare October 16, 2025 07:26
@sanderegg sanderegg merged commit 9afa492 into ITISFoundation:master Oct 16, 2025
87 of 91 checks passed
@sanderegg sanderegg deleted the autoscaling/only-pre-pull-if-idle branch October 16, 2025 07:45
Labels

🤖-automerge: marks PR as ready to be merged for Mergify
a:autoscaling: autoscaling service in simcore's stack

4 participants