
@sanderegg
Member

What do these changes do?

While working on #8099, and after once setting up a large number of warm buffer machines, it was discovered that the warm buffer background task does not create machines reliably.
The EC2 interface allows creating a number of instances between a minimum and a maximum, and the warm buffers were always created with both set to the same number. This means that a request for 20 warm buffers forced EC2 to create all 20 in a single batch: if it could not provide all 20 at once, no machine was created.
This PR changes the request to ask for between 1 and N machines, so it also succeeds when AWS can only provide 1 machine at a time.
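To make the difference concrete, here is a minimal Python sketch of the two request shapes, assuming the request is issued through boto3's EC2 `run_instances` call, whose `MinCount`/`MaxCount` parameters carry the minimum and maximum described above. The helpers `build_run_instances_kwargs` and `simulate_launch` are hypothetical and not taken from the autoscaling code. `simulate_launch` is a simplified model of EC2's documented behaviour: if at least `MinCount` instances cannot be launched, none are; otherwise EC2 launches as many as it can, up to `MaxCount`. The AMI ID, instance type and capacity figure are made-up placeholders.

```python
from typing import Any


def build_run_instances_kwargs(
    image_id: str, instance_type: str, num_machines: int, *, all_or_nothing: bool
) -> dict[str, Any]:
    """Hypothetical helper building kwargs for boto3's EC2 `run_instances`."""
    if num_machines < 1:
        raise ValueError("num_machines must be at least 1")
    return {
        "ImageId": image_id,
        "InstanceType": instance_type,
        # before this PR: MinCount == MaxCount, i.e. all N machines or none
        # after this PR: MinCount == 1, i.e. anything between 1 and N machines
        "MinCount": num_machines if all_or_nothing else 1,
        "MaxCount": num_machines,
    }


def simulate_launch(min_count: int, max_count: int, available_capacity: int) -> int:
    """Simplified model of how many instances EC2 launches for one request."""
    if available_capacity < min_count:
        return 0  # MinCount cannot be satisfied: no instance is launched
    return min(max_count, available_capacity)  # as many as possible, up to MaxCount


old = build_run_instances_kwargs("ami-placeholder", "t3.medium", 20, all_or_nothing=True)
new = build_run_instances_kwargs("ami-placeholder", "t3.medium", 20, all_or_nothing=False)
# With a real client this would be e.g. boto3.client("ec2").run_instances(**new)
# Assume EC2 can currently provide only 7 instances of this type
print(simulate_launch(old["MinCount"], old["MaxCount"], available_capacity=7))  # 0
print(simulate_launch(new["MinCount"], new["MaxCount"], available_capacity=7))  # 7
```

Under this model, the old all-or-nothing request yields no machine at all when capacity is short, while the new request still yields the 7 machines that are available.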

Related issue/s

How to test

Dev-ops

@sanderegg added this to the Voyager milestone Aug 26, 2025
@sanderegg self-assigned this Aug 26, 2025
@sanderegg added the a:autoscaling and 🤖-automerge labels Aug 26, 2025
@sanderegg requested a review from Copilot August 26, 2025 10:00
@sanderegg force-pushed the autoscaling/create-1-warmbuffer-at-a-time branch from a83d521 to c39da4f
@sanderegg
Member Author

@mergify queue

@mergify
Contributor

mergify bot commented Aug 26, 2025

queue

🟠 Waiting for conditions to match

  • -closed [📌 queue requirement]
  • -conflict [📌 queue requirement]
  • -draft [📌 queue requirement]
  • any of: [📌 queue -> configuration change requirements]
    • -mergify-configuration-changed
    • check-success = Configuration changed
  • any of: [🔀 queue conditions]
    • all of: [📌 queue conditions of queue default]
      • #approved-reviews-by >= 2 [🛡 GitHub branch protection]
      • #approved-reviews-by>=2
      • #changes-requested-reviews-by = 0 [🛡 GitHub branch protection]
      • #changes-requested-reviews-by=0
      • #review-threads-unresolved = 0 [🛡 GitHub branch protection]
      • #review-threads-unresolved=0
      • -conflict
      • -draft
      • base=master
      • branch-protection-review-decision = APPROVED [🛡 GitHub branch protection]
      • label!=🤖-do-not-merge
      • label=🤖-automerge
      • any of: [🛡 GitHub branch protection]
        • check-skipped = deploy to dockerhub
        • check-neutral = deploy to dockerhub
        • check-success = deploy to dockerhub
      • any of: [🛡 GitHub branch protection]
        • check-success = system-tests
        • check-neutral = system-tests
        • check-skipped = system-tests
      • any of: [🛡 GitHub branch protection]
        • check-success = unit-tests
        • check-neutral = unit-tests
        • check-skipped = unit-tests
      • any of: [🛡 GitHub branch protection]
        • check-success = check OAS' are up to date
        • check-neutral = check OAS' are up to date
        • check-skipped = check OAS' are up to date
      • any of: [🛡 GitHub branch protection]
        • check-success = integration-tests
        • check-neutral = integration-tests
        • check-skipped = integration-tests
      • any of: [🛡 GitHub branch protection]
        • check-success = build-test-images (frontend) / build-test-images
        • check-neutral = build-test-images (frontend) / build-test-images
        • check-skipped = build-test-images (frontend) / build-test-images
      • any of: [🛡 GitHub branch protection]
        • check-success = SonarCloud Code Analysis
        • check-neutral = SonarCloud Code Analysis
        • check-skipped = SonarCloud Code Analysis

Contributor

Copilot AI left a comment


Pull Request Overview

This PR improves the reliability of warm buffer machine creation in the autoscaling service by modifying the EC2 instance creation strategy. Instead of requiring AWS to create all requested machines in a single batch (which could fail if resources are limited), it now lets AWS create machines incrementally, accepting as few as 1 machine per request.

  • Changes the minimum instance creation requirement from N machines to 1 machine
  • Maintains the desired total number of machines while allowing for more flexible provisioning
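Because the warm buffer task runs periodically, accepting a partial result on one run still lets the buffer reach its target over later runs. The sketch below is a hypothetical simulation of that behaviour; the function `fill_warm_buffer` and the per-run capacity list are illustrative assumptions, not the service's code. On each run it requests between a minimum and the number of missing machines, and it uses a simplified EC2 model in which nothing is launched unless at least the minimum fits into the assumed capacity.

```python
def fill_warm_buffer(
    target: int, capacity_per_run: list[int], *, all_or_nothing: bool
) -> list[int]:
    """Hypothetical simulation of a periodic warm-buffer task.

    capacity_per_run[i] is the assumed number of instances EC2 can provide on
    run i. Returns the number of warm buffer machines after each run.
    """
    current = 0
    history: list[int] = []
    for capacity in capacity_per_run:
        missing = target - current
        if missing > 0:
            # old behaviour asks for exactly `missing`; new behaviour asks for 1..missing
            min_count = missing if all_or_nothing else 1
            # simplified EC2 model: nothing is launched unless MinCount fits
            if capacity >= min_count:
                current += min(missing, capacity)  # never more than MaxCount=missing
        history.append(current)
    return history


capacity = [7, 0, 5, 10]  # made-up capacity on four consecutive runs
print(fill_warm_buffer(20, capacity, all_or_nothing=True))   # [0, 0, 0, 0]
print(fill_warm_buffer(20, capacity, all_or_nothing=False))  # [7, 7, 12, 20]
```

With an all-or-nothing request the buffer stays empty as long as EC2 cannot supply all 20 machines in one go; with a minimum of 1 it fills up over successive runs while never exceeding the target.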

@codecov

codecov bot commented Aug 26, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 68.45%. Comparing base (35e7048) to head (ff30fe6).
⚠️ Report is 1 commit behind head on master.

❗ There is a different number of reports uploaded between BASE (35e7048) and HEAD (ff30fe6).

HEAD has 31 fewer uploads than BASE.
Flag        BASE (35e7048)  HEAD (ff30fe6)
unittests   32              1
Additional details and impacted files
@@             Coverage Diff             @@
##           master    #8262       +/-   ##
===========================================
- Coverage   88.03%   68.45%   -19.59%     
===========================================
  Files        1919      781     -1138     
  Lines       74341    36587    -37754     
  Branches     1305      175     -1130     
===========================================
- Hits        65449    25047    -40402     
- Misses       8499    11483     +2984     
+ Partials      393       57      -336     
Flag Coverage Δ
integrationtests 64.27% <ø> (+0.04%) ⬆️
unittests 95.89% <ø> (+9.21%) ⬆️
Components Coverage Δ
pkg_aws_library ∅ <ø> (∅)
pkg_celery_library ∅ <ø> (∅)
pkg_dask_task_models_library ∅ <ø> (∅)
pkg_models_library ∅ <ø> (∅)
pkg_notifications_library ∅ <ø> (∅)
pkg_postgres_database ∅ <ø> (∅)
pkg_service_integration ∅ <ø> (∅)
pkg_service_library ∅ <ø> (∅)
pkg_settings_library ∅ <ø> (∅)
pkg_simcore_sdk 76.95% <ø> (-8.08%) ⬇️
agent ∅ <ø> (∅)
api_server ∅ <ø> (∅)
autoscaling 95.89% <ø> (ø)
catalog ∅ <ø> (∅)
clusters_keeper ∅ <ø> (∅)
dask_sidecar ∅ <ø> (∅)
datcore_adapter ∅ <ø> (∅)
director ∅ <ø> (∅)
director_v2 78.25% <ø> (-12.67%) ⬇️
dynamic_scheduler ∅ <ø> (∅)
dynamic_sidecar 87.19% <ø> (-2.91%) ⬇️
efs_guardian ∅ <ø> (∅)
invitations ∅ <ø> (∅)
payments ∅ <ø> (∅)
resource_usage_tracker ∅ <ø> (∅)
storage ∅ <ø> (∅)
webclient ∅ <ø> (∅)
webserver 59.13% <ø> (-29.00%) ⬇️

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 35e7048...ff30fe6.

@mergify
Contributor

mergify bot commented Aug 26, 2025

🧪 CI Insights

Here's what we observed from your CI run for ff30fe6.

✅ Passed Jobs With Interesting Signals

Pipeline: CI
  • integration-tests: Base branch is healthy, but retries were needed. Could be early signs of flakiness 👀 (health on base branch: Healthy; retries: 2)
  • system-tests: Base branch is healthy, but retries were needed. Could be early signs of flakiness 👀 (health on base branch: Healthy; retries: 1)

@sanderegg changed the title from "🎨Autoscaling: warm buffers: create at least 1 machine at a time instead of X" to "🎨Autoscaling: warm buffers: create at minimum 1 machine at a time instead of asking directly for the required number" Aug 26, 2025
…m:sanderegg/osparc-simcore into autoscaling/create-1-warmbuffer-at-a-time
@sonarqubecloud

@sanderegg requested a review from GitHK August 26, 2025 14:17
@sanderegg merged commit 52a6abe into ITISFoundation:master Aug 26, 2025
192 of 201 checks passed
@sanderegg deleted the autoscaling/create-1-warmbuffer-at-a-time branch August 26, 2025 14:34

Labels

  • 🤖-automerge: marks PR as ready to be merged for Mergify
  • a:autoscaling: autoscaling service in simcore's stack


5 participants