Conversation

@mrnicegyu11 (Member) commented Aug 7, 2025

What do these changes do?

Exposes the env-vars POSTGRES_MINSIZE and POSTGRES_MAXSIZE. These can be used to raise e.g. POSTGRES_MINSIZE, which leads to drastic performance increases on osparc-master.speag.com.

Raises (i.e. partially reverts) the default value of POSTGRES_MINSIZE from 1 to 2 (see the long comment below). 5 is the actual default in asyncpg; it was overridden to 1 in the past. Now the value can be changed via env-vars.

JIT compilation is disabled (see below for why).

BONUS: Hostnames now clearly identify services (also in postgres/adminer, per client).
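
For illustration, a minimal sketch of how the two env-vars above could be exposed and validated (this is not the repo's actual settings class; field defaults and bounds are illustrative, apart from the new default of 2 for POSTGRES_MINSIZE described above):

```python
# Hypothetical sketch: expose POSTGRES_MINSIZE / POSTGRES_MAXSIZE via pydantic-settings
# so they can be tuned per deployment (or per service) through the environment.
from pydantic import Field
from pydantic_settings import BaseSettings


class PostgresPoolSettings(BaseSettings):
    # minimum number of connections kept open in the pool (this PR changes default 1 -> 2)
    POSTGRES_MINSIZE: int = Field(default=2, ge=1)
    # maximum number of connections the pool may open (default here is illustrative only)
    POSTGRES_MAXSIZE: int = Field(default=50, ge=1)


# values are picked up from the environment, e.g. POSTGRES_MINSIZE=10 on osparc-master
settings = PostgresPoolSettings()
print(settings.POSTGRES_MINSIZE, settings.POSTGRES_MAXSIZE)
```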

Related issue/s

https://www.perplexity.ai/page/postgres-query-slowness-invest-7bOGFHGrTEWSoOYv4FcS_g

Ops-PR: https://git.speag.com/oSparc/osparc-ops-deployment-configuration/-/merge_requests/1531

How to test

Dev-ops

@mrnicegyu11 mrnicegyu11 self-assigned this Aug 7, 2025

codecov bot commented Aug 7, 2025

Codecov Report

❌ Patch coverage is 33.33333% with 2 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.85%. Comparing base (15204d5) to head (01cf113).
⚠️ Report is 36 commits behind head on master.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #8199      +/-   ##
==========================================
+ Coverage   79.26%   87.85%   +8.59%
==========================================
  Files        1855     1910      +55
  Lines       70991    73418    +2427
  Branches     1301     1301
==========================================
+ Hits        56272    64504    +8232
+ Misses      14332     8527    -5805
  Partials      387      387
```
| Flag | Coverage Δ |
|---|---|
| integrationtests | 73.40% <ø> (+9.16%) ⬆️ |
| unittests | 86.72% <33.33%> (+0.30%) ⬆️ |

| Components | Coverage Δ |
|---|---|
| pkg_aws_library | 93.93% <ø> (ø) |
| pkg_celery_library | 87.37% <ø> (ø) |
| pkg_dask_task_models_library | 79.62% <ø> (ø) |
| pkg_models_library | 93.03% <ø> (ø) |
| pkg_notifications_library | 85.26% <ø> (ø) |
| pkg_postgres_database | 88.02% <ø> (ø) |
| pkg_service_integration | 70.19% <ø> (ø) |
| pkg_service_library | 71.71% <0.00%> (ø) |
| pkg_settings_library | 90.46% <ø> (ø) |
| pkg_simcore_sdk | 85.10% <ø> (+0.11%) ⬆️ |
| agent | 93.81% <ø> (ø) |
| api_server | 93.20% <ø> (ø) |
| autoscaling | 95.89% <ø> (ø) |
| catalog | 92.34% <ø> (ø) |
| clusters_keeper | 99.13% <ø> (ø) |
| dask_sidecar | 91.81% <ø> (-0.57%) ⬇️ |
| datcore_adapter | 97.94% <ø> (ø) |
| director | 76.14% <ø> (-0.10%) ⬇️ |
| director_v2 | 90.96% <ø> (+0.11%) ⬆️ |
| dynamic_scheduler | 96.27% <ø> (ø) |
| dynamic_sidecar | 90.12% <ø> (ø) |
| efs_guardian | 89.60% <ø> (ø) |
| invitations | 91.44% <ø> (ø) |
| payments | 92.60% <ø> (ø) |
| resource_usage_tracker | 92.18% <100.00%> (-0.38%) ⬇️ |
| storage | 86.54% <ø> (∅) |
| webclient | ∅ <ø> (∅) |
| webserver | 87.50% <ø> (+28.40%) ⬆️ |


Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data


@odeimaiz (Member) left a comment

thanks

@mrnicegyu11 mrnicegyu11 marked this pull request as ready for review August 7, 2025 08:07
@YuryHrytsuk (Contributor) left a comment

@mrnicegyu11

Are there any sources beyond AI that support these changes? Could you add them to the description?

It would also be very helpful to have a brief description of the parameters changed (what exactly these settings mean).

@matusdrobuliak66 (Collaborator) left a comment

Thanks

@mrnicegyu11 (Member, Author) commented Aug 7, 2025

> @mrnicegyu11
>
> Are there any sources beyond AI that support these changes? Could you add them to the description?
>
> It would also be very helpful to have a brief description of the parameters changed (what exactly these settings mean).

This was validated "experimentally", by checking whether setting POSTGRES_MINSIZE=10 makes osparc-master.speag.com faster. It does, as this Grafana-TraceQL plot shows. At 10:44 on the x-axis the mentioned env-var was added; one can see that the average response time for /v0/me goes down drastically.
[Grafana-TraceQL plot: average /v0/me response time (in seconds) dropping sharply after the env-var change at 10:44]

Apart from this, an AI investigation pointed me towards this finding. I was initially thinking that we hit the max limit of pooled connections, but we have Prometheus metrics that prove this isn't the case. By removing CPU limitations I checked the effect of CPU throttling on postgres, but it did not lead to a speedup. At some point I elevated the min-connections and this helped.
@YuryHrytsuk

@mrnicegyu11 mrnicegyu11 requested a review from YuryHrytsuk August 8, 2025 08:07
mergify bot commented Aug 8, 2025

🧪 CI Insights

Here's what we observed from your CI run for 01cf113.

❌ Failed Jobs

| Pipeline | Job | Health on base branch | Retries | 🔍 CI Insights | 📄 Logs |
|---|---|---|---|---|---|
| CI | [int] webserver 01 (3.11, ubuntu-24.04) | Healthy | 0 | View | View |
| | integration-tests | Healthy | 0 | View | View |

✅ Passed Jobs With Interesting Signals

| Pipeline | Job | Signal | Health on base branch | Retries | 🔍 CI Insights | 📄 Logs |
|---|---|---|---|---|---|---|
| CI | system-tests | Base branch is broken, but the job passed. Looks like this might be a real fix 💪 | Broken | 0 | View | View |
| | unit-tests | Base branch is broken, but the job passed. Looks like this might be a real fix 💪 | Broken | 0 | View | View |

@YuryHrytsuk (Contributor) commented

> This was validated "experimentally", by checking whether setting POSTGRES_MINSIZE=10 makes osparc-master.speag.com faster. It does, as this Grafana-TraceQL plot shows. At 10:44 on the x-axis the mentioned env-var was added; one can see that the average response time for /v0/me goes down drastically.

What is the Y axis on the image? The integer numbers do not look like seconds or anything similar.

@mrnicegyu11 (Member, Author) commented

> This was validated "experimentally", by checking whether setting POSTGRES_MINSIZE=10 makes osparc-master.speag.com faster. It does, as this Grafana-TraceQL plot shows. At 10:44 on the x-axis the mentioned env-var was added; one can see that the average response time for /v0/me goes down drastically.
>
> What is the Y axis on the image? The integer numbers do not look like seconds or anything similar.

@YuryHrytsuk it is seconds

@YuryHrytsuk (Contributor) left a comment

Thanks 🙏

@sanderegg (Member) left a comment

Multiple things here:

  • this PR changes the minimum number of connections to postgres from 1 to 5 in all the services that use postgres. For example the webserver uses many connections, and probably the director-v2 too; most of the other services use maybe 1, 2 or 3 (such as every dynamic sidecar). Was this thought about?
  • why change the settings default value instead of using the ENV variable to fix only master for testing?
  • using the ENVs we can even change only the minimum per service

@mrnicegyu11 mrnicegyu11 added this to the Voyager milestone Aug 13, 2025
@mrnicegyu11 (Member, Author) commented

**Asking for re-review from @sanderegg @pcrespov**

@sanderegg (Member) commented

@mrnicegyu11 just click the re-review button next time.

@sanderegg sanderegg self-requested a review August 13, 2025 14:36
@sanderegg (Member) left a comment

Thank you very much for the extensive testing, this is very interesting.
Here is the link to SQLAlchemy official docs: https://docs.sqlalchemy.org/en/20/dialects/postgresql.html#disabling-the-postgresql-jit-to-improve-enum-datatype-handling

Nevertheless:

  • From your graph I see that setting the min size to 2 already brings all the performance up. I don't really see the difference then between jit on and off. Maybe you can explain tomorrow?
  • I would suggest that you add a NOTE next to the postgres settings change that links to at least the SQLAlchemy link I posted above, to document why it is now hard-coded to a minimum of 2.
  • Also you should probably change the ge on MIN_SIZE and MAX_SIZE to 2 then.
  • so when you increased the number of connections, I guess that since this JIT thing happens on each new connection, it was already set up and that is why it is faster?
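
For reference, the recipe from the SQLAlchemy docs linked above boils down to something like this (connection URL and pool numbers are placeholders, not the repo's actual configuration; how POSTGRES_MINSIZE/POSTGRES_MAXSIZE map onto the engine's pool is only assumed here):

```python
# Sketch of disabling the PostgreSQL JIT for every asyncpg connection opened by a
# SQLAlchemy async engine, following the docs linked above.
from sqlalchemy.ext.asyncio import create_async_engine

engine = create_async_engine(
    "postgresql+asyncpg://user:password@postgres:5432/simcoredb",  # placeholder DSN
    connect_args={"server_settings": {"jit": "off"}},  # disable JIT on each new connection
    pool_size=2,     # e.g. taken from POSTGRES_MINSIZE (assumed mapping)
    max_overflow=8,  # illustrative only
)
```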

@bisgaard-itis (Contributor) commented Aug 14, 2025

More tests - new findings

After chatting with @sanderegg I ran some more tests, this time on an osparc-simcore local development build ("make up-devel").

After getting an initial hint from perplexity, I found that many people have issues with "slow introspection when using multiple custom types" (see e.g. MagicStack/asyncpg#530 and the links in MagicStack/asyncpg#1082). These are asyncpg-internal calls that are "big" and called often. Ad-hoc jit optimization by postgres slows them down considerably and effectively makes the python code that calls postgres via asyncpg wait. I could verify locally that jit has an impact on the performance of the /v0/me route.
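
At the asyncpg level (below the SQLAlchemy layer) the two knobs under discussion look roughly like this; a hypothetical sketch with a placeholder DSN, not the code of this PR:

```python
# Sketch: asyncpg pool with a minimum pool size and JIT disabled per connection, so the
# "big" type-introspection queries mentioned above are not JIT-compiled by postgres.
import asyncio

import asyncpg


async def main() -> None:
    pool = await asyncpg.create_pool(
        dsn="postgresql://user:password@localhost:5432/simcoredb",  # placeholder
        min_size=2,   # cf. POSTGRES_MINSIZE
        max_size=10,  # cf. POSTGRES_MAXSIZE
        server_settings={"jit": "off"},  # forwarded to every new connection
    )
    async with pool.acquire() as conn:
        print(await conn.fetchval("SHOW jit"))  # expected: "off"
    await pool.close()


asyncio.run(main())
```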

Benchmark

*Sending 100 `/v0/me` requests*: `for i in {1..100}; do curl 'http://your-ip.nip.io:9081/v0/me' --compressed -H 'User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:141.0) Gecko/20100101 Firefox/141.0' -H 'Accept: application/json' -H 'Accept-Language: en-US,en;q=0.5' -H 'Accept-Encoding: gzip, deflate' -H 'X-Requested-With: XMLHttpRequest' -H 'Content-Type: application/json' -H 'X-Simcore-Products-Name: osparc' -H 'X-Client-Session-Id: e37de453-47ba-4685-9c61-6c8f20bbeb0b' -H 'Connection: keep-alive' -H 'Referer: http://your-ip.nip.io:9081/' -H 'Cookie: adminer_sid=2ahifga0p854868kf5m05kau1q; adminer_permanent=; adminer_key=6a0cc76fb342c2f6f834ca5454ac8ea1; adminer_version=5.3.0; osparc-sc2="foobar"; [email protected]; _7b487=dac2d7ed7eae2ac5; portainer_api_key=foobar'; done` (after logging into the platform and obtaining the cookie)
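
If per-request numbers are wanted instead of eyeballing the terminal, a rough Python equivalent of the curl loop above could look like this (cookie and URL are placeholders, to be copied from a logged-in browser session):

```python
# Sketch: time 100 GET /v0/me requests and print simple latency statistics.
import statistics
import time

import requests  # assumes the requests package is available

BASE_URL = "http://your-ip.nip.io:9081"  # placeholder, as in the curl example
COOKIE = "osparc-sc2=..."                # placeholder: copy the session cookie after login


def bench(n: int = 100) -> None:
    session = requests.Session()
    session.headers.update({"Cookie": COOKIE, "X-Simcore-Products-Name": "osparc"})
    latencies = []
    for _ in range(n):
        start = time.perf_counter()
        response = session.get(f"{BASE_URL}/v0/me")
        response.raise_for_status()
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    print(f"mean={statistics.mean(latencies):.3f}s p95={latencies[int(0.95 * n)]:.3f}s")


if __name__ == "__main__":
    bench()
```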

Parameters varied: POSTGRES_MINSIZE, jit on/off. Held constant: webserver replicas = 1.

Results for varying POSTGRES_MINSIZE [1,2,3,5,10] with jit:off

[Plot: /v0/me response times for POSTGRES_MINSIZE in {1, 2, 3, 5, 10} with jit:off]

One can see that setting POSTGRES_MINSIZE>1 leads to performance increases compared to setting it to 1.

Results for varying POSTGRES_MINSIZE [1,2] and jit: off / on

[Plot: /v0/me response times for POSTGRES_MINSIZE in {1, 2} with jit off vs. on]

One can see that POSTGRES_MINSIZE=1 with jit:on is a bad idea.

For this reason I propose to turn jit off and set POSTGRES_MINSIZE=2 as the default everywhere.

BONUS: hostnames and the names of clients listed in postgres are now unique per service and clearly identify which service owns which pg-client.

Very interesting stuff! However, generally it is best practice to use make up-prod when checking performance. But I guess it is fine in this case as you are changing a DB setting.

@bisgaard-itis (Contributor) left a comment

Thanks for the great effort! Nice findings!

@mrnicegyu11 mrnicegyu11 requested a review from GitHK as a code owner August 14, 2025 12:23
@GitHK (Contributor) left a comment

Thanks, interesting


@mrnicegyu11 mrnicegyu11 merged commit e24f5b4 into ITISFoundation:master Aug 15, 2025
88 of 91 checks passed
@mrnicegyu11 mrnicegyu11 deleted the 2025/change/makePOSTGRESSIZEconfigurable branch August 15, 2025 08:41
@sanderegg sanderegg added the t:maintenance Some planned maintenance work label Aug 26, 2025
@matusdrobuliak66 matusdrobuliak66 mentioned this pull request Sep 2, 2025