
Conversation

@Jmevorach (Contributor) commented Nov 26, 2025

Refactors the OpenEMR 7.0.5 Docker image for significantly improved startup performance and reduced memory footprint. This is a work-in-progress that requires community testing to verify all container functionality before merging.

Performance Improvements

Benchmarking against the current Docker Hub image (openemr/openemr:7.0.5) shows:

| Metric        | Optimized   | Original    | Improvement   |
|---------------|-------------|-------------|---------------|
| Startup Time  | 15.0 s      | 73.1 s      | 4.9x faster   |
| Memory (Avg)  | 92.8 MB     | 304.2 MB    | 69% reduction |
| Memory (Peak) | 117.1 MB    | 326.6 MB    | 64% reduction |
| Throughput    | 114.9 req/s | 117.2 req/s | Equivalent    |

Technical Changes

Build-Time Permission Optimization (Primary)

  • Moved file permission operations (chmod/chown) from runtime to build time
  • The original container scanned and modified ~15,000 files on every startup (40-60s)
  • Now permissions are baked into the image; only setup-modified files need runtime adjustment (see the sketch after this list)
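
For concreteness, here is a minimal sketch of what such a build-time step could look like. The ownership, modes, and paths are illustrative assumptions (the webroot matches the path seen elsewhere in this thread), not necessarily what the PR ships:

```bash
# Hypothetical sketch of the build-time permission step (run once from a
# Dockerfile RUN instruction); ownership, modes, and paths are assumptions.
chown -R apache:apache /var/www/localhost/htdocs/openemr
find /var/www/localhost/htdocs/openemr -type d -exec chmod 755 {} +
find /var/www/localhost/htdocs/openemr -type f -exec chmod 644 {} +
# Only directories that setup actually writes to (e.g. sites/) would still
# need a runtime adjustment in openemr.sh.
```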

Startup Script Modernization

  • Rewrote openemr.sh from POSIX sh to bash with set -euo pipefail
  • Added comprehensive inline documentation and clear section organization
  • Improved database wait logic with exponential backoff (sketched after this list)
  • Simplified permission handling for runtime efficiency
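
As a rough illustration of the backoff idea only — the function name, environment variables, and limits below are assumptions, not the actual openemr.sh code:

```bash
#!/bin/bash
# Illustrative sketch of an exponential-backoff database wait; not the actual
# openemr.sh implementation. Assumes the MariaDB client and MYSQL_* env vars
# are available in the container.
set -euo pipefail

wait_for_database() {
    local delay=1 max_delay=30 elapsed=0 timeout=300
    until mysql --host="${MYSQL_HOST:-mysql}" --user="${MYSQL_USER:-openemr}" \
                --password="${MYSQL_PASS:-}" --execute='SELECT 1' >/dev/null 2>&1; do
        if (( elapsed >= timeout )); then
            echo "Database did not become available within ${timeout}s" >&2
            return 1
        fi
        echo "Database not ready; retrying in ${delay}s..."
        sleep "$delay"
        (( elapsed += delay ))
        (( delay = delay * 2 > max_delay ? max_delay : delay * 2 ))
    done
}

wait_for_database
```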

Dockerfile Documentation

  • Added extensive comments explaining each build stage
  • Documented package purposes, permission schemes, and build targets

Testing Needed

This PR needs help verifying that all OpenEMR container functions work correctly:

  • Fresh installation (auto-configure)
  • Manual setup mode
  • SSL/TLS certificate configuration
  • Redis session handling
  • Kubernetes mode (admin/worker roles)
  • Swarm mode coordination
  • Upgrade path from previous versions
  • XDebug configuration
  • Let's Encrypt integration
  • Document upload/storage
  • Multi-site configurations

How to Test

Build and test the optimized image:
cd docker/openemr/7.0.5

docker build -t openemr:7.0.5-optimized .
docker compose up -d

Or use the benchmarking utility to compare performance:
cd utilities/container_benchmarking

./benchmark.sh

Related Files

  • docker/openemr/7.0.5/* - Optimized container files
  • utilities/container_benchmarking/ - Benchmarking suite for performance validation

Summary of the Container Benchmarking Utility

The Container Benchmarking Utility (utilities/container_benchmarking/) is a valuable development tool for the OpenEMR project:

What It Does

  • Compares two container images side-by-side (local build vs. Docker Hub reference)
  • Measures key performance metrics:
    • Startup time: How long until the container is healthy
    • Throughput: Requests per second under load (Apache Bench)
    • Resource usage: CPU and memory during operation
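
For a sense of how numbers like these can be gathered, here is a rough, hand-rolled sketch. It is illustrative only; the actual benchmark.sh may measure things differently, and the image name, port, URL, and the presence of a database and HEALTHCHECK are assumptions:

```bash
# Illustrative sketch only; benchmark.sh may do this differently.
# Assumes the image defines a HEALTHCHECK and can reach a database.
container="openemr-bench"
start=$(date +%s)
docker run -d --name "$container" -p 8080:80 openemr:7.0.5-optimized >/dev/null

# Startup time: poll Docker's health status until the container reports healthy.
until [[ "$(docker inspect --format '{{.State.Health.Status}}' "$container")" == "healthy" ]]; do
    sleep 1
done
echo "Time to healthy: $(( $(date +%s) - start ))s"

# Throughput: Apache Bench against the running container.
ab -n 1000 -c 10 http://localhost:8080/

# Resource usage: one-shot snapshot of CPU and memory.
docker stats --no-stream "$container"
```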

Why It's Useful for OpenEMR Container Development

  1. Quantifies optimizations: Instead of guessing, you get hard numbers (e.g., "4.9x faster startup")
  2. Prevents regressions: Compare any changes against the baseline Docker Hub image
  3. Quick feedback loop: Full benchmark suite runs in 3-5 minutes
  4. Reproducible results: Documented methodology with timestamp-organized output
  5. Analysis tools: Includes summary.sh, compare_results.sh, and CSV export for deeper analysis

Recommended Use Cases

  • Before PRs: Verify performance impact of container changes
  • CI/CD integration: Automated regression detection in GitHub Actions
  • Version comparisons: Test across OpenEMR versions (7.0.3, 7.0.4, 7.0.5, etc.)
  • Optimization validation: Prove that changes actually improve performance

The utility democratizes performance testing—any contributor can run benchmarks locally and share quantified results, making it easier to evaluate and merge container improvements.

@kojiromike (Member) left a comment


This is great. I'm looking forward to getting more benchmarking data out of the gate.
I hope you don't mind my reviewing it early despite the WIP marker. I just don't know when I'll get another chance.

Mainly my comments are around taking better advantage of bash. If we're not going to try to be POSIX compliant then we should take full advantage of bash features, particularly integer arithmetic and shopts like nullglob, globstar and maybe extglob.
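
For readers less familiar with these options, a quick illustration of the kind of bash features being suggested (not code from the PR; the path is just an example):

```bash
#!/bin/bash
# Illustration only of the bash features mentioned above.
shopt -s nullglob globstar   # nullglob: unmatched globs expand to nothing;
                             # globstar: ** recurses into subdirectories.

count=0
for conf in /etc/apache2/**/*.conf; do   # with nullglob, zero matches means zero iterations
    (( ++count ))                        # built-in integer arithmetic, no expr/let needed
done
echo "Found ${count} Apache config files"
```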

The other stuff is about taking better advantage of current docker buildtime features like caching.

And maybe it's not a conversation for today, but maybe we don't need to have bash, python and php on the image. Some day, we should consider rewriting these tools in a single language. We can take advantage of Symfony CLI and other libs already included in openemr to avoid adding more overhead or dependency management.

@Jmevorach (Contributor, Author)

Thank you for taking the time to go through all this, @kojiromike; I really appreciate it. I'll be busy for a little bit, but I should have time in ~2 weeks to start going through all the comments you left and making edits. If anyone wants to step in before then and start making edits themselves, please feel free; I welcome all the help I can get on this PR.

The time from container start to when it passes healthchecks is really important to me, because the shorter that time is, the better our autoscaling will be when running multiple horizontal replicas. If it takes 5 minutes to spin up a new node, it doesn't mean you can't autoscale, but you won't be able to autoscale as well as if it took 1 minute. There are things you can do with Karpenter and other autoscalers to autoscale performantly even when startup time is relatively long (such as keeping pre-spun-up hosts ready to serve traffic rather than spinning up hosts from scratch on every autoscaling event), but reducing the "time from start to healthy" will improve horizontal autoscaling no matter where OpenEMR is run.

While reducing the size of the container image is good in general (especially because it helps reduce the hardware specs you'd need to run OpenEMR using Docker and the time it takes to pull the container), the rise of container caching technologies (for Kubernetes and basically anywhere else you'd host this in a large public cloud) means I'm planning to prioritize it less than reducing the number of operations the container has to perform when it starts, because I think the latter is going to impact startup time far more than image size.

I feel the same way about reducing container build times. Reducing build times is never a bad thing (especially because it makes testing easier for developers), but I think that minimizing the number of operations performed at container start, and performing those operations as quickly as possible (especially when they are the same each time and can be done at Docker build time, or by configuring the repositories build materials are pulled from), is going to give us the greatest value for effort for the time being.

I think the fastest way to start up would basically be to ...

  1. Elect a leader.
  2. Have the leader launch multiple parallel processes (as many as we can manage without sacrificing reliability) to complete setup and collect the PIDs of each so that when they complete the leader can make sure they all did so successfully.
  3. As much as possible while waiting for the leader to finish have follower containers also perform their own setup tasks (and parallelize them to the greatest extent possible).
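
A very rough sketch of what step 2 in the list above could look like; the setup function names are placeholders, and this is not what the containers do today:

```bash
# Hypothetical sketch of step 2 above; the setup functions are placeholders.
setup_tasks=(configure_database generate_certificates warm_caches)

pids=()
for task in "${setup_tasks[@]}"; do
    "$task" &            # run each setup task as a background process
    pids+=("$!")         # record its PID so the leader can verify it later
done

failed=0
for pid in "${pids[@]}"; do
    wait "$pid" || (( ++failed ))
done

if (( failed > 0 )); then
    echo "${failed} setup task(s) failed" >&2
    exit 1
fi
```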

What I'd really like to do is introduce automated CI/CD for container builds that verifies that all of this functionality works as it's supposed to:

  • Manual setup mode
  • SSL/TLS certificate configuration
  • Redis session handling
  • Kubernetes mode (admin/worker roles)
  • Swarm mode coordination
  • Upgrade path from previous versions
  • XDebug configuration
  • Let's Encrypt integration
  • Multi-site configurations

I've managed to make a "high availability" docker compose setup (high availability is in quotes because even with two containers running, running both on the same host is not really high availability) that launches two containers behind a load balancer, which can be used to test that Swarm node configuration works. So I'm making some progress in this regard, but, again, I'll happily take all the help I can get on pushing this forward.

The benchmarking is a good first step because it allows us to quickly verify the performance impact of a change to one of the containers, but as long as we have to more or less manually verify that the functionality above works as expected, there's a limit to how fast we'll be able to iterate. By the way, thanks for all the unit tests you've introduced so far. I'm a big fan of what's going on in the "tests" folder in the "openemr/openemr" repo and everything going on in the ".github" folder in this repo.

If we have CI/CD that verifies all the functionality the container should have and uses the benchmarking utilities then the process to find optimizations like this becomes much easier. Anyone would be able to fork the repository, make a change, and then know in minutes whether or not that change improved the container and maintained the functionality we need to properly run OpenEMR.

@kojiromike (Member)

The time from container start to when it passes healthchecks is something really important…

Agreed

I think the fastest way to startup would basically be to ...

1. Elect a leader.
2. Have the leader launch multiple parallel processes (as many as we can manage without sacrificing reliability) to complete setup and collect the PIDs of each so that when they complete the leader can make sure they all did so successfully.
3. As much as possible while waiting for the leader to finish have follower containers also perform their own setup tasks (and parallelize them to the greatest extent possible).

Let me see if I understand. Is openemr.sh slow when the application is already set up and ready to go, and we're just scaling it horizontally by adding another node in basically OPERATOR mode? Or is the problem that openemr.sh is slow starting up when it has a bunch of setup tasks to do? Or both?

@Jmevorach (Contributor, Author)

@kojiromike

Both.

Most of my contributions to the OpenEMR containers come from a perspective similar to that of an operator of a cluster (be it Kubernetes or, in the case of ECS, otherwise) who runs OpenEMR on behalf of a larger organization. Sort of like if you were a DevOps engineer hired by a hospital and tasked with operating a cluster running OpenEMR. The difference is that rather than being concerned about a specific installation, I'm basically trying to make machines that print out highly reliable setups for other operators. Most of the time I try to do that by acting like an operator myself and stress testing the setups I'm printing out to see what I need to improve in the machine that makes them.

When we first started building OpenEMR on ECS we noticed that the deployment was failing because the healthchecks were timing out. At first we thought this was because the containers had failed some setup task and were just sitting idle until they failed healthchecks and were terminated, but then we realized the containers were simply taking a really long time to start up.

When we run on ECS we use Fargate and run on a collection of Graviton 2 and Graviton 3 processors in whatever AWS region you decide to use. When we run on EKS we tend to use Graviton 3 or 4 (although auto mode defines node pools with multiple instance types). I ended up deciding on Graviton because 1/ it was cheaper than other comparable instance types and 2/ there's been a bunch of optimization work done for running PHP on Graviton, and I found it generally performed better than comparable x86-based instances. Also, AWS has a lot of ARM chips, so from an autoscaling perspective I think the chance that one of our clusters asks an AWS data center for another ARM vCPU and the data center says "sorry, we're all out" is pretty low.

This allows us to operate OpenEMR at an incredibly large scale and to do so relatively affordably. The initial load testing results we got from using just the ECS version alone (which is arguably less performant overall than the EKS setup) show us comfortably serving the equivalent of ~20K virtual users or ~4,000 req/sec with the equivalent of $0.079/hour worth of Fargate compute that never went beyond 19% avg. CPU utilization or 31% avg. RAM utilization. I truly believe that both the ECS and EKS setups as they exist today could be configured to meet the demands of even national or global scale deployments and serve very many concurrent users.

I'd say it takes about ~9-12 minutes for a leader pod to complete configuration on ECS and pass healthchecks and about ~6-8 minutes for new follower pods to spin up. There's some variation here because it's luck of the draw whether or not you end up running on Graviton 2 vs 3 and you can notice the difference in testing.

I'd say it takes about ~7-9 minutes for a leader pod to complete configuration on EKS and pass healthchecks and about ~4-6 minutes for a follower pod to spin up. There's some variation here because it's luck of the draw as to what specific processor you end up running on but in general I think the processors we end up running on in our EKS architecture are a little bit more performant than those we get allocated on average when we run on ECS.

The ECS version is cheaper and easier to manage than the EKS version while the EKS version offers integration paths to a lot more tools and a lot more functionality around really granular observability out of the box so I think it makes sense to offer both. In general I'd advise using the ECS based one for smaller to mid-size installations and the EKS based one for mid-size to large installations but you can definitely use either the ECS or EKS one at any scale.

While it's improved over time as we've figured out ways to ...

  1. Run OpenEMR better on EKS and ECS

  2. Improve the startup and overall performance of the Docker containers (e.g. introducing OPcache)

we've still had to set pretty generous healthchecks to get OpenEMR to run in a container orchestrator on AWS. I imagine that as other architectures are created in the future to run OpenEMR on container orchestrators elsewhere we'll probably run into similar issues.

It's super normal for applications to take a few minutes to start, either as replicas or, especially, when they boot for the first time. However, applications that take a while to boot and are going to be scheduled on Kubernetes or a similar orchestrator generally try to start responding to healthchecks as soon as reasonably possible on some endpoint that says "this application might not be ready to serve traffic, but it is operating normally." An example would be an application that almost immediately starts responding to healthchecks at /health, but if you go to the website while it's setting up you'll see some sort of landing page that says "Application is Starting" or something to that effect. In our case right now, though, openemr.sh starts Apache last, after all the other setup has completed.
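
As one hedged sketch of what that could look like in the entrypoint — the marker files and probe commands below are made up for illustration and are not part of this PR:

```bash
# Hypothetical sketch, not part of this PR: signal "alive" as early as possible,
# and only signal "ready" once setup has finished and Apache is about to start.
touch /run/openemr.alive          # liveness marker: the entrypoint is running normally

run_remaining_setup               # placeholder for the existing configuration work

touch /run/openemr.ready          # readiness marker: safe to route traffic here
exec /usr/sbin/httpd -D FOREGROUND

# An orchestrator could then probe liveness with `test -f /run/openemr.alive`
# and readiness with `test -f /run/openemr.ready`.
```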

If we can get the containers to reliably respond to healthchecks within 2 minutes of booting, we'd be able to map our autoscaling capacity very closely to the amount of compute we need at any one time to serve traffic when running in a cloud, and basically all of the nodes would be healthy at any one time, because the longest an unhealthy node would be allowed to serve traffic would be under 2 minutes. This would make the system cheaper and more reliable overall.

You can get around the long startup times today by setting the autoscaling thresholds lower than you'd probably want to. So instead of setting the average CPU utilization that triggers a scaling event to 70-90% and worrying about running up against more traffic than you can serve while new instances spin up, you can set it to 40-60% or something similar, so that you're more conservative about how much compute you keep ready to serve traffic at any one time.

Aside from requiring you to spend more money, this method can reliably serve even large volumes of OpenEMR traffic when running in a public cloud.

What's perhaps more concerning than cost is the case where a node fails and we may not have many nodes up at one time, because we can serve a lot of traffic with just a few nodes. Because we've had to configure the healthchecks to be really lenient to accommodate a ~5-7 minute window where a container is running but not responding to healthchecks, this can hypothetically lead to a scenario where a user experiences degraded functionality over a period of a few minutes and doesn't really know why.

Let's imagine a hypothetical situation where 1 of 4 hosts is doing nothing but serving 5xx responses to users. Neither of our setups needs to use sticky sessions, and most of the time we're just evenly distributing traffic to our replicas behind a load balancer. This means about 1 in 4 requests sent by a user doesn't work, so from their perspective they're clicking around in OpenEMR but 1 in 4 clicks doesn't work correctly in the UI, and they're left scratching their head as to what's going on; it may take a few minutes for that to resolve. It's not the biggest deal in the world, but I think if we lowered the "time to healthy" by even a few minutes, it would create a scenario in which basically all of the nodes, minus a minute or two here and there, were healthy at all times.

The faster a container starts responding to healthchecks, the faster we can have the orchestrator remove unhealthy nodes and replace them with healthy ones, which greatly improves the reliability of the system overall. It would also save money on autoscaling by allowing us to set higher CPU and RAM thresholds for triggering scale-out events.

In most orchestrators (including ECS and EKS) you can set a startup grace period that says something to the effect of "this container might take 5 minutes to boot initially, but after that first 5-minute grace period, do more aggressive healthchecks," which is something we do. So it definitely doesn't take us ~9-11 minutes to identify and remove unhealthy nodes, but it's still pretty high relative to most applications you'd run in a cluster, so it's something I'm still thinking about.

This PR aims to address three things I've noticed when running OpenEMR as a horizontal replica:

  1. The time to configure the MySQL database the first time can take a while.

  2. Each container spends a long time performing file operations like chmod/chown on startup, and unless you're doing development work I don't think it's necessary to keep these file operations in the production container, which is basically just meant to reliably serve traffic.

  3. When starting OpenEMR in a swarm, if the leader failed for whatever reason, all of the remaining containers would sit there idle waiting for the leader (which no longer exists because it crashed and got swapped out) to create a "docker-completed" file, and eventually the whole deployment would fail. You'd end up with three or four containers doing nothing but printing "Waiting for docker leader to finish configuration" over and over again to the logs. This PR's new "heartbeat" system lets a set of containers recover from this by a) detecting that the leader has failed and b) electing a new leader to pick up where configuration left off. (A rough sketch of the idea follows below.)
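
To make that concrete, here is a rough illustration of the heartbeat idea; the file locations and timings are assumptions for illustration only, not the PR's exact implementation:

```bash
# Illustrative sketch only; the heartbeat logic actually shipped in this PR may differ.
HEARTBEAT=/var/www/localhost/htdocs/openemr/sites/default/docker-leader-heartbeat  # illustrative path
COMPLETED=/var/www/localhost/htdocs/openemr/sites/default/docker-completed         # illustrative path
STALE_AFTER=120   # seconds without a heartbeat before the leader is presumed dead

# Leader: refresh the heartbeat file periodically while configuration runs.
leader_heartbeat_loop() {
    while true; do
        touch "$HEARTBEAT"
        sleep 30
    done
}

# Follower: wait for the leader to finish, but bail out if the heartbeat goes stale
# so a new leader can be elected to pick up where configuration left off.
wait_for_leader() {
    local age
    while [[ ! -f "$COMPLETED" ]]; do
        age=$(( $(date +%s) - $(stat -c %Y "$HEARTBEAT" 2>/dev/null || echo 0) ))
        if (( age > STALE_AFTER )); then
            echo "Leader heartbeat is stale; triggering a new election..." >&2
            return 1
        fi
        sleep 10
    done
}
```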

I think we're actually pretty close to the point where any more startup optimizations would be nice to have but not as critical as they are today. That's all while acknowledging that the current setups can perform pretty well today and host OpenEMR with autoscaling for even incredibly large deployments.

What I'd like to introduce in a subsequent PR (or I'd be happy to support anyone who'd like to lead this) would be to modify the containers so that they basically start passing some basic healthcheck as quickly as possible. We can still layer more aggressive and thorough sets of healthchecks on top of that (and even give them grace periods if need be) but having an indication as early as possible that the container is progressing normally even if it's not finished with everything it needs to do would be super helpful.

I think this PR, plus maybe one or two more, is all we'd need to ensure that startup times in the production container (Flex containers wouldn't be in scope for this effort) are fast enough to pass healthchecks quickly; everything beyond that would be more of a "nice to have" rather than something that would substantially impact the overall functioning of the system.

@Jmevorach (Contributor, Author)

Hi @kojiromike ! 👋

I pushed an additional commit that introduces a test_suite.sh script, which validates core container functionality. The intent is to pair it with benchmark.sh — so we have one tool for checking performance characteristics and another for ensuring changes don’t break expected behavior.

I wanted to get the test suite in place before diving into the rest of your review comments, since it will help verify things as I iterate. I’ll be leaving the PR marked as WIP while I work through your remaining feedback.

Hope you’re doing well, and thanks again for the thoughtful review!

@Jmevorach changed the title from "WIP: Optimize OpenEMR 7.0.5 container startup and memory usage" to "Optimize OpenEMR 7.0.5 container startup and memory usage" on Dec 10, 2025
@Jmevorach (Contributor, Author)

@kojiromike @bradymiller @jesdynf

Finished my edits and removed the WIP!

Happy to make any more edits we'd like; hope this message finds everyone well 😄 .

@Jmevorach (Contributor, Author) commented Dec 10, 2025

The test for the 7.0.5 container above is failing because of ...

Patient Controller Security (OpenEMR\Tests\Unit\Portal\PatientControllerSecurity)
 ✔ User cannot update other users profile
 ✔ Pid and pubpid cannot be modified by user
 ✔ Authorization check validates session pid
 ✘ I d o r attack prevention
    ┐
    ├ Failed asserting that exception of type "Exception" is thrown.
    │
    │ /var/www/localhost/htdocs/openemr/vendor/bin/phpunit:122
    ┴

Which indicates that the test found here is failing: https://github.com/openemr/openemr/blob/master/tests/Tests/Unit/Portal/PatientControllerSecurityTest.php

The reason the test is failing is that in testIDORAttackPrevention() (starts on line 154), the test expects an exception:

$this->expectException(\Exception::class);
$this->expectExceptionMessage('Unauthorized');

(see lines 177-178)

But it never calls the code that should throw it. The only remaining line of code in the test is:

$this->assertTrue(true, 'IDOR should be blocked by authorization check');

(see line 181)

So no exception is thrown, and the expectation fails. PHPUnit correctly says: “You said an exception would be thrown, but nothing happened.”

This is a test code issue, not an infrastructure problem.

@bradymiller (Member)

hi @Jmevorach, just fixed the issue in the test code. Will rerun the actions and I'm guessing they will now pass :)

@bradymiller (Member)

hi @Jmevorach, very cool stuff! Posted a couple of notes above.

My main concern is bypassing native OpenEMR database creation, since there have been nuances (weird things seem more likely to happen to users who create their database first and bypass the OpenEMR way of doing it) and ongoing changes to this code in the past.

Would it be possible to make these changes in the flex docker? (since the goal is for the development environment to be as close as possible to production for development and testing purposes)

@Jmevorach (Contributor, Author)

@bradymiller I made all the requested edits and the tests pass for the container!

Thanks @bradymiller and @kojiromike for taking the time to review this PR; really appreciate it.

And sure happy to make these changes to the Flex containers after we finalize this merge (just so I know what changes to bring over).

I can open up another PR after this for the Flex containers or I can help someone else if they'd like to open up a PR before/instead of me.

@bradymiller (Member)

Hi,
Looks good to me.
@kojiromike, should we be enforcing shellcheck here, or is that still in progress on this repo?

@kojiromike (Member)

@bradymiller I'd prefer not to introduce any new shellcheck errors. It's hard to tell on this changeset because it's big, but I'd trust @Jmevorach to make good decisions here.

@Jmevorach (Contributor, Author)

@kojiromike @bradymiller Totally understandable! I just put in a commit that should remove all the Shellcheck issues in the new utilities and the new 7.0.5 docker materials.

Re-tested the new container to make sure it works with test_suite.sh and benchmark.sh.

@Jmevorach (Contributor, Author)

@kojiromike @bradymiller

Went in and made it so that all new materials and all files in the 7.0.5 Docker folder pass all tests.

Validated everything is working properly with the benchmark.sh and test_suite.sh utilities also introduced by the PR.

@Jmevorach (Contributor, Author) commented Dec 16, 2025

Went through the rest of the fsupgrade-*.sh scripts and replicated the same upgrades that were made for fsupgrade-1.sh.

@Jmevorach (Contributor, Author)

Hello all! Hope this message finds everyone well — just a quick note on scope.

At this point I’m not planning to add any additional functionality or files to this PR unless there’s something specific anyone would like changed or added.

The diff is already fairly large (~6k+ lines added, ~1k+ lines removed), and I’d like to keep it as readable and reviewable as possible for the long term.

That said, I’m absolutely happy to make targeted edits, refactors, or follow-ups based on feedback, or split additional work into a separate PR if that’s preferred. Just let me know what would be most helpful.

@kojiromike (Member)

@Jmevorach Thank you so much for this great update. I have to confess I'm probably not the target audience to actually test and merge it. OpenCoreEMR uses neither the openemr official images nor openemr.sh for real-world deployments. The only place I do use the official images is in development dockers, which use the flex images, not the prod 7.0.X ones.

I'm probably going to leave this up to @jesdynf and @bradymiller, but if you know of a way I can help test within my current workflow I'm happy to try.

@Jmevorach closed this on Dec 19, 2025

@Jmevorach (Contributor, Author)

Thanks @kojiromike! It's really cool to be one of OpenEMR's contributors, and I enjoy every time I get to contribute 😄.

This is a project I really believe in and want to support.

No rush! I accidentally wiped the commit history on GitHub while I was poking around earlier, so I reopened another PR using a backup here: #529
