Optimize OpenEMR 7.0.5 container startup and memory usage #509
Conversation
kojiromike
left a comment
This is great. I'm looking forward to getting more benchmarking data out of the gate.
I hope you don't mind my reviewing it early despite the WIP marker. I just don't know when I'll get another chance.
Mainly my comments are around taking better advantage of bash. If we're not going to try to be POSIX compliant then we should take full advantage of bash features, particularly integer arithmetic and shopts like nullglob, globstar and maybe extglob.
The other stuff is about taking better advantage of current docker buildtime features like caching.
And maybe it's not a conversation for today, but maybe we don't need to have bash, python and php on the image. Some day, we should consider rewriting these tools in a single language. We can take advantage of Symfony CLI and other libs already included in openemr to avoid adding more overhead or dependency management.
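For illustration only, here is a minimal bash sketch of the features mentioned above (nullglob, globstar, and arithmetic evaluation); the glob path is hypothetical and not taken from openemr.sh:

```bash
#!/usr/bin/env bash
# Illustrative sketch only; the path below is an assumption, not repo code.
shopt -s nullglob globstar

# With nullglob an unmatched pattern expands to nothing (not to itself), and
# with globstar ** recurses into subdirectories, so this array is safe to use
# even when no files match.
site_files=( /var/www/localhost/htdocs/openemr/sites/**/*.php )

# Bash arithmetic instead of spawning expr/test subprocesses.
if (( ${#site_files[@]} > 0 )); then
  echo "found ${#site_files[@]} PHP files under sites/"
else
  echo "no site PHP files found"
fi
```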
Thank you for taking the time to go through all this @kojiromike; I really appreciate it. I'll be busy for a little bit, but I should have time in ~2 weeks to start going through all the comments you left and making edits. If anyone wants to step in before then and start making edits themselves, please feel free; I welcome all the help I can get on this PR.

The time from container start to when it passes healthchecks is really important to me, because the shorter that time is, the better our autoscaling will be when running multiple horizontal replicas. If it takes 5 minutes to spin up a new node it doesn't mean you can't autoscale, but it does mean you won't be able to autoscale as well as if it took 1 minute. There are things you can do with Karpenter and other autoscalers to autoscale performantly even when startup time is relatively long (things like having pre-spun-up hosts waiting and ready to serve traffic rather than spinning up hosts from scratch every time there's an autoscaling event), but reducing the "time from start to healthy" will improve horizontal autoscaling no matter where OpenEMR is run.

Reducing the size of the container image is good in general (especially because it lowers the hardware specs you'd need to run OpenEMR with Docker and shortens the time it takes to pull the container), but the rise of container caching technologies (specifically for Kubernetes, and basically anywhere else you'd host this in a large public cloud provider) means I'm planning to prioritize it less than reducing the amount of work the container has to perform when it starts, because I think that work affects startup time far more than image size does. I feel the same way about reducing container build times. Reducing build times is never a bad thing (especially because it lets developers test more easily), but I think prioritizing performing the fewest operations possible, as quickly as possible, on container start (especially if those operations are the same each time and can be accomplished at Docker build time or by configuring the repositories that build materials are pulled from) is going to give us the greatest value for effort for the time being. I think the fastest way to startup would basically be to ...
What I'd really like to do is introduce automated CI/CD for container builds that verifies that all of this functionality works as it's supposed to:
I've managed to make a "high availability" docker compose setup ("high availability" in quotes because even if you have two containers running, running both of them on the same host is not really high availability) that launches two containers behind a load balancer in docker-compose and can be used to test that Swarm node configuration works, so I'm making some progress in this regard. But, again, I'll happily take all the help I can get pushing this forward.

The benchmarking is a good first step because it lets us quickly verify the performance impact of a change to one of the containers, but as long as we have to more or less manually verify that the functionality above works as expected, there's a limit to how fast we'll be able to iterate. By the way, thanks for all your work introducing the unit tests you've added so far; I'm a big fan of what's going on in the "tests" folder in the "openemr/openemr" repo and everything going on in the ".github" folder in this repo.

If we have CI/CD that verifies all the functionality the container should have and uses the benchmarking utilities, the process of finding optimizations like this becomes much easier. Anyone would be able to fork the repository, make a change, and know in minutes whether that change improved the container and maintained the functionality we need to properly run OpenEMR.
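As a rough sketch of what one step of such a CI check could look like: build and start the stack, wait until Docker reports the container healthy, then hit the login page once. The service name, port, timeout, and the presence of a compose-provided database are all assumptions, and the health status check relies on the image defining a Docker HEALTHCHECK.

```bash
#!/usr/bin/env bash
# Hedged CI sketch: assumes a compose file that provides the database and
# required environment, and a service named "openemr" exposing port 80.
set -euo pipefail

cd docker/openemr/7.0.5
docker compose up -d --build

cid="$(docker compose ps -q openemr)"    # assumed service name
deadline=$(( SECONDS + 900 ))            # allow up to 15 minutes for first boot

until [[ "$(docker inspect --format '{{.State.Health.Status}}' "$cid")" == "healthy" ]]; do
  if (( SECONDS > deadline )); then
    echo "container never became healthy" >&2
    docker logs "$cid" | tail -n 50 >&2
    exit 1
  fi
  sleep 10
done

# Minimal smoke test: the login page should answer with HTTP 200.
curl -fsS -o /dev/null "http://localhost:80/interface/login/login.php?site=default"
echo "container healthy and serving the login page"
```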
Agreed
Let me see if I understand. Is openemr.sh slow when the application is already set up and ready to go, and we're just scaling it horizontally by adding another node in, basically?
Both. Most of my contributions to the OpenEMR containers come from a perspective similar to an operator of a cluster (be it Kubernetes or, in the case of ECS, otherwise) who runs OpenEMR on behalf of a larger organization. Sort of like if you were a DevOps engineer hired by a hospital and tasked with operating a cluster running OpenEMR. The difference is that rather than being concerned about a specific installation, I'm basically trying to make machines that print out highly reliable setups for other operators. Most of the time I try to do that by acting like I'm an operator myself and stress-testing the setups I'm printing out to see what I need to improve in the machine that makes them.

When we first started making OpenEMR on ECS we noticed that the deployment was failing because the healthchecks were timing out. At first we thought this was because the containers had failed some setup task and were just sitting idle until they failed healthchecks and were terminated, but then we realized the containers were simply taking a really long time to start up.

When we run on ECS we use Fargate and run on a collection of Graviton 2 and Graviton 3 processors in whatever AWS region you decide to use. When we run on EKS we tend to use Graviton 3 or 4 (although auto mode defines node pools with multiple instance types). I ended up deciding on Graviton because 1/ it was cheaper than other comparable instance types and 2/ there's been a bunch of optimization work done for running PHP on Graviton, and I found it generally performed better than comparable x86-based instances. Also, AWS has a lot of ARM chips, so from the perspective of autoscaling I think the chance that one of our clusters asks an AWS data center for another ARM vCPU and the data center says "sorry, we're all out" is pretty low. This allows us to operate OpenEMR at an incredibly large scale and to do so relatively affordably. The initial load testing results we got from the ECS version alone (which is arguably less performant overall than the EKS setup) show us comfortably serving the equivalent of ~20K virtual users or ~4,000 req/sec with the equivalent of $0.079/hour worth of Fargate compute that never went beyond 19% avg. CPU utilization or 31% avg. RAM utilization. I truly believe that both the ECS and EKS setups as they exist today could be configured to meet the demands of even national or global scale deployments and serve many, many concurrent users.

I'd say it takes about ~9-12 minutes for a leader pod to complete configuration on ECS and pass healthchecks, and about ~6-8 minutes for new follower pods to spin up. There's some variation here because it's luck of the draw whether you end up running on Graviton 2 vs. 3, and you can notice the difference in testing. On EKS it takes about ~7-9 minutes for a leader pod to complete configuration and pass healthchecks and about ~4-6 minutes for a follower pod to spin up. Again there's some variation depending on which specific processor you land on, but in general I think the processors we end up running on in the EKS architecture are a bit more performant than the ones we're allocated on average on ECS.
The ECS version is cheaper and easier to manage than the EKS version while the EKS version offers integration paths to a lot more tools and a lot more functionality around really granular observability out of the box so I think it makes sense to offer both. In general I'd advise using the ECS based one for smaller to mid-size installations and the EKS based one for mid-size to large installations but you can definitely use either the ECS or EKS one at any scale. While it's improved over time as we've figured out ways to ...
we've still had to set pretty generous healthchecks to get OpenEMR to run in a container orchestrator on AWS. I imagine that as other architectures are created in the future to run OpenEMR on container orchestrators elsewhere, we'll probably run into similar issues.

It's super normal for applications to take a few minutes to start, either as replicas or, especially, when they boot for the first time. However, applications that take a while to boot and will be scheduled on Kubernetes or a similar orchestrator generally try, as soon as is reasonably possible, to start responding to healthchecks on some endpoint that says "this application might not be ready to serve traffic but it is operating normally". An example of this would be an application that starts responding to healthchecks on such an endpoint almost immediately. If we can get the containers to reliably respond to healthchecks in <2 minutes after they boot, then from the perspective of a user or an operator we'd be able to map our autoscaling capacity very closely to the amount of compute we need at any one time to serve traffic when running in a cloud, and basically all of the nodes would be healthy at any one time, because the longest an unhealthy node would be allowed to serve traffic would be <2 minutes. This would make the system cheaper and more reliable overall.

You can get around the long startup times today by setting the autoscaling thresholds lower than you'd probably want to. Instead of setting the average CPU utilization that triggers a scaling event at 70-90% and worrying about running up against more traffic than you can serve while you spin up new instances, you can set it at 40-60% or something similar so that you're more conservative about how much compute you have ready to serve traffic at any one time. Aside from requiring you to spend more money, this method lets you reliably serve even large volumes of OpenEMR traffic when running in a public cloud.

What's perhaps more concerning than cost is the case where a node fails and we don't have many nodes up at one time (because we can serve a lot of traffic with just a few nodes). Because we've had to configure the healthchecks to be really lenient to accommodate a ~5-7 minute window where a container is running but not responding to healthchecks, this can hypothetically lead to a scenario where a user experiences degraded functionality for a few minutes without really knowing why. Imagine a hypothetical situation where 1 in 4 hosts is doing nothing but serving 5xx responses to users. Neither of our setups needs sticky sessions, and most of the time we're just evenly distributing traffic to our replicas behind a load balancer. That means about 1 in 4 requests sent by a user fails, so from their perspective they're clicking around in OpenEMR, a quarter of their clicks don't work correctly in the UI, and they're left scratching their head about what's going on; it may take a few minutes for that to resolve. It's not the biggest deal in the world, but if we lowered the "time to healthy" by even a few minutes, basically all of the nodes (minus a minute or two here and there) would be healthy at all times.
The faster a container starts responding to healthchecks, the faster we can set up the orchestrator to remove unhealthy nodes and replace them with healthy ones, which greatly improves the reliability of the system overall. We'd also save money on autoscaling because we could set higher CPU and RAM thresholds for triggering scale-out events. In most orchestrators (including ECS and EKS) you can set a startup grace period that says something to the effect of "this container might take 5 minutes to boot initially, but after that first 5-minute grace period do more aggressive healthchecks", which is something we do; so it definitely doesn't take us ~9-11 minutes to identify and remove unhealthy nodes, but it's still pretty high relative to most applications you'd run in a cluster, so it's something I'm still thinking about. This PR aims to address three things I've noticed when running OpenEMR as a horizontal replica:
I think we're actually pretty close to the point where any further startup optimizations would be nice to have but not as critical as they are today. That's all while acknowledging that the current setups can perform pretty well today and host OpenEMR with autoscaling for even incredibly large deployments. What I'd like to introduce in a subsequent PR (or I'd be happy to support anyone who'd like to lead this) is modifying the containers so that they start passing some basic healthcheck as quickly as possible. We can still layer more aggressive and thorough sets of healthchecks on top of that (and even give them grace periods if need be), but having an indication as early as possible that the container is progressing normally, even if it hasn't finished everything it needs to do, would be super helpful. I think this PR plus maybe one or two more is all we'd need to ensure that startup times in the production container are fast enough to pass healthchecks quickly (Flex containers wouldn't be in scope for this effort); everything beyond that would be more of a "nice to have" than something that substantially impacts the overall functioning of the system.
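A hedged sketch of what that early "alive and progressing" signal could look like, layered under a stricter readiness check. The marker file, script name, and URL are assumptions for illustration, not current behavior of these images.

```bash
#!/usr/bin/env bash
# Rough two-tier healthcheck sketch. Idea: report "operating normally" almost
# immediately while first-time setup runs, and only require a real HTTP
# response once setup has finished. Marker path and URL are assumptions.

if [[ ! -f /run/openemr-setup-complete ]]; then
  # Setup still in progress: healthy as long as the startup script is running.
  pgrep -f openemr.sh >/dev/null && exit 0
  exit 1
fi

# Setup finished: require the login page to answer before reporting healthy.
curl -fsS -o /dev/null "http://localhost/interface/login/login.php?site=default"
```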
Hi @kojiromike! 👋 I pushed an additional commit that introduces a test_suite.sh script, which validates core container functionality. The intent is to pair it with benchmark.sh — so we have one tool for checking performance characteristics and another for ensuring changes don’t break expected behavior. I wanted to get the test suite in place before diving into the rest of your review comments, since it will help verify things as I iterate. I’ll be leaving the PR marked as WIP while I work through your remaining feedback. Hope you’re doing well, and thanks again for the thoughtful review!
@kojiromike @bradymiller @jesdynf Finished my edits and removed the WIP! Happy to make any more edits we'd like; hope this message finds everyone well 😄.
The test for the 7.0.5 container above is failing because of ... which indicates that the test found here is failing: https://github.com/openemr/openemr/blob/master/tests/Tests/Unit/Portal/PatientControllerSecurityTest.php The reason the test is failing is that it declares an expected exception (see lines 177-178) but never calls the code that should throw it. The only remaining line of code in the test (see line 181) doesn't trigger it, so no exception is thrown and the expectation fails. PHPUnit correctly says: “You said an exception would be thrown, but nothing happened.” This is a test code issue, not an infrastructure problem.
hi @Jmevorach, just fixed the issue in the test code. Will rerun the actions and guessing they will now pass :)
hi @Jmevorach, very cool stuff! Posted a couple of notes above. My main concern is bypassing native OpenEMR database creation, since there have been nuances (weird things seem more likely to happen to users who create their database first and bypass the OpenEMR way of doing it) and ongoing changes to this code in the past. Would it be possible to make these changes in the flex docker? (Since the goal is for the development environment to be as close as possible to production for development and testing purposes.)
@bradymiller made all the requested edits and the tests pass for the container! Thanks @bradymiller and @kojiromike for taking the time to review this PR; really appreciate it. And sure, happy to make these changes to the Flex containers after we finalize this merge (just so I know what changes to bring over). I can open another PR for the Flex containers after this, or I can help someone else if they'd like to open a PR before/instead of me.
Hi, |
@bradymiller I'd prefer not to introduce any new shellcheck errors. It's hard to tell on this changeset because it's big, but I'd trust @Jmevorach to make good decisions here.
@kojiromike @bradymiller Totally understandable! I just put in a commit that should remove all the ShellCheck issues in the new utilities and the new 7.0.5 Docker materials. Re-tested the new container to make sure it works with test_suite.sh and benchmark.sh.
Went in and made it so that all new materials and all files in the 7.0.5 Docker folder pass all tests. Validated everything is working properly with the
Went through the rest of the
Hello all! Hope this message finds everyone well — just a quick note on scope. At this point I’m not planning to add any additional functionality or files to this PR unless there’s something specific anyone would like changed or added. The diff is already fairly large (~6k+ lines added, ~1k+ lines removed), and I’d like to keep it as readable and reviewable as possible for the long term. That said, I’m absolutely happy to make targeted edits, refactors, or follow-ups based on feedback, or split additional work into a separate PR if that’s preferred. Just let me know what would be most helpful.
@Jmevorach Thank you so much for this great update. I have to confess I'm probably not the target audience to actually test and merge it. OpenCoreEMR uses neither the openemr official images nor openemr.sh for real-world deployments. The only place I do use the official images is in development dockers, which use the flex images, not the prod 7.0.X ones. I'm probably going to leave this up to @jesdynf and @bradymiller, but if you know of a way I can help test within my current workflow I'm happy to try.
Thanks @kojiromike! It's really cool to be one of OpenEMR's contributors and I enjoy every time I get to contribute 😄. This is a project I really believe in and want to support. No rush! I accidentally wiped the commit history on GitHub while I was poking around earlier, so I reopened another PR using a backup here: #529
Refactors the OpenEMR 7.0.5 Docker image for significantly improved startup performance and reduced memory footprint. This is a work-in-progress that requires community testing to verify all container functionality before merging.
Performance Improvements
Benchmarking against the current Docker Hub image (openemr/openemr:7.0.5) shows:
Technical Changes
Build-Time Permission Optimization (Primary)
Moves permission operations (chmod/chown) from runtime to build time
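Not the actual change set, just a rough sketch of the idea: if the image's build stage already ran the recursive chown, the entrypoint only needs a cheap check at startup. The sentinel file, webroot path, and ownership target below are assumptions made for illustration.

```bash
#!/usr/bin/env bash
# Hedged sketch: skip the expensive recursive chown on every container start
# when the build stage already fixed ownership. Paths are illustrative only.
set -euo pipefail

WEBROOT=/var/www/localhost/htdocs/openemr   # assumed webroot

if [[ ! -e /etc/openemr-permissions-done ]]; then
  # Fallback for images built without the build-time permission step.
  chown -R apache:apache "$WEBROOT"
  touch /etc/openemr-permissions-done
fi
```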
Startup Script Modernization
Converts openemr.sh from POSIX sh to bash with set -euo pipefail
Dockerfile Documentation
Testing Needed
This PR needs help verifying that all OpenEMR container functions work correctly:
How to Test
Build and test the optimized image:
cd docker/openemr/7.0.5
docker build -t openemr:7.0.5-optimized .
docker compose up -d
Or use the benchmarking utility to compare performance:
cd utilities/container_benchmarking
./benchmark.sh
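Tying this to the review note about Docker build-time caching, a hedged example of reusing the published image's layers as a build cache when building locally or in CI. This only helps if that image embeds inline cache metadata, so treat the flags and tags here as assumptions rather than a recommendation:

```bash
# Hedged example: let BuildKit reuse layers from the published image.
cd docker/openemr/7.0.5
DOCKER_BUILDKIT=1 docker build \
  --cache-from openemr/openemr:7.0.5 \
  -t openemr:7.0.5-optimized .
```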
Related Files
docker/openemr/7.0.5/* - Optimized container files
utilities/container_benchmarking/ - Benchmarking suite for performance validation
Summary of the Container Benchmarking Utility
The Container Benchmarking Utility (utilities/container_benchmarking/) is a valuable development tool for the OpenEMR project:
What It Does
Why It's Useful for OpenEMR Container Development
summary.sh, compare_results.sh, and CSV export for deeper analysis
The utility democratizes performance testing—any contributor can run benchmarks locally and share quantified results, making it easier to evaluate and merge container improvements.
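For comparison with what the benchmarking suite reports, here is a minimal sketch (not part of utilities/container_benchmarking) of measuring the "time from start to healthy" discussed throughout this thread. The default image tag, polling interval, and reliance on the image's Docker HEALTHCHECK are assumptions.

```bash
#!/usr/bin/env bash
# Minimal sketch: time how long a container takes to go from "docker run" to
# a "healthy" Docker health status. Image tag and interval are assumptions.
set -euo pipefail

IMAGE="${1:-openemr:7.0.5-optimized}"
start=$SECONDS

cid="$(docker run -d "$IMAGE")"
trap 'docker rm -f "$cid" >/dev/null' EXIT

while [[ "$(docker inspect --format '{{.State.Health.Status}}' "$cid")" != "healthy" ]]; do
  sleep 5
done

echo "time to healthy: $(( SECONDS - start ))s"
```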