**DO NOT READ THIS FILE ON GITHUB, GUIDES ARE PUBLISHED ON https://guides.rubyonrails.org.**

Tuning Performance for Deployment
=================================

This guide covers performance and concurrency configuration for deploying your production Ruby on Rails application.

After reading this guide, you will know:

* Whether to use Puma, the default application server
* How to configure important performance settings for Puma
* How to begin performance testing your application settings

This guide focuses on web servers, which are the primary performance-sensitive component of most web applications. Other
components like background jobs and WebSockets can be tuned but won't be covered by this guide.

More information about how to configure your application can be found in the [Configuration Guide](configuring.html).

--------------------------------------------------------------------------------

This guide assumes you are running [MRI](https://ruby-lang.org), the canonical implementation of Ruby also known as
CRuby. If you're using another Ruby implementation such as JRuby or TruffleRuby, most of this guide doesn't apply.
If needed, check sources specific to your Ruby implementation.

Choosing an Application Server
------------------------------

Puma is Rails' default application server and the most commonly used server across the community.
It works well in most cases, but in some cases you may wish to switch to another.

Each application server uses a particular concurrency method.
For example, Unicorn uses processes, Puma and Passenger offer hybrid process- and thread-based concurrency, and Falcon
uses fibers.

A full discussion of Ruby's concurrency methods is beyond the scope of this document, but the key tradeoffs between
processes and threads will be presented.
If you want to use a method other than processes and threads, you will need to use a different application server.

This guide will focus on how to tune Puma.

What to Optimize for?
------------------------------

In essence, tuning a Ruby web server means making a tradeoff between multiple properties such as memory usage,
throughput, and latency.

Throughput is a measure of how many requests per second the server can handle, while latency is a measure of how
long individual requests take (also referred to as response time).

Some users may want to maximize throughput to keep their hosting costs low, others may want to minimize latency
to offer the best user experience, and many will look for a compromise somewhere in between.

It is important to understand that optimizing for one property will generally hurt at least one other.

### Understanding Ruby's Concurrency and Parallelism

[CRuby](https://www.ruby-lang.org/en/) has a [Global Interpreter Lock](https://en.wikipedia.org/wiki/Global_interpreter_lock),
often called the GVL or GIL.
The GVL prevents multiple threads from running Ruby code at the same time in a single process.
Multiple threads can be waiting on network data, database operations, or some other non-Ruby work generally referred to
as I/O operations, but only one can actively run Ruby code at a time.

This means that thread-based concurrency allows for increased throughput by concurrently processing web requests
whenever they do I/O operations, but may degrade latency: when an I/O operation completes, the thread that performed
it may have to wait before it can resume executing Ruby code.
Similarly, Ruby's garbage collector is "stop-the-world", so when it triggers, all threads have to stop.

This also means that regardless of how many threads a Ruby process contains, it will never use more than a single CPU
core.
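
To build an intuition for this, here is a small standalone script (not from the Rails codebase; the workloads and
numbers are purely illustrative, with `sleep` standing in for a database or network wait) contrasting CPU-bound and
I/O-bound work across threads:

```ruby
require "benchmark"

cpu_work = -> { 2_000_000.times { Math.sqrt(42) } } # pure Ruby: holds the GVL
io_work  = -> { sleep 0.5 }                         # releases the GVL while waiting

Benchmark.bm(15) do |x|
  # CPU-bound: 4 threads take about as long as doing the work serially,
  # because only one thread can execute Ruby code at a time.
  x.report("cpu, serial")    { 4.times { cpu_work.call } }
  x.report("cpu, 4 threads") { Array.new(4) { Thread.new(&cpu_work) }.each(&:join) }

  # I/O-bound: 4 threads overlap their waits and finish roughly 4x faster.
  x.report("io, serial")     { 4.times { io_work.call } }
  x.report("io, 4 threads")  { Array.new(4) { Thread.new(&io_work) }.each(&:join) }
end
```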

Because of this, if your application only spends 50% of its time doing I/O operations, using more than 2 or 3 threads
per process may severely hurt latency, and the gains in throughput will quickly hit diminishing returns.

Generally speaking, a well-crafted Rails application that isn't suffering from slow SQL queries or N+1 problems doesn't
spend more than 50% of its time doing I/O operations, and hence is unlikely to benefit from more than 3 threads.
However, some applications that do call third-party APIs inline may spend a very large proportion of their time doing
I/O operations and may benefit from more threads than that.

The way to achieve true parallelism with Ruby is to use multiple processes. As long as there is a free CPU core, Ruby
processes don't have to wait on one another before resuming execution after an I/O operation is complete.
However, processes only share a fraction of their memory via [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write),
so one additional process uses more memory than an additional thread would.

Note that while threads are cheaper than processes, they are not free, and increasing the number of threads per process
also increases memory usage.

### Practical Implications

Users interested in optimizing for throughput and server utilization will want to run one process per CPU core and
increase the number of threads per process until the impact on latency is deemed too high.

Users interested in optimizing for latency will want to keep the number of threads per process low.
To optimize for latency even further, users can even set the thread count per process to `1` and run `1.5` or `1.3`
processes per CPU core to account for when processes are idle waiting for I/O operations.

It is important to note that some hosting solutions may only offer a relatively small amount of memory (RAM) per CPU
core, preventing you from running as many processes as needed to use all CPU cores.
However, most hosting solutions have different plans with different ratios of memory and CPU.

Another thing to consider is that Ruby memory usage benefits from economies of scale thanks to
[copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write).
So `2` servers with `32` Ruby processes each will use less memory per CPU core than `16` servers with `4` Ruby processes
each.

Configurations
--------------

### Puma

The Puma configuration resides in the `config/puma.rb` file.
The two most important Puma settings are the number of threads per process, and the number of processes,
which Puma calls `workers`.

The number of threads per process is configured via the `threads` directive.
In the default generated configuration, it is set to `3`.
You can modify it either by setting the `RAILS_MAX_THREADS` environment variable or simply editing the configuration
file.

The number of processes is configured by the `workers` directive.
If you use more than one thread per process, then it should be set to how many CPU cores are available on the server,
or, if the server is running multiple applications, to how many cores you want the application to use.
If you only use one thread per worker, then you can raise it above one process per core to account for when workers are
idle waiting for I/O operations.
In the default generated configuration, it is set to use all the available processor cores on the server via the
`Concurrent.available_processor_count` helper. You can also modify it by setting the `WEB_CONCURRENCY` environment variable.
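
For illustration, here is a trimmed-down `config/puma.rb` in the spirit of the one Rails generates. The exact generated
file varies between Rails versions, so treat this as a sketch rather than a copy-paste target:

```ruby
# config/puma.rb
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))
threads threads_count, threads_count

if ENV["RAILS_ENV"] == "production"
  require "concurrent-ruby"
  # One process per available CPU core, overridable via WEB_CONCURRENCY.
  workers_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.available_processor_count })
  workers workers_count if workers_count > 1
end
```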

### YJIT

Recent Ruby versions come with a [Just-in-time compiler](https://en.wikipedia.org/wiki/Just-in-time_compilation)
called [`YJIT`](https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md).

Without going into too many details, JIT compilers make code execute faster, at the expense of using somewhat more
memory.
Unless you really cannot spare this extra memory usage, it is highly recommended to enable YJIT.

As of Rails 7.2, if your application is running on Ruby 3.3 or later, YJIT is automatically enabled by Rails
by default.
Older versions of Rails or Ruby have to enable it manually; please refer to the
[`YJIT documentation`](https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md) for how to do it.
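
For instance, on Ruby 3.3 with an older Rails version, an initializer along these lines (the file name here is
arbitrary) is one way to turn YJIT on once the application has booted; `RubyVM::YJIT.enable` is only available on
Ruby 3.3+:

```ruby
# config/initializers/enable_yjit.rb
if defined?(RubyVM::YJIT.enable)
  Rails.application.config.after_initialize do
    # Enable YJIT after boot, so the boot process itself isn't compiled.
    RubyVM::YJIT.enable
  end
end
```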

If the extra memory usage is a problem, before entirely disabling YJIT, you can try tuning it to use less memory via
[the `--yjit-exec-mem-size` configuration](https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#decreasing---yjit-exec-mem-size).

### Memory Allocators and Configuration

Because of how the default memory allocator works on most Linux distributions, running Puma with multiple threads can
lead to an unexpected increase in memory usage caused by [memory fragmentation](https://en.wikipedia.org/wiki/Fragmentation_\(computing\)).
In turn, this increased memory usage may prevent your application from fully utilizing the server's CPU cores.

To alleviate this problem, it is highly recommended to configure Ruby to use an alternative memory allocator:
[jemalloc](https://github.com/jemalloc/jemalloc).

The default Dockerfile generated by Rails already comes preconfigured to install and use `jemalloc`. But if your hosting
solution isn't Docker-based, you should look into how to install and enable jemalloc there.

If for some reason that isn't possible, a less efficient alternative is to configure the default allocator in a way that
reduces memory fragmentation by setting `MALLOC_ARENA_MAX=2` in your environment.
Note however that this might make Ruby slower, so `jemalloc` is the preferred solution.
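
As an illustration, on a Debian-based system the setup could look roughly like this. The package name and library path
vary by distribution and architecture, so check your own system before relying on them:

```bash
# Install jemalloc and preload it so Ruby's allocations go through it.
apt-get install -y libjemalloc2
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 bundle exec puma -C config/puma.rb

# Fallback when jemalloc isn't available: cap glibc's malloc arenas instead.
MALLOC_ARENA_MAX=2 bundle exec puma -C config/puma.rb
```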

Performance Testing
--------------------

Because every Rails application is different, and every Rails user may want to optimize for different properties,
it is impossible to offer a default configuration or guidelines that work best for everyone.

Hence, the best way to choose your application's settings is to measure the performance of your application, and adjust
the configuration until it is satisfactory for your goals.

This can be done with a simulated production workload, or directly in production with live application traffic.

Performance testing is a deep subject. This guide gives only simple guidelines.

### What to Measure

Throughput is the number of requests per second that your application successfully processes.
Any good load testing program will measure it.
Throughput is normally a single number expressed in "requests per second".

Latency is the delay from the time the request is sent until its response is successfully received, generally expressed
in milliseconds.
Each individual request will have its own latency.

[Percentile](https://en.wikipedia.org/wiki/Percentile_rank) latency is the latency below which a certain percentage of
requests fall.
For instance, `P90` is the 90th-percentile latency: in a given load test, only 10% of requests took longer than that
to process.
The `P50` is the latency such that half your requests were slower; it is also called the median latency.
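
To make this concrete, here is a toy nearest-rank percentile calculation over a handful of made-up latencies; a real
load tester computes the same kind of statistic over thousands of requests for you:

```ruby
# Latencies in milliseconds, sorted ascending.
latencies = [12, 15, 18, 22, 25, 31, 40, 55, 120, 480]

def percentile(sorted, pct)
  # Nearest-rank method: pick the value below which roughly pct% of samples fall.
  sorted[((pct / 100.0) * (sorted.length - 1)).round]
end

percentile(latencies, 50) # => 31  (P50, the median: half the requests were faster)
percentile(latencies, 90) # => 120 (P90: only 10% of requests were slower)
```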

"Tail latency" refers to high-percentile latencies.
For instance, the `P99` is the latency such that only 1% of your requests were worse.
`P99` is a tail latency.
`P50` is not a tail latency.

Generally speaking, the average latency isn't a good metric to optimize for.
It is best to focus on median (`P50`) and tail (`P95` or `P99`) latency.

### Production Measurement

If your production environment includes more than one server, it can be a good idea to do
[A/B testing](https://en.wikipedia.org/wiki/A/B_testing) there.
For instance, you could run half of the servers with `3` threads per process and the other half with `4` threads per
process, and then use an application performance monitoring service to compare the throughput and latency of the two
groups.
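
With the default generated `config/puma.rb`, such an experiment can be driven entirely through environment variables;
the worker and thread counts below are illustrative:

```bash
# Group A: half of the servers.
RAILS_MAX_THREADS=3 WEB_CONCURRENCY=8 bundle exec puma -C config/puma.rb

# Group B: the other half.
RAILS_MAX_THREADS=4 WEB_CONCURRENCY=8 bundle exec puma -C config/puma.rb
```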

Application performance monitoring services are numerous: some are self-hosted, some are cloud solutions, and many offer
a free tier plan.
Recommending a particular one is beyond the scope of this guide.

### Load Testers

You will need a load testing program to send requests to your application.
This can be a dedicated load testing program of some kind, or you can write a small application to make HTTP requests
and track how long they take.
You should not normally check the time in your Rails log file.
That time is only how long Rails took to process the request. It does not include time taken by the application server.

Sending many simultaneous requests and timing them can be difficult. It is easy to introduce subtle measurement errors.
Normally you should use a load testing program, not write your own. Many load testers are simple to use, and many
excellent load testers are free.
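
As one concrete example among many (the choice of tool here is an illustration, not a Rails recommendation), the
open-source [wrk](https://github.com/wg/wrk) load tester can exercise an endpoint and report percentile latencies:

```bash
# 4 client threads, 64 open connections, for 60 seconds, with a latency breakdown.
wrk -t4 -c64 -d60s --latency http://localhost:3000/
```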

### What You Can Change

You can change the number of threads in your test to find the best tradeoff between throughput and latency for your
application.

Larger hosts with more memory and CPU cores will need more processes for best usage.
You can vary the size and type of hosts from a hosting provider.

Increasing the number of iterations will usually give a more exact answer, but requires more time for testing.

You should test on the same type of host that will run in production.
Testing on your development machine will only tell you what settings are best for that development machine.

### Warmup

Your application should process a number of requests after startup that are not included in your final measurements.
These requests are called "warmup" requests, and are usually much slower than later "steady-state" requests.

Your load testing program will usually support warmup requests. You can also run it more than once and throw away the
first set of times.

You have enough warmup requests when increasing their number does not significantly change your result.
[The theory behind this can be complicated](https://arxiv.org/abs/1602.00602), but most common situations are
straightforward: test several times with different amounts of warmup. See how many warmup iterations are needed before
the results stay roughly the same.

A very long warmup can be useful for testing memory fragmentation and other issues that happen only after many requests.

### Which Requests

Your application probably accepts many different HTTP requests.
You should begin by load testing with just a few of them.
You can add more kinds of requests over time.
If a particular kind of request is too slow in your production application, you can add it to your load testing code.

A synthetic workload cannot perfectly match your application's production traffic.
It is still helpful for testing configurations.

### What to Look For

Your load testing program should allow you to check latencies, including percentile and tail latencies.

For different numbers of processes and threads, or different configurations in general, check the throughput and one or
more latencies such as `P50`, `P90`, and `P99`.
Increasing the number of threads will improve throughput up to a point, but worsen latency.

Choose a tradeoff between latency and throughput based on your application's needs.