
Commit ea757ae

Merge pull request rails#51924 from Shopify/performance-tuning-guide
Add a Rails Guide called "Tuning Performance for Deployment" (second version)
2 parents b9f814a + f719787 commit ea757ae

File tree

5 files changed: +277 −7 lines changed


Gemfile.lock

Lines changed: 2 additions & 2 deletions

@@ -89,7 +89,7 @@ PATH
     activesupport (8.0.0.alpha)
       base64
       bigdecimal
-      concurrent-ruby (~> 1.0, >= 1.0.2)
+      concurrent-ruby (~> 1.0, >= 1.3.1)
       connection_pool (>= 2.2.5)
       drb
       i18n (>= 1.6, < 2)
@@ -182,7 +182,7 @@ GEM
     cgi (0.4.1)
     chef-utils (18.3.0)
       concurrent-ruby
-    concurrent-ruby (1.2.2)
+    concurrent-ruby (1.3.1)
     connection_pool (2.4.1)
     crack (0.4.5)
       rexml

activesupport/activesupport.gemspec

Lines changed: 1 addition & 1 deletion

@@ -36,7 +36,7 @@ Gem::Specification.new do |s|
   s.add_dependency "i18n", ">= 1.6", "< 2"
   s.add_dependency "tzinfo", "~> 2.0", ">= 2.0.5"
-  s.add_dependency "concurrent-ruby", "~> 1.0", ">= 1.0.2"
+  s.add_dependency "concurrent-ruby", "~> 1.0", ">= 1.3.1"
   s.add_dependency "connection_pool", ">= 2.2.5"
   s.add_dependency "minitest", ">= 5.1"
   s.add_dependency "base64"

guides/source/documents.yaml

Lines changed: 4 additions & 0 deletions

@@ -252,6 +252,10 @@
           name: Composite Primary Keys
           url: active_record_composite_primary_keys.html
           description: This guide is an introduction to composite primary keys for database tables.
+        -
+          name: Tuning Performance for Deployment
+          url: tuning_performance_for_deployment.html
+          description: This guide covers performance and concurrency configuration for deploying your production Ruby on Rails application.
         -
           name: Extending Rails
Lines changed: 267 additions & 0 deletions
@@ -0,0 +1,267 @@
**DO NOT READ THIS FILE ON GITHUB, GUIDES ARE PUBLISHED ON https://guides.rubyonrails.org.**

Tuning Performance for Deployment
=================================

This guide covers performance and concurrency configuration for deploying your production Ruby on Rails application.

After reading this guide, you will know:

* Whether to use Puma, the default application server
* How to configure important performance settings for Puma
* How to begin performance testing your application settings

This guide focuses on web servers, which are the primary performance-sensitive component of most web applications. Other
components like background jobs and WebSockets can be tuned but won't be covered by this guide.

More information about how to configure your application can be found in the [Configuration Guide](configuring.html).

--------------------------------------------------------------------------------
This guide assumes you are running [MRI](https://ruby-lang.org), the canonical implementation of Ruby, also known as
CRuby. If you're using another Ruby implementation such as JRuby or TruffleRuby, most of this guide doesn't apply.
If needed, check sources specific to your Ruby implementation.

Choosing an Application Server
------------------------------

Puma is Rails' default application server and the most commonly used server across the community.
It works well in most cases, but in some cases you may wish to change to another.

An application server uses a particular concurrency method.
For example, Unicorn uses processes, Puma and Passenger use hybrid process- and thread-based concurrency, and Falcon
uses fibers.

A full discussion of Ruby's concurrency methods is beyond the scope of this document, but the key tradeoffs between
processes and threads will be presented.
If you want to use a method other than processes and threads, you will need to use a different application server.

This guide will focus on how to tune Puma.
What to Optimize for?
---------------------

In essence, tuning a Ruby web server means making a tradeoff between multiple properties such as memory usage,
throughput, and latency.

Throughput is the measure of how many requests per second the server can handle, and latency is the measure of how
long individual requests take (also referred to as response time).

Some users may want to maximize throughput to keep their hosting costs low, others may want to minimize latency
to offer the best user experience, and many will look for some compromise somewhere in the middle.

It is important to understand that optimizing for one property will generally hurt at least one other.
### Understanding Ruby's Concurrency and Parallelism

[CRuby](https://www.ruby-lang.org/en/) has a [Global Interpreter Lock](https://en.wikipedia.org/wiki/Global_interpreter_lock),
often called the GVL or GIL.
The GVL prevents multiple threads from running Ruby code at the same time in a single process.
Multiple threads can be waiting on network data, database operations, or some other non-Ruby work, generally referred to
as I/O operations, but only one can actively run Ruby code at a time.

This means that thread-based concurrency allows for increased throughput by concurrently processing web requests
whenever they do I/O operations, but may degrade latency whenever an I/O operation completes: the thread that performed
it may have to wait before it can resume executing Ruby code.
Similarly, Ruby's garbage collector is "stop-the-world", so when it triggers, all threads have to stop.

This also means that regardless of how many threads a Ruby process contains, it will never use more than a single CPU
core.

Because of this, if your application only spends 50% of its time doing I/O operations, using more than 2 or 3 threads
per process may severely hurt latency, and the gains in throughput will quickly hit diminishing returns.

Generally speaking, a well-crafted Rails application that isn't suffering from slow SQL queries or N+1 problems doesn't
spend more than 50% of its time doing I/O operations, hence is unlikely to benefit from more than 3 threads.
However, some applications that call third-party APIs inline may spend a very large proportion of their time doing
I/O operations and may benefit from more threads than that.

The way to achieve true parallelism with Ruby is to use multiple processes. As long as there is a free CPU core, Ruby
processes don't have to wait on one another before resuming execution after an I/O operation completes.
However, processes only share a fraction of their memory via [copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write),
so one additional process uses more memory than an additional thread would.

Note that while threads are cheaper than processes, they are not free, and increasing the number of threads per process
also increases memory usage.
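The GVL's effect is easy to observe. In this illustrative sketch (not part of the guide), CPU-bound work gains nothing from threads, because only one thread can run Ruby code at a time:

```ruby
require "benchmark"

# Pure-Ruby, CPU-bound work: no I/O, so the GVL serializes it.
def fib(n)
  n < 2 ? n : fib(n - 1) + fib(n - 2)
end

serial = Benchmark.realtime { 4.times { fib(25) } }

# Four threads computing the same work: under the GVL this takes
# roughly as long as the serial version.
threaded = Benchmark.realtime do
  4.times.map { Thread.new { fib(25) } }.each(&:join)
end

puts format("serial: %.2fs, threaded: %.2fs", serial, threaded)
```

Replace `fib` with something that sleeps or waits on a socket, and the threaded version wins: that is exactly the I/O case described above.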
### Practical Implications

Users interested in optimizing for throughput and server utilization will want to run one process per CPU core and
increase the number of threads per process until the impact on latency is deemed too important.

Users interested in optimizing for latency will want to keep the number of threads per process low.
To optimize for latency even further, users can even set the thread count per process to `1` and run `1.5` or `1.3`
processes per CPU core to account for when processes are idle waiting for I/O operations.

It is important to note that some hosting solutions may only offer a relatively small amount of memory (RAM) per CPU
core, preventing you from running as many processes as needed to use all CPU cores.
However, most hosting solutions have different plans with different ratios of memory and CPU.

Another thing to consider is that Ruby memory usage benefits from economies of scale thanks to
[copy-on-write](https://en.wikipedia.org/wiki/Copy-on-write).
So `2` servers with `32` Ruby processes each will use less memory per CPU core than `16` servers with `4` Ruby
processes each.
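As a back-of-the-envelope sketch (the `1.3` over-provisioning factor is the illustrative value from above, not a universal rule), the two strategies translate to worker counts like this:

```ruby
require "etc"

cores = Etc.nprocessors

# Throughput-oriented: one multi-threaded process per CPU core.
throughput_workers = cores

# Latency-oriented: single-threaded workers, slightly over-provisioned
# to cover processes idle on I/O.
latency_workers = (cores * 1.3).ceil

puts "#{cores} cores: #{throughput_workers} workers (throughput) " \
     "or #{latency_workers} single-threaded workers (latency)"
```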
Configurations
--------------

### Puma

The Puma configuration resides in the `config/puma.rb` file.
The two most important Puma settings are the number of threads per process, and the number of processes,
which Puma calls `workers`.

The number of threads per process is configured via the `threads` directive.
In the default generated configuration, it is set to `3`.
You can modify it either by setting the `RAILS_MAX_THREADS` environment variable or by editing the configuration
file.

The number of processes is configured by the `workers` directive.
If you use more than one thread per process, then it should be set to how many CPU cores are available on the server,
or, if the server is running multiple applications, to how many cores you want the application to use.
If you only use one thread per worker, then you can increase it to more than one worker per CPU core to account for
when workers are idle waiting for I/O operations.
In the default generated configuration, it is set to use all the available processor cores on the server via the
`Concurrent.available_processor_count` helper. You can also modify it by setting the `WEB_CONCURRENCY` environment
variable.
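Put together, the relevant part of a `config/puma.rb` looks roughly like this sketch (the values shown are the defaults described above, not recommendations):

```ruby
# config/puma.rb (excerpt)

# Threads per process: fewer favors latency, more favors throughput.
threads_count = Integer(ENV.fetch("RAILS_MAX_THREADS", 3))
threads threads_count, threads_count

# Processes ("workers"): defaults to one per available CPU core.
require "concurrent-ruby"
workers_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.available_processor_count })
workers workers_count if workers_count > 1

preload_app!
```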
### YJIT

Recent Ruby versions come with a [Just-in-time compiler](https://en.wikipedia.org/wiki/Just-in-time_compilation)
called [`YJIT`](https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md).

Without going into too many details, JIT compilers allow code to execute faster, at the expense of using some more
memory.
Unless you really cannot spare this extra memory usage, it is highly recommended to enable YJIT.

As of Rails 7.2, if your application is running on Ruby 3.3 or later, YJIT will automatically be enabled by Rails
by default.
Older versions of Rails or Ruby have to enable it manually; please refer to the
[YJIT documentation](https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md) for how to do it.

If the extra memory usage is a problem, before entirely disabling YJIT, you can try tuning it to use less memory via
[the `--yjit-exec-mem-size` configuration](https://github.com/ruby/ruby/blob/master/doc/yjit/yjit.md#decreasing---yjit-exec-mem-size).
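To check whether the running Ruby has YJIT available and active, you can query `RubyVM::YJIT`; the `defined?` guard keeps this safe on builds without YJIT:

```ruby
# Reports whether this Ruby build ships YJIT and whether it is enabled.
if defined?(RubyVM::YJIT)
  puts "YJIT enabled: #{RubyVM::YJIT.enabled?}"
else
  puts "This Ruby build does not include YJIT"
end
```

On Ruby 3.3 and later, `RubyVM::YJIT.enable` can also turn it on at runtime if your Rails version does not do so for you.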
### Memory Allocators and Configuration

Because of how the default memory allocator works on most Linux distributions, running Puma with multiple threads can
lead to an unexpected increase in memory usage caused by [memory fragmentation](https://en.wikipedia.org/wiki/Fragmentation_\(computing\)).
In turn, this increased memory usage may prevent your application from fully utilizing the server's CPU cores.

To alleviate this problem, it is highly recommended to configure Ruby to use an alternative memory allocator:
[jemalloc](https://github.com/jemalloc/jemalloc).

The default Dockerfile generated by Rails already comes preconfigured to install and use `jemalloc`. But if your hosting
solution isn't Docker based, you should look into how to install and enable jemalloc there.

If for some reason that isn't possible, a less efficient alternative is to configure the default allocator in a way that
reduces memory fragmentation by setting `MALLOC_ARENA_MAX=2` in your environment.
Note however that this might make Ruby slower, so `jemalloc` is the preferred solution.
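Outside Docker, enabling jemalloc typically means installing the system package and preloading the library. The package name and library path below are Debian/Ubuntu assumptions; adjust them for your distribution:

```shell
# Install jemalloc (Debian/Ubuntu package name; differs elsewhere).
sudo apt-get install -y libjemalloc2

# Preload it for the Ruby process; verify the path on your system.
LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2 bundle exec puma -C config/puma.rb

# Fallback when jemalloc isn't an option: cap glibc's malloc arenas.
# MALLOC_ARENA_MAX=2 bundle exec puma -C config/puma.rb
```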
Performance Testing
-------------------

Because every Rails application is different, and every Rails user may want to optimize for different properties,
it is impossible to offer a default configuration or guidelines that work best for everyone.

Hence, the best way to choose your application's settings is to measure the performance of your application, and adjust
the configuration until it is satisfactory for your goals.

This can be done with a simulated production workload, or directly in production with live application traffic.

Performance testing is a deep subject. This guide gives only simple guidelines.
### What to Measure

Throughput is the number of requests per second that your application successfully processes.
Any good load testing program will measure it.
Throughput is normally a single number expressed in "requests per second".

Latency is the delay from the time the request is sent until its response is successfully received, generally expressed
in milliseconds.
Each individual request will have its own latency.

[Percentile](https://en.wikipedia.org/wiki/Percentile_rank) latency gives the latency below which a certain percentage
of requests fall.
For instance, `P90` is the 90th-percentile latency:
the latency for a single load test where only 10% of requests took longer than that to process.
The `P50` is the latency such that half your requests were slower, also called the median latency.

"Tail latency" refers to high-percentile latencies.
For instance, the `P99` is the latency such that only 1% of your requests were worse.
`P99` is a tail latency.
`P50` is not a tail latency.

Generally speaking, the average latency isn't a good metric to optimize for.
It is best to focus on median (`P50`) and tail (`P95` or `P99`) latency.
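Load testers compute these percentiles for you, but as a quick illustration of the definitions, a nearest-rank percentile over a batch of measured latencies can be sketched as:

```ruby
# Nearest-rank percentile: the smallest latency such that at least
# p% of the samples are less than or equal to it.
def percentile(latencies, p)
  sorted = latencies.sort
  rank = (p / 100.0 * sorted.length).ceil
  sorted[[rank - 1, 0].max]
end

latencies_ms = [12, 15, 14, 200, 16, 13, 18, 17, 250, 14]

puts "P50: #{percentile(latencies_ms, 50)}ms"  # → P50: 15ms
puts "P90: #{percentile(latencies_ms, 90)}ms"  # → P90: 200ms
```

Note how the tail (`P90`) exposes the two slow outliers that the median completely hides, which is why tail latency is worth tracking.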
### Production Measurement

If your production environment includes more than one server, it can be a good idea to do
[A/B testing](https://en.wikipedia.org/wiki/A/B_testing) there.
For instance, you could run half of the servers with `3` threads per process and the other half with `4` threads per
process, and then use an application performance monitoring service to compare the throughput and latency of the two
groups.

Application performance monitoring services are numerous; some are self-hosted, some are cloud solutions, and many offer
a free tier plan.
Recommending a particular one is beyond the scope of this guide.
### Load Testers

You will need a load testing program to make requests of your application.
This can be a dedicated load testing program of some kind, or you can write a small application to make HTTP requests
and track how long they take.
You should not normally check the time in your Rails log file.
That time is only how long Rails took to process the request. It does not include time taken by the application server.

Sending many simultaneous requests and timing them can be difficult. It is easy to introduce subtle measurement errors.
Normally you should use a load testing program, not write your own. Many load testers are simple to use, and many
excellent load testers are free.
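If you do hand-roll a quick measurement despite the caveats above, at least use the monotonic clock, which is immune to wall-clock adjustments. A minimal sketch, with a `sleep` standing in for a real HTTP call:

```ruby
# Times repeated calls to a request-issuing block, returning latencies
# in milliseconds measured with the monotonic clock.
def measure_latencies(count, &request)
  Array.new(count) do
    start = Process.clock_gettime(Process::CLOCK_MONOTONIC)
    request.call
    (Process.clock_gettime(Process::CLOCK_MONOTONIC) - start) * 1000.0
  end
end

# Simulate twenty ~5ms requests; swap the sleep for a real HTTP call
# (e.g. Net::HTTP.get) against a running server.
latencies = measure_latencies(20) { sleep(0.005) }
puts format("min: %.1fms, max: %.1fms", latencies.min, latencies.max)
```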
### What You Can Change

You can change the number of threads in your test to find the best tradeoff between throughput and latency for your
application.

Larger hosts with more memory and CPU cores will need more processes for best usage.
You can vary the size and type of hosts from a hosting provider.

Increasing the number of iterations will usually give a more exact answer, but requires more time for testing.

You should test on the same type of host that will run in production.
Testing on your development machine will only tell you what settings are best for that development machine.
### Warmup

Your application should process a number of requests after startup that are not included in your final measurements.
These requests are called "warmup" requests, and are usually much slower than later "steady-state" requests.

Your load testing program will usually support warmup requests. You can also run it more than once and throw away the
first set of times.

You have enough warmup requests when increasing the number does not significantly change your result.
[The theory behind this can be complicated](https://arxiv.org/abs/1602.00602), but most common situations are
straightforward: test several times with different amounts of warmup, and see how many warmup iterations are needed
before the results stay roughly the same.

Very long warmup can be useful for testing memory fragmentation and other issues that happen only after many requests.
### Which Requests

Your application probably accepts many different HTTP requests.
You should begin by load testing with just a few of them.
You can add more kinds of requests over time.
If a particular kind of request is too slow in your production application, you can add it to your load testing code.

A synthetic workload cannot perfectly match your application's production traffic.
It is still helpful for testing configurations.
### What to Look For

Your load testing program should allow you to check latencies, including percentile and tail latencies.

For different numbers of processes and threads, or different configurations in general, check the throughput and one or
more latencies such as P50, P90, and P99.
Increasing the threads will improve throughput up to a point, but worsen latency.

Choose a tradeoff between latency and throughput based on your application's needs.

railties/lib/rails/generators/rails/app/templates/config/puma.rb.tt

Lines changed: 3 additions & 4 deletions

@@ -32,10 +32,9 @@ when "production"
   # If you are running more than 1 thread per process, the workers count
   # should be equal to the number of processors (CPU cores) in production.
   #
-  # It defaults to 1 because it's impossible to reliably detect how many
-  # CPU cores are available. Make sure to set the `WEB_CONCURRENCY` environment
-  # variable to match the number of processors.
-  workers_count = Integer(ENV.fetch("WEB_CONCURRENCY", 1))
+  # Automatically detect the number of available processors in production.
+  require "concurrent-ruby"
+  workers_count = Integer(ENV.fetch("WEB_CONCURRENCY") { Concurrent.available_processor_count })
   workers workers_count if workers_count > 1

   preload_app!
