This document summarizes the common approaches for performance fine-tuning with
jemalloc (as of 5.1.0). The default configuration of jemalloc tends to work
reasonably well in practice, and most applications should not have to tune any
options. However, in order to cover a wide range of applications and avoid
pathological cases, the default settings are sometimes kept conservative and
suboptimal, even for many common workloads. When jemalloc is properly tuned for
a specific application / workload, it is common to improve system-level metrics
by a few percent, or to make favorable trade-offs.

## Notable runtime options for performance tuning

Runtime options can be set via
[malloc_conf](http://jemalloc.net/jemalloc.3.html#tuning).

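As a minimal sketch (the option values are simply taken from the suggestions
below), an options string can be compiled into the program through the
`malloc_conf` symbol, or supplied at run time through the `MALLOC_CONF`
environment variable:

```c
/* Options string read by jemalloc during initialization.  The same string
 * could instead be supplied at run time without recompiling, e.g.
 *   MALLOC_CONF="background_thread:true,metadata_thp:auto" ./app
 */
const char *malloc_conf = "background_thread:true,metadata_thp:auto";
```
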
* [background_thread](http://jemalloc.net/jemalloc.3.html#background_thread)

  Enabling jemalloc background threads generally improves the tail latency for
  application threads, since unused memory purging is shifted to the dedicated
  background threads. In addition, unintended purging delay caused by
  application inactivity is avoided with background threads.

  Suggested: `background_thread:true` when jemalloc-managed threads can be
  allowed.

* [metadata_thp](http://jemalloc.net/jemalloc.3.html#opt.metadata_thp)

  Allowing jemalloc to utilize transparent huge pages for its internal
  metadata usually reduces TLB misses significantly, especially for programs
  with a large memory footprint and frequent allocation / deallocation
  activity. Metadata memory usage may increase due to the use of huge
  pages.

  Suggested for allocation-intensive programs: `metadata_thp:auto` or
  `metadata_thp:always`, which is expected to improve CPU utilization at a
  small memory cost.

* [dirty_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.dirty_decay_ms) and
  [muzzy_decay_ms](http://jemalloc.net/jemalloc.3.html#opt.muzzy_decay_ms)

  Decay time determines how fast jemalloc returns unused pages to the
  operating system, and therefore provides a fairly straightforward trade-off
  between CPU and memory usage. A shorter decay time purges unused pages
  faster, reducing memory usage (usually at the cost of more CPU cycles spent
  on purging), and vice versa.

  Suggested: tune the values based on the desired trade-offs.

* [narenas](http://jemalloc.net/jemalloc.3.html#opt.narenas)

  By default jemalloc uses multiple arenas to reduce internal lock contention.
  However, a high arena count may also increase overall memory fragmentation,
  since arenas manage memory independently. When a high degree of parallelism
  is not expected at the allocator level, a lower number of arenas often
  improves memory usage.

  Suggested: if low parallelism is expected, try a lower arena count while
  monitoring CPU and memory usage (see the monitoring sketch after this list).

* [percpu_arena](http://jemalloc.net/jemalloc.3.html#opt.percpu_arena)

  Enables dynamic thread-to-arena association based on the running CPU. This
  has the potential to improve locality, e.g. when thread-to-CPU affinity is
  present.

  Suggested: try `percpu_arena:percpu` or `percpu_arena:phycpu` if
  thread migration between processors is expected to be infrequent.

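When experimenting with the options above (e.g. a lower arena count or shorter
decay times), it helps to watch the allocator's own statistics. Below is a
minimal monitoring sketch, assuming an unprefixed jemalloc build with
statistics enabled (the default); the helper name is hypothetical:

```c
#include <stdio.h>
#include <stdint.h>
#include <jemalloc/jemalloc.h>

/* Hypothetical helper: print a coarse memory summary so the effect of
 * narenas / decay tuning can be observed under load. */
static void
print_memory_summary(void) {
    uint64_t epoch = 1;
    size_t sz = sizeof(epoch);
    /* Advance the epoch to refresh the statistics snapshot. */
    mallctl("epoch", &epoch, &sz, &epoch, sizeof(epoch));

    size_t allocated, resident;
    sz = sizeof(size_t);
    mallctl("stats.allocated", &allocated, &sz, NULL, 0);
    mallctl("stats.resident", &resident, &sz, NULL, 0);
    printf("allocated: %zu bytes, resident: %zu bytes\n", allocated, resident);

    /* Full human-readable report, including per-arena statistics. */
    malloc_stats_print(NULL, NULL, NULL);
}
```
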
Examples:

* High resource consumption application, prioritizing CPU utilization:

  `background_thread:true,metadata_thp:auto` combined with relaxed decay time
  (increased `dirty_decay_ms` and / or `muzzy_decay_ms`,
  e.g. `dirty_decay_ms:30000,muzzy_decay_ms:30000`).

* High resource consumption application, prioritizing memory usage:

  `background_thread:true` combined with shorter decay time (decreased
  `dirty_decay_ms` and / or `muzzy_decay_ms`,
  e.g. `dirty_decay_ms:5000,muzzy_decay_ms:5000`), and lower arena count
  (e.g. number of CPUs).

* Low resource consumption application:

  `narenas:1,lg_tcache_max:13` combined with shorter decay time (decreased
  `dirty_decay_ms` and / or `muzzy_decay_ms`, e.g.
  `dirty_decay_ms:1000,muzzy_decay_ms:0`).

* Extremely conservative -- minimize memory usage at all costs, only suitable
  when allocation activity is very rare:

  `narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0`

Note that it is recommended to combine the options with `abort_conf:true`,
which aborts immediately on illegal options.
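
For example (a sketch; the remaining option values are simply the "extremely
conservative" example above), `abort_conf:true` can be prepended to the string:

```c
/* abort_conf:true makes jemalloc abort at startup if the option string
 * contains anything invalid, instead of silently ignoring it. */
const char *malloc_conf =
    "abort_conf:true,narenas:1,tcache:false,dirty_decay_ms:0,muzzy_decay_ms:0";
```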

## Beyond runtime options

In addition to the runtime options, there are a number of programmatic ways to
improve application performance with jemalloc.

* [Explicit arenas](http://jemalloc.net/jemalloc.3.html#arenas.create)

  Manually created arenas can help performance in various ways, e.g. by
  managing locality and contention for specific usages. For example,
  applications can explicitly allocate frequently accessed objects from a
  dedicated arena with
  [mallocx()](http://jemalloc.net/jemalloc.3.html#MALLOCX_ARENA) to improve
  locality. In addition, explicit arenas often benefit from individually
  tuned options, e.g. relaxed [decay
  time](http://jemalloc.net/jemalloc.3.html#arena.i.dirty_decay_ms) if
  frequent reuse is expected (see the sketch at the end of this section).

* [Extent hooks](http://jemalloc.net/jemalloc.3.html#arena.i.extent_hooks)

  Extent hooks allow customization of how the underlying memory is managed.
  One use case for performance purposes is to utilize huge pages -- for
  example,
  [HHVM](https://github.com/facebook/hhvm/blob/master/hphp/util/alloc.cpp)
  uses explicit arenas with customized extent hooks to manage 1GB huge pages
  for frequently accessed data, which reduces TLB misses significantly.

* [Explicit thread-to-arena
  binding](http://jemalloc.net/jemalloc.3.html#thread.arena)

  It is common for some threads in an application to have different memory
  access / allocation patterns. Threads with heavy workloads often benefit
  from explicit binding; e.g. binding very active threads to dedicated arenas
  may reduce contention at the allocator level (see the sketch below).
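
The sketch below ties these pieces together (assuming an unprefixed jemalloc
build; the function name and sizes are made up for illustration): it creates a
dedicated arena, relaxes that arena's dirty decay time, binds the calling
thread to it, and also shows an explicit `mallocx()` allocation from the same
arena. Custom extent hooks (not shown) could be supplied through the `newp` /
`newlen` arguments of `arenas.create`.

```c
#include <stdio.h>
#include <sys/types.h>
#include <jemalloc/jemalloc.h>

/* Hypothetical example: set up a dedicated, individually tuned arena for a
 * hot thread.  Most error handling is omitted for brevity. */
static int
use_dedicated_arena(void) {
    unsigned arena_ind;
    size_t sz = sizeof(arena_ind);

    /* Create a new arena with the default extent hooks. */
    if (mallctl("arenas.create", &arena_ind, &sz, NULL, 0) != 0) {
        return -1;
    }

    /* Per-arena tuning: relax dirty decay for this arena only, e.g. when
     * frequent reuse of its memory is expected. */
    ssize_t decay_ms = 30000;
    char cmd[64];
    snprintf(cmd, sizeof(cmd), "arena.%u.dirty_decay_ms", arena_ind);
    mallctl(cmd, NULL, NULL, &decay_ms, sizeof(decay_ms));

    /* Bind the calling thread to the arena: subsequent plain malloc/free
     * calls from this thread are served by it. */
    mallctl("thread.arena", NULL, NULL, &arena_ind, sizeof(arena_ind));

    /* Alternatively, allocate from the arena explicitly, regardless of
     * which thread is calling. */
    void *p = mallocx(4096, MALLOCX_ARENA(arena_ind));
    if (p != NULL) {
        dallocx(p, 0);
    }
    return 0;
}
```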