@the-mikedavis

This is a somewhat small set of patches to prometheus_text_format which aim to reduce garbage creation during registry formatting. Reducing garbage creation drives down the cost to the VM of scraping large registries - both in terms of peak memory allocation and also the work that the garbage collector must do.

With these changes I see a reduction in allocation reported by tprof in a stress test of one of RabbitMQ's most expensive registries. In a test against single-instance RabbitMQ brokers on EC2 instances, this saves a noticeable amount of peak memory and reduces CPU utilization significantly.

tprof testing instructions
  1. Clone https://github.com/rabbitmq/rabbitmq-server
  2. cd rabbitmq-server
  3. make deps
  4. make run-broker
  5. In another terminal in the rabbitmq-server repo, run sbin/rabbitmqctl import_definitions path/to/100k-classic-queues.json, pointing to this definitions file.
  6. In the shell from the make run-broker terminal, start tprof tracing for new processes: tprof:start(#{type => call_memory}), tprof:enable_trace(new), tprof:set_pattern('_', '_', '_').
  7. In another terminal scrape the expensive endpoint: curl -v localhost:15692/metrics/per-object --output /dev/null
  8. When that's done, collect and format the sample: tprof:format(tprof:inspect(tprof:collect())).

To test this change, Ctrl-C twice out of make run-broker, cd deps/prometheus and check out this branch. Then rm -rf ebin in that directory, cd ../../ and repeat steps 4, 6, 7 and 8 (skipping the definitions import).
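The tprof commands from steps 6 and 8, in a form that can be pasted into the broker shell:

    %% Step 6: start call_memory tracing for newly spawned processes.
    tprof:start(#{type => call_memory}),
    tprof:enable_trace(new),
    tprof:set_pattern('_', '_', '_').
    %% Step 8: after the scrape completes, collect and format the sample.
    tprof:format(tprof:inspect(tprof:collect())).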


Registry collection tprof measurement before this change...
****** Process <0.301089.0>  --  100.00% of total *** 
FUNCTION                                                                                   CALLS      WORDS    PER CALL  [    %]
... removed everything less than 1% ...
prometheus_text_format:render_labels/1                                                   2308195    1944642        0.84  [ 1.01]
erlang:atom_to_binary/2                                                                   651584    2375647        3.65  [ 1.23]
prometheus_rabbitmq_core_metrics_collector:'-emit_queue_info/3-fun-0-'/3                  100000    2500000       25.00  [ 1.29]
prometheus_model_helpers:counter_metric/2                                                 301325    3615900       12.00  [ 1.87]
prometheus_text_format:'-render_labels/1-fun-0-'/2                                        321434    4178642       13.00  [ 2.16]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^1/1-0-'/2             2300145    4400076        1.91  [ 2.28]
prometheus_model_helpers:'-metrics_from_tuples/2-lc$^0/1-0-'/2                           2308456    4616300        2.00  [ 2.39]
lists:'-filter/2-lc$^0/1-0-'/2                                                           2408461    4816304        2.00  [ 2.49]
erlang:integer_to_binary/1                                                               2206892    6620701        3.00  [ 3.43]
prometheus_rabbitmq_core_metrics_collector:label/1                                       2200038   11000022        5.00  [ 5.69]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^0/1-1-'/2             2300145   11500190        5.00  [ 5.95]
prometheus_text_format:'-emit_mf_metrics/2-fun-0-'/3                                     2308150   11541419        5.00  [ 5.97]
prometheus_model_helpers:gauge_metric/2                                                  2006812   24081744       12.00  [12.47]
prometheus_text_format:has_special_char/1                                               23475329   24147190        1.03  [12.50]
prometheus_text_format:render_series/3                                                   2308200   32511401       14.09  [16.83]
ets:match_object/2                                                                            19   38406095  2021373.42  [19.88]
                                                                                                  193184463              [100.0]

Registry collection tprof measurement after this change...
****** Process <0.401000.0>  --  99.99% of total *** 
FUNCTION                                                                                  CALLS      WORDS    PER CALL  [    %]
... removed everything less than 1% ...
prometheus_model_helpers:label_pair/1                                                    429393    1717572        4.00  [ 1.16]
prometheus_text_format:render_labels/1                                                  2308195    1944642        0.84  [ 1.32]
erlang:atom_to_binary/2                                                                  651584    2375647        3.65  [ 1.61]
prometheus_rabbitmq_core_metrics_collector:'-emit_queue_info/3-fun-0-'/3                 100000    2500000       25.00  [ 1.69]
prometheus_model_helpers:counter_metric/2                                                301325    3615900       12.00  [ 2.45]
prometheus_text_format:'-render_labels/1-fun-0-'/2                                       321434    4178642       13.00  [ 2.83]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^1/1-0-'/2            2300145    4400076        1.91  [ 2.98]
prometheus_model_helpers:'-metrics_from_tuples/2-lc$^0/1-0-'/2                          2308456    4616300        2.00  [ 3.13]
lists:'-filter/2-lc$^0/1-0-'/2                                                          2408461    4816304        2.00  [ 3.26]
erlang:integer_to_binary/1                                                              2206892    6620705        3.00  [ 4.49]
prometheus_rabbitmq_core_metrics_collector:label/1                                      2200038   11000022        5.00  [ 7.45]
prometheus_rabbitmq_core_metrics_collector:'-collect_metrics/2-lc$^0/1-1-'/2            2300145   11500190        5.00  [ 7.79]
prometheus_text_format:render_series/4                                                  2308200   11541000        5.00  [ 7.82]
prometheus_text_format:render_value/2                                                   2308200   11543618        5.00  [ 7.82]
prometheus_model_helpers:gauge_metric/2                                                 2006812   24081744       12.00  [16.32]
ets:match_object/2                                                                           19   38406095  2021373.42  [26.02]
                                                                                                 147597866              [100.0]

So with this change, the Cowboy request process in charge of this endpoint allocates 147_597_866 words instead of 193_184_463, a reduction of 45_586_597 words or 23.6%.

Stress-testing on EC2...

On EC2 I have two m7g.xlarge instances running RabbitMQ: galactica, which carries this change, and kestrel, which uses prometheus at v5.1.1 (the latest version RabbitMQ has adopted). A third instance curls these instances at an interval of two seconds with this script:

#! /usr/bin/env bash

N=600
SLEEP=2
for i in $(seq 1 $N)
do
  echo "Sleeping ${SLEEP}s... ($i / $N)"
  sleep $SLEEP
  echo "Ask for metrics from $1... ($i / $N)"
  curl -s "http://$1:15692/metrics/per-object" --output /dev/null &
done

wait

This asynchronously fires off a scrape request every two seconds for twenty minutes. The third node runs this script against both galactica and kestrel at the same time. The third node also scrapes these nodes' node_exporter metrics and the RabbitMQ prometheus endpoint for Erlang allocator metrics.

kestrel (baseline)

[Grafana screenshot: instance-wide memory usage (grafana-kestrel-mem)]
[Grafana screenshot: instance-wide CPU usage (grafana-kestrel-cpu)]
[Grafana screenshot: Erlang allocators (grafana-kestrel-erlang-alloc)]

galactica (this branch)

[Grafana screenshot: instance-wide memory usage (grafana-galactica-mem)]
[Grafana screenshot: instance-wide CPU usage (grafana-galactica-cpu)]
[Grafana screenshot: Erlang allocators (grafana-galactica-erlang-alloc)]

We can see kestrel (baseline) pinned consistently at around 95% CPU usage, hovering at around 9-10 GB of instance-wide memory usage with the VM aware of 3.5-4.5 GB, while galactica (this branch) sits at around 50% CPU usage and 7.5-8.5 GB of instance-wide memory, with the VM tracking around 2-3 GB.

While the peak memory usage is reduced nicely, the main benefit is that the CPU is loaded much less than before - I assume because less garbage collection is being performed.

`prometheus_text_format:has_special_char/1` is called very often when
a registry contains many metrics with label pairs. We can use
`binary:match/2` to search within a label binary for the special
characters (newline, backslash and double-quote) without allocation.

The old code, which used binary matching syntax, creates a match context
every time the function is called (except when it recurses into itself -
then the match context is reused). A match context allocates 5 words on
the process heap when it is created. When matching very many binaries
this adds up to a noticeable amount of short-lived garbage.

In comparison `binary:match/2` with a precompiled match pattern does not
allocate. The BIF for it is also very well optimized, using `memchr`
since OTP 22.
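A minimal sketch of the technique, assuming the compiled pattern is built once and passed to the check (the names here are illustrative; how the patch's `has_special_char/1` caches the pattern may differ):

    %% Compile the pattern for the three special characters once so it can
    %% be reused for every label value that is checked.
    special_chars_pattern() ->
        binary:compile_pattern([<<"\n">>, <<"\\">>, <<"\"">>]).

    %% binary:match/2 with a precompiled pattern searches the binary without
    %% allocating a match context on the process heap.
    has_special_char(LabelValue, Pattern) when is_binary(LabelValue) ->
        binary:match(LabelValue, Pattern) =/= nomatch.
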
The formatting callback for a registry can build each metrics family as
a single binary in order to reduce garbage. This mainly involves passing
the accumulator binary through all functions that append to it.
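For example, the accumulator threading looks roughly like this (function and variable names here are illustrative, not the exact code in the module):

    %% The accumulator binary is passed into each append step and the grown
    %% binary is returned, so a whole family of series ends up in one binary.
    render_series_list(Acc0, Name, Series) ->
        lists:foldl(
          fun({LabelString, Value}, Acc) ->
                  <<Acc/binary, Name/binary, "{", LabelString/binary, "} ",
                    Value/binary, "\n">>
          end, Acc0, Series).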

It's more efficient to append to the resulting binary than to allocate
smaller binaries and then append them. For example:

    <<Blob/binary, Name/binary, "_", Suffix/binary>>.
    %% versus
    Combined = <<Name/binary, "_", Suffix/binary>>,
    <<Blob/binary, Combined/binary>>.

The first expression generates less garbage than the second. A good
example of this was the `add_brackets/1` function. Compiler inlining
does not turn the second expression (above) into the first,
unfortunately, so with automatic inlining we would still pay the cost of
creating a binary with brackets and then copying that into the larger
blob, rather than appending its contents directly. This change manually
inlines `add_brackets/1` into its caller `render_series/4`.
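A simplified before/after sketch (the real `render_series` functions take different arguments; this only shows where the intermediate binary disappears):

    %% Before: add_brackets/1 builds an intermediate bracketed binary which
    %% is then copied into the accumulator.
    add_brackets(LString) ->
        <<"{", LString/binary, "}">>.

    render_series_old(Acc, Name, LString, Value) ->
        <<Acc/binary, Name/binary, (add_brackets(LString))/binary, " ",
          Value/binary, "\n">>.

    %% After: the brackets are appended directly into the accumulator, so no
    %% intermediate binary is created.
    render_series_new(Acc, Name, LString, Value) ->
        <<Acc/binary, Name/binary, "{", LString/binary, "} ",
          Value/binary, "\n">>.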

This change also converts some list strings into binaries. Especially
for ASCII, binary strings are _far_ more compact than list strings. A
list needs two words per ASCII character - one for the character and one
for the tail pointer. So it's like UTF-32 but worse, basically UTF-128 on
a 64-bit machine. ASCII or UTF-8 text in a binary takes one byte per
character in the binary's array, plus a word or two of metadata. E.g.
`<<"hello">>` allocates three words while `"hello"` allocates ten.
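One way to check this from an Erlang shell: `erts_debug:size/1` reports a term's size in heap words (figures assume a 64-bit emulator).

    1> erts_debug:size("hello").
    10
    2> erts_debug:size(<<"hello">>).
    3
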
Building on the work in the parent commit: now that the data being
passed to the `ram_file` is a binary, we can instead build the entire
output gradually within the process. With the `ram_file` we pay I/O
overhead from writing to and then reading from it, since `ram_file` is a
port - all data is passed between the VM and the port driver. The memory
consumed by a port driver is also invisible to the VM's allocators, so
large port driver resource usage should be avoided where possible.

Instead this change refactors the `registry_collect_callback` to fold
over collectors and build up an accumulator. The `create_mf` callback's
return of `ok` forces us to store the accumulator rather than pass it in
and return it. This is a little less hygienic, but it is more efficient
than passing data in and out of a port.

This also introduces a function `format_into/3` which can use this
folding function directly. This can be used to avoid collecting the
entire response in one binary. Instead the response can be streamed
with `cowboy_req:stream_body/3` for example.
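A sketch of what streaming could look like, assuming `format_into/3` takes a registry, a fun that is invoked with each formatted chunk plus an accumulator, and an initial accumulator (the exact signature is defined by this patch and may differ):

    stream_metrics(Req0) ->
        Req = cowboy_req:stream_reply(
                200,
                #{<<"content-type">> => <<"text/plain; version=0.0.4">>},
                Req0),
        %% Assumed argument order for format_into/3: registry, fold fun,
        %% initial accumulator.
        prometheus_text_format:format_into(
          default,
          fun(Chunk, ReqAcc) ->
                  ok = cowboy_req:stream_body(Chunk, nofin, ReqAcc),
                  ReqAcc
          end,
          Req),
        cowboy_req:stream_body(<<>>, fin, Req),
        Req.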