
Conversation

@the-mikedavis
Collaborator

`prometheus_text_format:format/1` produces a single binary covering the entire registry. For clusters with many resources, this can lead to large replies from `/metrics/[:registry]`, especially for large registries like `per-object`. Instead of formatting the response and then sending it, we can stream the response by taking advantage of the new `format_into/3` callback (which needs to be added upstream to the `prometheus` dep). This uses `cowboy_req:stream_body/3` to stream the iodata as `prometheus` works through the registry.
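
For illustration, here is a minimal sketch of how the handler side could wire this together, assuming a `format_into/3` that takes the registry, an initial accumulator, and a fun called with each chunk of formatted iodata. The real upstream callback's shape may differ, and the module name is made up:

```erlang
-module(prometheus_stream_sketch).
-export([stream_metrics/2]).

%% Stream the text-format output for Registry chunk by chunk instead of
%% rendering the whole registry into one binary first.
stream_metrics(Registry, Req0) ->
    Headers = #{<<"content-type">> => prometheus_text_format:content_type()},
    Req = cowboy_req:stream_reply(200, Headers, Req0),
    %% Assumed callback shape: the fun receives each chunk of iodata as
    %% prometheus walks the registry, and we hand it straight to Cowboy.
    _ = prometheus_text_format:format_into(
          Registry,
          ok,
          fun(Chunk, Acc) ->
                  ok = cowboy_req:stream_body(Chunk, nofin, Req),
                  Acc
          end),
    %% An empty final chunk marks the end of the streamed body.
    ok = cowboy_req:stream_body(<<>>, fin, Req),
    {ok, Req}.
```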

This should hopefully be a nice memory improvement. The other benefit is that results are sent eagerly. As a stress-testing example:

1. `make run-broker`
2. `rabbitmqctl import_definitions path/to/100k-classic-queues.json`
3. `curl -s localhost:15692/metrics/per-object`

Before this change, `curl` would wait for around 8 seconds and then the entire response would arrive. With this change, the results start streaming in immediately.

Discussed in #14865


Keeping this as a draft as I would like to collect some memory-usage metrics before and after the change.

@the-mikedavis self-assigned this on Nov 3, 2025
mergify[bot] added the `make` label on Nov 3, 2025
@michaelklishin changed the title from "Stream HTTP responses from rabbit_prometheus_handler" to "Optimization: stream HTTP responses from rabbit_prometheus_handler" on Nov 3, 2025
@the-mikedavis
Collaborator Author

Ah welp, when measured this doesn't look as promising as I thought. I have three EC2 instances: one acting as the scraper, and the other two, "galactica" and "kestrel", running single-instance brokers with the `100k-classic-queues.json` definition import. The scraping node runs this script to scrape each node every 2 seconds:

#! /usr/bin/env bash
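# Scrape per-object metrics from the host passed as $1 every ${SLEEP}s, N
# times in total, backgrounding each curl so slow responses overlap.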

N=600
SLEEP=2
for i in $(seq 1 $N)
do
  echo "Sleeping ${SLEEP}s... ($i / $N)"
  sleep $SLEEP
  echo "Ask for metrics from $1... ($i / $N)"
  curl -s "http://$1:15692/metrics/per-object" --output /dev/null &
done

wait

I swapped which node was running which branch, but we can see that this branch consistently has more EC2 instance-wide memory usage rather than less! Galactica:

(screenshot: galactica instance-wide memory usage)

Kestrel:

(screenshot: kestrel instance-wide memory usage)

In the first test (01:03 - 01:23) Galactica runs main and in the second (02:00 - 02:20) Kestrel runs main.

So it looks like this branch is worse for memory usage as-is; I will have to do a bit more digging. It seems like passing the iodata to the Cowboy process might be creating more garbage than writing the data to the `ram_file` port did. We might be able to buffer some of the iodata in the callback, or restructure things in the `prometheus` dep, to improve memory usage.
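
To sketch the buffering idea (not what this branch currently does): the streaming callback could accumulate chunks and only call `cowboy_req:stream_body/3` once a size threshold is crossed. The callback shape and the 1 MB threshold below are assumptions:

```erlang
%% Accumulate formatted chunks until roughly 1 MB is pending, then flush them
%% to Cowboy in a single call. State is {Req, PendingChunks, PendingBytes}.
-define(FLUSH_BYTES, 1 bsl 20).

buffer_chunk(Chunk, {Req, Pending, Bytes}) ->
    Bytes1 = Bytes + iolist_size(Chunk),
    case Bytes1 >= ?FLUSH_BYTES of
        true ->
            ok = cowboy_req:stream_body(lists:reverse([Chunk | Pending]), nofin, Req),
            {Req, [], 0};
        false ->
            {Req, [Chunk | Pending], Bytes1}
    end.

%% Flush whatever is left once prometheus has finished walking the registry.
finish({Req, Pending, _Bytes}) ->
    ok = cowboy_req:stream_body(lists:reverse(Pending), nofin, Req),
    ok = cowboy_req:stream_body(<<>>, fin, Req).
```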

@lhoguin
Contributor

lhoguin commented Nov 4, 2025

How big does the counter get? You may have increased the number of messages to Cowboy drastically, and each message has a cost (the data gets processed and buffered at various steps of the sending process).

@the-mikedavis
Collaborator Author

It's actually not quite as big as I thought: I see 147_161_056 bytes sent across 557 calls to `cowboy_req:stream_body/3`. Looking at the formatting code, for each metric family the prelude (`# TYPE` and `# HELP`) is formatted as an iolist, and then the metrics themselves are all built into one binary, taking advantage of the VM's binary-append optimization. Let me try making an iolist out of the prelude and the metrics binary so we cut the number of `cowboy_req:stream_body/3` calls in half. Also, the binaries must be pretty large; I wonder if `process_flag(fullsweep_after, 0)` could help with the garbage collection here.
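
As a sketch of the iolist idea, assuming a hypothetical per-family emit step with these arguments:

```erlang
%% Current shape (assumed): one stream_body/3 call for the # HELP/# TYPE
%% prelude and a second one for the family's samples binary.
emit_family_split(PreludeIolist, SamplesBin, Req) ->
    ok = cowboy_req:stream_body(PreludeIolist, nofin, Req),
    ok = cowboy_req:stream_body(SamplesBin, nofin, Req).

%% Proposed: wrap both in a single iolist so each metric family costs one
%% message to the Cowboy connection process instead of two.
emit_family(PreludeIolist, SamplesBin, Req) ->
    ok = cowboy_req:stream_body([PreludeIolist, SamplesBin], nofin, Req).
```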

@the-mikedavis
Collaborator Author

Setting `fullsweep_after` does seem to help. Reducing the number of calls to `cowboy_req:stream_body/3` doesn't seem to make a difference. Looking at `observer_cli`, I think what's happening here is that the streaming version simply has more outstanding requests at once (`cowboy_stream_h:request_process/3`), and each has a fairly large process heap (~200-350 MB). It has more outstanding requests because it's slower, and the script scrapes on an interval rather than waiting for the full response.

$ time curl -v http://<streaming>:15692/metrics/per-object --output /dev/null
real	0m7.490s
user	0m0.012s
sys	0m0.056s
$ time curl -v http://<ram_file>:15692/metrics/per-object --output /dev/null
real	0m5.790s
user	0m0.009s
sys	0m0.046s

@the-mikedavis force-pushed the md/prometheus-streaming branch 2 times, most recently from bafd4fc to 988acf3 on November 6, 2025 19:29
@the-mikedavis force-pushed the md/prometheus-streaming branch from 988acf3 to 7de28f3 on November 6, 2025 23:15
@the-mikedavis
Collaborator Author

Ok! I spent some time looking at allocation in `prometheus_text_format:format/1`, and with optimizations there, this change to stream results is looking like a clear improvement now.

Tracking host usage with Prometheus' node_exporter this time, galactica runs this branch and kestrel runs main:

kestrel / baseline (CPU): [Grafana screenshot]

kestrel / baseline (RAM): [Grafana screenshot]

galactica / this branch (CPU): [Grafana screenshot]

galactica / this branch (RAM): [Grafana screenshot]

This uses the same setup as above: scraping per-object metrics every two seconds from single-instance brokers with `100k-classic-queues.json` imported. The brokers run on two m7g.xlarge EC2 instances (4 vCPU ARM, 16 GB RAM) running RabbitMQ via `make` on Erlang/OTP 27. We see `main` pinned at around 95% CPU usage and hovering around 9 GB of peak memory usage. With this change, CPU usage hovers around 60-65% instead, with around 6.5-7.5 GB of peak memory usage.

NOTE! The garbage-reduction improvements to the `prometheus` dep actually make it slightly more desirable not to stream the response: CPU usage instead hovers around 57-61%, with similar memory usage. My recommendation is to stream the response anyway, since applications that scrape RabbitMQ will be able to work through the results gradually rather than handle the entire response at once.

@michaelklishin
Collaborator

So now we have sound double-digit % improvements for both CPU and memory footprint. Awesome!

@the-mikedavis
Collaborator Author

The peak memory footprint is really at the edge of "double digits" if I'm being honest; it's right around 10%, and these instances really have more like 15 GB of memory 😅. I thought we would see big peak-memory improvements here, but it's actually CPU savings instead. Reducing the work the GC needs to do seems to pay off.

Looking at the `tprof` output (see prometheus-erl/prometheus.erl#196), there is more we could do if we were really motivated to optimize these endpoints. We pay a surprisingly high price for the conversions we do with `prometheus_model_helpers`: every record/tuple we allocate with those helpers adds up, to as much as 12%, the second-highest factor in that `tprof` output. At the cost of replacing the `prometheus` dep entirely, we could format directly into binaries or iodata and avoid the intermediate allocations. That would be a fairly large change, though, and wouldn't benefit the other downstream users of the `prometheus` dep the way this change does.
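
For a flavor of what that could look like, here is a hypothetical direct-to-iodata renderer for a single sample (integer values only, no label-value escaping) that skips the intermediate records:

```erlang
%% Render one sample straight to iodata, e.g.
%%   rabbitmq_queue_messages{vhost="/",queue="q1"} 3
%% The metric name, label keys, and label values are assumed to be binaries.
format_sample(Name, LabelPairs, Value) when is_integer(Value) ->
    Labels = lists:join($,, [[K, <<"=\"">>, V, $"] || {K, V} <- LabelPairs]),
    [Name, ${, Labels, <<"} ">>, integer_to_binary(Value), $\n].
```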
