
Conversation

@the-mikedavis
Collaborator

`prometheus_text_format:format/1` produces a single binary covering the entire registry. For clusters with many resources, this can lead to large replies from `/metrics/[:registry]`, especially for large registries like `per-object`. Instead of formatting the response and then sending it, we can stream the response by taking advantage of the new `format_into/3` callback (which needs to be added upstream to the `prometheus` dep). This uses `cowboy_req:stream_body/3` to stream the iodata as `prometheus` works through the registry.
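
For illustration, here is a minimal sketch of how the handler side could wire this together, assuming a `format_into/3` that takes the registry, an initial accumulator, and a fun called with each chunk of formatted iodata. The real upstream callback's shape may differ, and the module name is made up:

```erlang
-module(prometheus_stream_sketch).
-export([stream_metrics/2]).

%% Stream the text-format output for Registry chunk by chunk instead of
%% rendering the whole registry into one binary first.
stream_metrics(Registry, Req0) ->
    Headers = #{<<"content-type">> => prometheus_text_format:content_type()},
    Req = cowboy_req:stream_reply(200, Headers, Req0),
    %% Assumed callback shape: the fun receives each chunk of iodata as
    %% prometheus walks the registry, and we hand it straight to Cowboy.
    _ = prometheus_text_format:format_into(
          Registry,
          ok,
          fun(Chunk, Acc) ->
                  ok = cowboy_req:stream_body(Chunk, nofin, Req),
                  Acc
          end),
    %% An empty final chunk marks the end of the streamed body.
    ok = cowboy_req:stream_body(<<>>, fin, Req),
    {ok, Req}.
```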

This should hopefully be a nice memory improvement. The other benefit is that results are sent eagerly. As a stress-testing example:

1. `make run-broker`
2. `rabbitmqctl import_definitions path/to/100k-classic-queues.json`
3. `curl -s localhost:15692/metrics/per-object`

Before this change, `curl` would wait for around 8 seconds and then the entire response would arrive. With this change, the results start streaming in immediately.

Discussed in #14865


Keeping this as a draft as I would like to collect some memory-usage metrics before and after the change.

@the-mikedavis self-assigned this on Nov 3, 2025
mergify[bot] added the `make` label on Nov 3, 2025
@michaelklishin changed the title from "Stream HTTP responses from rabbit_prometheus_handler" to "Optimization: stream HTTP responses from rabbit_prometheus_handler" on Nov 3, 2025
@the-mikedavis
Collaborator Author

Ah welp, when measured this doesn't look as promising as I thought. I have three EC2 instances: one acting as the scraper, and the other two, "galactica" and "kestrel", running single-instance brokers with the `100k-classic-queues.json` definition import. The scraping node runs this script to scrape each node every 2 seconds:

#! /usr/bin/env bash
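# Scrape per-object metrics from the host passed as $1 every ${SLEEP}s, N
# times in total, backgrounding each curl so slow responses overlap.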

N=600
SLEEP=2
for i in $(seq 1 $N)
do
  echo "Sleeping ${SLEEP}s... ($i / $N)"
  sleep $SLEEP
  echo "Ask for metrics from $1... ($i / $N)"
  curl -s "http://$1:15692/metrics/per-object" --output /dev/null &
done

wait

I swapped which node was running which branch, but we can see that this branch consistently has more EC2 instance-wide memory usage rather than less! Galactica:

(screenshot: galactica instance-wide memory usage)

Kestrel:

(screenshot: kestrel instance-wide memory usage)

In the first test (01:03 - 01:23) Galactica runs main and in the second (02:00 - 02:20) Kestrel runs main.

So it looks like this branch is worse for memory usage as-is; I will have to do a bit more digging. It seems like passing the iodata to the Cowboy process might be creating more garbage than writing the data to the `ram_file` port did. We might be able to buffer some of the iodata in the callback, or restructure things in the `prometheus` dep, to improve memory usage.
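
To sketch the buffering idea (not what this branch currently does): the streaming callback could accumulate chunks and only call `cowboy_req:stream_body/3` once a size threshold is crossed. The callback shape and the 1 MB threshold below are assumptions:

```erlang
%% Accumulate formatted chunks until roughly 1 MB is pending, then flush them
%% to Cowboy in a single call. State is {Req, PendingChunks, PendingBytes}.
-define(FLUSH_BYTES, 1 bsl 20).

buffer_chunk(Chunk, {Req, Pending, Bytes}) ->
    Bytes1 = Bytes + iolist_size(Chunk),
    case Bytes1 >= ?FLUSH_BYTES of
        true ->
            ok = cowboy_req:stream_body(lists:reverse([Chunk | Pending]), nofin, Req),
            {Req, [], 0};
        false ->
            {Req, [Chunk | Pending], Bytes1}
    end.

%% Flush whatever is left once prometheus has finished walking the registry.
finish({Req, Pending, _Bytes}) ->
    ok = cowboy_req:stream_body(lists:reverse(Pending), nofin, Req),
    ok = cowboy_req:stream_body(<<>>, fin, Req).
```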

@lhoguin
Contributor

lhoguin commented Nov 4, 2025

How big does the counter get? You may have increased the number of messages to Cowboy drastically, and each message has a cost (the data gets processed and buffered at various steps of the sending process).

@the-mikedavis
Collaborator Author

It's actually not quite as big as I thought: I see 147_161_056 bytes sent across 557 calls to `cowboy_req:stream_body/3`. Looking at the formatting code, for each metric family the prelude (`# TYPE` and `# HELP`) is formatted as an iolist, and then the metrics themselves are all built into one binary, taking advantage of the VM's binary-append optimization. Let me try making an iolist out of the prelude and the metrics binary so we cut the number of `cowboy_req:stream_body/3` calls in half. Also, the binaries must be pretty large; I wonder if `process_flag(fullsweep_after, 0)` could help with the garbage collection here.
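
As a sketch of the iolist idea, assuming a hypothetical per-family emit step with these arguments:

```erlang
%% Current shape (assumed): one stream_body/3 call for the # HELP/# TYPE
%% prelude and a second one for the family's samples binary.
emit_family_split(PreludeIolist, SamplesBin, Req) ->
    ok = cowboy_req:stream_body(PreludeIolist, nofin, Req),
    ok = cowboy_req:stream_body(SamplesBin, nofin, Req).

%% Proposed: wrap both in a single iolist so each metric family costs one
%% message to the Cowboy connection process instead of two.
emit_family(PreludeIolist, SamplesBin, Req) ->
    ok = cowboy_req:stream_body([PreludeIolist, SamplesBin], nofin, Req).
```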

@the-mikedavis
Collaborator Author

Setting `fullsweep_after` does seem to help. Reducing the number of calls to `cowboy_req:stream_body/3` doesn't seem to make a difference. Looking at `observer_cli`, I think what's happening here is that the streaming version simply has more outstanding requests at once (`cowboy_stream_h:request_process/3`), and each has a fairly large process heap (~200-350 MB). It has more outstanding requests because it's slower, and the script scrapes on an interval rather than waiting for the full response.

$ time curl -v http://<streaming>:15692/metrics/per-object --output /dev/null
real	0m7.490s
user	0m0.012s
sys	0m0.056s
$ time curl -v http://<ram_file>:15692/metrics/per-object --output /dev/null
real	0m5.790s
user	0m0.009s
sys	0m0.046s

@the-mikedavis force-pushed the md/prometheus-streaming branch 2 times, most recently from bafd4fc to 988acf3 on November 6, 2025 19:29
@the-mikedavis force-pushed the md/prometheus-streaming branch from 988acf3 to 7de28f3 on November 6, 2025 23:15
@the-mikedavis
Collaborator Author

Ok! I spent some time looking at allocation in `prometheus_text_format:format/1`, and with optimizations there, this change to stream results is looking like a clear improvement now.

Tracking host usage with Prometheus' node_exporter this time, galactica runs this branch and kestrel runs main:

kestrel / baseline (CPU): [Grafana screenshot]

kestrel / baseline (RAM): [Grafana screenshot]

galactica / this branch (CPU): [Grafana screenshot]

galactica / this branch (RAM): [Grafana screenshot]

This uses the same setup as above: scraping per-object metrics every two seconds from single-instance brokers with `100k-classic-queues.json` imported. The brokers run on two m7g.xlarge EC2 instances (4 vCPU ARM, 16 GB RAM) running RabbitMQ via `make` on Erlang/OTP 27. We see `main` pinned at around 95% CPU usage and hovering around 9 GB of peak memory usage. With this change, CPU usage hovers around 60-65% instead, with around 6.5-7.5 GB of peak memory usage.

NOTE! The garbage-reduction improvements to the `prometheus` dep actually make it slightly more desirable not to stream the response: CPU usage instead hovers around 57-61%, with similar memory usage. My recommendation is to stream the response anyway, since applications that scrape RabbitMQ will be able to work through the results gradually rather than handle the entire response at once.

@michaelklishin
Collaborator

So now we have sound double-digit % improvements for both CPU and memory footprint. Awesome!

@the-mikedavis
Collaborator Author

The peak memory footprint is really at the edge of "double digits" if I'm being honest; it's right around 10%, and these instances really have more like 15 GB of memory 😅. I thought we would see big peak-memory improvements here, but it's actually CPU savings instead. Reducing the work the GC needs to do seems to pay off.

Looking at the `tprof` output (see prometheus-erl/prometheus.erl#196), there is more we could do if we were really motivated to optimize these endpoints. We pay a surprisingly high price for the conversions we do with `prometheus_model_helpers`: every record/tuple we allocate with those helpers adds up, to as much as 12%, the second-highest factor in that `tprof` output. At the cost of replacing the `prometheus` dep entirely, we could format directly into binaries or iodata and avoid the intermediate allocations. That would be a fairly large change, though, and wouldn't benefit the other downstream users of the `prometheus` dep the way this change does.
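
For a flavor of what that could look like, here is a hypothetical direct-to-iodata renderer for a single sample (integer values only, no label-value escaping) that skips the intermediate records:

```erlang
%% Render one sample straight to iodata, e.g.
%%   rabbitmq_queue_messages{vhost="/",queue="q1"} 3
%% The metric name, label keys, and label values are assumed to be binaries.
format_sample(Name, LabelPairs, Value) when is_integer(Value) ->
    Labels = lists:join($,, [[K, <<"=\"">>, V, $"] || {K, V} <- LabelPairs]),
    [Name, ${, Labels, <<"} ">>, integer_to_binary(Value), $\n].
```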
