
Conversation

@ansd (Member) commented Sep 12, 2025

This commit makes read operations for the following Khepri projections much cheaper:

  • rabbit_khepri_queue
  • rabbit_khepri_exchange
  • rabbit_khepri_index_route
  • rabbit_khepri_topic_trie

Entries in these ETS tables are read for every message entering RabbitMQ. Some messages even cause multiple reads from these tables, e.g. multiple reads from rabbit_khepri_queue if a message is routed to more than one queue, or multiple reads from rabbit_khepri_index_route if a message has multiple routing keys.

On a busy RabbitMQ node, these tables are read concurrently (on multiple physical processors) hundreds of thousands of times per second.
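For context, these projections are backed by plain ETS tables, and `read_concurrency` is a standard ETS tuning option. A minimal sketch of what the option does (the table name and options here are illustrative, not the actual Khepri projection setup):

```erlang
%% Illustrative only: Khepri creates the real projection tables itself.
%% read_concurrency optimizes the table for frequent concurrent reads,
%% at the cost of more expensive writes and some extra memory.
Tab = ets:new(example_projection,
              [set, public, {read_concurrency, true}]),
true = ets:info(Tab, read_concurrency).
```

The trade-off is exactly the one discussed below: reads from many schedulers in parallel get cheaper, while switching between read and write bursts (e.g. during queue or binding churn) gets more expensive.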

@ansd ansd added this to the 4.2.0 milestone Sep 12, 2025
@ansd ansd self-assigned this Sep 12, 2025
@ansd ansd added khepri Khepri as the metadata store performance backport-v4.2.x labels Sep 12, 2025
@dumbbell (Collaborator)

Did you measure the impact?

Other than that, the patch looks good to me.

@the-mikedavis (Collaborator)

I agree, setting read_concurrency sounds good to me for these tables. I wonder if it would worsen performance when there is queue or binding churn? And if it does would it matter anyways (since churn is already bad)?

@ansd (Member, Author) commented Sep 12, 2025

Did you measure the impact?

When I start the broker as follows:

make run-broker LEAVE_PLUGINS_DISABLED=1

and generate load as follows:

java -jar target/perf-test.jar -x 8 -y 1 -s 12 -z 60 --autoack

I get around

id: test-144220-533, sending rate avg: 166958 msg/s
id: test-144220-533, receiving rate avg: 166958 msg/s

prior to this PR and around

id: test-144028-283, sending rate avg: 170046 msg/s
id: test-144028-283, receiving rate avg: 170042 msg/s

after this PR.

But there is quite high variability across the different test runs.

The read_concurrency option is about contention. There are better workloads to trigger this contention, for example on a machine with many more CPU cores (e.g. 16 or more) and many more channels. But I haven't taken the time to investigate that deeply.

From the few tests I did, performance seems to get slightly better.

I wonder if it would worsen performance when there is queue or binding churn? And if it does would it matter anyways (since churn is already bad)?

It might well worsen performance for queue or binding churn; I haven't taken the time to test it. I would argue it's more important to optimise RabbitMQ for high message throughput than for creation and deletion of queues :)

I leave it up to both of you whether to merge or not :)

@ansd (Member, Author) commented Sep 12, 2025

FWIW we set this option for the routing table in Mnesia:

{storage_properties, [{ets, [{read_concurrency, true}]}]},

@the-mikedavis (Collaborator) left a review:

I tried a bunch of perf-test scenarios and this branch comes out ahead by just a little bit in all of them. I don't think there are enough ETS requests in basic publishing scenarios for the difference to be perceptible. But I saw a big boost with a perf-test command I had in my notes from working on the topic routing trie in Khepri a while back, which specifically stresses ETS. The more components in the routing key, the more pronounced the effect, because each component means another ETS call to traverse the topic trie graph. Also, my desktop has a grotesque number of cores (64), so I was able to throw some high concurrency at it.
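That per-component traversal can be sketched roughly like this (a hypothetical trie layout for illustration, not the actual rabbit_khepri_topic_trie schema):

```erlang
%% Illustrative sketch: one ETS lookup per routing-key component while
%% walking the trie, so "a.b.c.d.e.f.g" costs seven lookups per match.
Trie = ets:new(example_trie, [set, public, {read_concurrency, true}]),
ets:insert(Trie, [{{root, <<"a">>}, n1},
                  {{n1, <<"b">>}, n2}]),
Walk = fun Walk(_T, Node, []) -> {ok, Node};
           Walk(T, Node, [C | Rest]) ->
               case ets:lookup(T, {Node, C}) of
                   [{_, Next}] -> Walk(T, Next, Rest);
                   [] -> no_match
               end
       end,
{ok, n2} = Walk(Trie, root, [<<"a">>, <<"b">>]).
```

With 64 publishers hammering lookups like these concurrently, read_concurrency's per-scheduler read locks are exactly what the benchmark below exercises.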

perf-test -p -e amq.topic -t topic -qp q.%d.a.b.c.d.e.f.g -qpf 1 -qpt 64 -x 64 -y 64 -c 3000 -z 60 -s 12

main:

id: test-135258-593, sending rate avg: 366207 msg/s
id: test-135258-593, receiving rate avg: 366319 msg/s

this PR:

id: test-135750-042, sending rate avg: 426767 msg/s
id: test-135750-042, receiving rate avg: 426836 msg/s

Not too shabby, just a 16.5% improvement! 🙂

@michaelklishin michaelklishin modified the milestones: 4.2.0, 4.3.0 Sep 12, 2025
@michaelklishin michaelklishin merged commit c85ba2a into main Sep 12, 2025
287 checks passed
@michaelklishin michaelklishin deleted the khepri-read-concurrency branch September 12, 2025 19:40
michaelklishin added a commit that referenced this pull request Sep 12, 2025
Optimise Khepri for concurrent reads (backport #14530)
