Conversation

@SimonUnge
Collaborator

@SimonUnge SimonUnge commented May 20, 2024

Proposed Changes

Let quorum queues run rabbit_quorum_queue:repair_amqqueue_nodes on every tick to repair any membership discrepancy between the Ra cluster members and the members stored in the amqqueue record.

See #11029
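
As a rough illustration of the check described above, here is a minimal sketch of comparing the two views of membership. It is not the code in this PR: the module name qq_membership_sketch and both functions are hypothetical, and the only assumption carried over from Ra is that a member is identified by a {Name, Node} tuple.

```erlang
-module(qq_membership_sketch).
-export([membership_diff/2, needs_repair/2]).

%% Hypothetical helper, for illustration only.
%% RaMembers:     [{Name, Node}] as reported by the Ra cluster.
%% RecordedNodes: [Node] as stored in the queue's schema database record.
-spec membership_diff([{atom(), node()}], [node()]) -> {[node()], [node()]}.
membership_diff(RaMembers, RecordedNodes) ->
    RaNodes = lists:usort([Node || {_Name, Node} <- RaMembers]),
    Recorded = lists:usort(RecordedNodes),
    %% First element: nodes Ra knows about that the record lacks.
    %% Second element: nodes in the record that Ra no longer has.
    {RaNodes -- Recorded, Recorded -- RaNodes}.

%% A repair is only needed when the two views differ.
-spec needs_repair([{atom(), node()}], [node()]) -> boolean().
needs_repair(RaMembers, RecordedNodes) ->
    membership_diff(RaMembers, RecordedNodes) =/= {[], []}.
```

In the common case the two lists match, so the tick only pays for the comparison and no record update is written.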

Types of Changes

What types of changes does your code introduce to this project?
Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes issue #NNNN)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause an observable behavior change in existing systems)
  • Documentation improvements (corrections, new content, etc)
  • Cosmetic change (whitespace, formatting, etc)
  • Build system and/or CI

Checklist

Put an x in the boxes that apply.
You can also fill these out after creating the PR.
If you're unsure about any of them, don't hesitate to ask on the mailing list.
We're here to help!
This is simply a reminder of what we are going to look for before merging your code.

  • I have read the CONTRIBUTING.md document
  • I have signed the CA (see https://cla.pivotal.io/sign/rabbitmq)
  • I have added tests that prove my fix is effective or that my feature works
  • All tests pass locally with my changes
  • If relevant, I have added necessary documentation to https://github.com/rabbitmq/rabbitmq-website
  • If relevant, I have added this change to the first version(s) in release-notes that I expect to introduce it

Further Comments

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution
you did and what alternatives you considered, etc.

@SimonUnge SimonUnge requested a review from kjnilsson May 20, 2024 20:51
@SimonUnge SimonUnge self-assigned this May 20, 2024
@michaelklishin michaelklishin changed the title QQ check if amqqueue record needs update on tick. QQs: check if schema database record needs an update on tick May 20, 2024
@michaelklishin
Collaborator

This can introduce a lot of CPU burn in clusters with thousands of quorum queues. Has the CPU impact of this been measured?

@lukebakken
Collaborator

Relevant: #7863

@lukebakken lukebakken force-pushed the qq_repair_amqqueue_on_tick branch from ce90f5d to 204900f Compare May 21, 2024 00:27
@SimonUnge
Collaborator Author

> This can introduce a lot of CPU burn in clusters with thousands of quorum queues. Has the CPU impact of this been measured?

Nope, but I will run some benchmark tests before final review!

@SimonUnge
Collaborator Author

@kjnilsson Looking at repair_amqqueue_nodes, the part that could be a bit costly is the ra:members(Leader) call. Could the members perhaps be provided in the tick from Ra itself?

(There are of course costs in actually repairing too, but I would assume that a queue actually needing repair is unusual, and a lot of queues needing it even more so.)

@kjnilsson
Contributor

> @kjnilsson Looking at repair_amqqueue_nodes, the part that could be a bit costly is the ra:members(Leader) call. Could the members perhaps be provided in the tick from Ra itself?
>
> (There are of course costs in actually repairing too, but I would assume that a queue actually needing repair is unusual, and a lot of queues needing it even more so.)

The ra_leaderboard should have the current members which you can use to avoid the ra:members/1 call back into the process.

@SimonUnge
Collaborator Author

> > @kjnilsson Looking at repair_amqqueue_nodes, the part that could be a bit costly is the ra:members(Leader) call. Could the members perhaps be provided in the tick from Ra itself?
> > (There are of course costs in actually repairing too, but I would assume that a queue actually needing repair is unusual, and a lot of queues needing it even more so.)
>
> The ra_leaderboard should have the current members which you can use to avoid the ra:members/1 call back into the process.

Aha, I'll update the function to use that one instead.
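
To make the suggestion concrete, here is a hedged sketch of reading members from the leaderboard instead of calling back into the Ra server process. The module leaderboard_sketch and its functions are hypothetical. The sketch assumes that ra_leaderboard:lookup_members/1 returns either undefined or a list of {Name, Node} member ids, and the lookup is passed in as a fun so the logic can be exercised without a running Ra system.

```erlang
-module(leaderboard_sketch).
-export([member_nodes/1, member_nodes/2]).

%% Default: read from the local Ra leaderboard (assumes the ra
%% application is running on this node).
member_nodes(ClusterName) ->
    member_nodes(ClusterName, fun ra_leaderboard:lookup_members/1).

%% Lookup is any fun returning undefined | [{Name, Node}].
member_nodes(ClusterName, Lookup) when is_function(Lookup, 1) ->
    case Lookup(ClusterName) of
        undefined ->
            %% Nothing recorded locally yet, so skip this tick.
            {error, not_found};
        Members when is_list(Members) ->
            {ok, lists:usort([Node || {_Name, Node} <- Members])}
    end.
```

The design point is that the tick handler only does a local read and never makes a synchronous call into the process that is itself handling the tick.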

@SimonUnge SimonUnge marked this pull request as ready for review May 24, 2024 22:00
@SimonUnge
Collaborator Author

@michaelklishin
Michal ran 10k QQs on a 3-node cluster (with all 10k queue leaders on the same node) and produced some flame graphs. He also taught me how to do it for next time. Much obliged, thank you very much @mkuratczyk!

With the addition of rabbit_quorum_queue:repair_amqqueue_nodes, and the change to call ra_leaderboard:lookup_members, the graphs show that 0.53% of CPU time is spent in rabbit_quorum_queue:repair_amqqueue_nodes, 0.42% spent in ra_leaderboard:lookup_members.

[image: flame graph]

@michaelklishin
Collaborator

In other words, with the more efficient Ra leaderboard-based implementation we waste about 1% of the CPU with 10K quorum queues.

@SimonUnge
Collaborator Author

SimonUnge commented May 28, 2024

> In other words, with the more efficient Ra leaderboard-based implementation we waste about 1% of the CPU with 10K quorum queues.

I worded that poorly: in total it is about 0.5% CPU with 10k QQs (the 0.42% is part of the 0.53%).

@SimonUnge SimonUnge force-pushed the qq_repair_amqqueue_on_tick branch from 15c80f7 to 83a0eed Compare May 29, 2024 20:17
@michaelklishin michaelklishin merged commit bd111f0 into rabbitmq:main Jun 3, 2024
@michaelklishin
Collaborator

@Mergifyio backport v3.13.x

@mergify

mergify bot commented Jun 3, 2024

backport v3.13.x

✅ Backports have been created

michaelklishin added a commit that referenced this pull request Jun 4, 2024
QQs: check if schema database record needs an update on tick (backport #11278)

4 participants