Anti-affinity for queues #9679

0UserName · 2023-10-11T11:48:18Z

0UserName
Oct 11, 2023

Given: an RMQ (v3.12.0, if it matters) cluster consisting of 3 nodes.

There are several services. Each service works with a set of N quorum queues (QQ). The volumes processed vary from service to service, i.e. some queues receive larger messages (and in greater quantities) than others.

Problem:

If a service that processes large amounts of data stops reading messages, this can lead to an overflow of the entire RMQ and a crash of the entire system.

Solution:

In order to reduce the blast radius and protect services from each other, the idea arose to implement something like anti-affinity rules: use x-quorum-initial-group-size for queues and try to bind them to certain cluster nodes (via something like #8532, which I suppose needs updating), so that queues of service A are located on nodes 0 to 2, B on nodes 3 to 5, and C on nodes 6 to 8. In this case the system continues to operate within one cluster, and an overflow of the queues of one group does not affect the state of the entire cluster or the other groups. However, this will require adding 6 more nodes to the cluster.
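To make the proposed layout concrete, here is a minimal Python sketch of the node grouping described above. The node names and the helper functions `partition_nodes` and `quorum_queue_arguments` are hypothetical and exist only for illustration. The `x-queue-type` and `x-quorum-initial-group-size` keys are real optional queue arguments, but the latter only sets how many members a queue starts with. Choosing which nodes host them is the part that would need something like #8532.

```python
from typing import Dict, List


def partition_nodes(nodes: List[str], services: List[str], group_size: int) -> Dict[str, List[str]]:
    """Assign each service a disjoint, consecutive slice of cluster nodes."""
    needed = group_size * len(services)
    if len(nodes) < needed:
        raise ValueError(f"need {needed} nodes, have {len(nodes)}")
    return {
        service: nodes[i * group_size:(i + 1) * group_size]
        for i, service in enumerate(services)
    }


def quorum_queue_arguments(group_size: int) -> Dict[str, object]:
    """Optional arguments to pass when declaring a quorum queue.

    This controls the initial number of members only, not their placement.
    """
    return {
        "x-queue-type": "quorum",
        "x-quorum-initial-group-size": group_size,
    }


# Hypothetical node names for a 9-node cluster (3 existing + 6 added).
nodes = [f"rabbit@node-{n}" for n in range(9)]
plan = partition_nodes(nodes, ["A", "B", "C"], group_size=3)
args = quorum_queue_arguments(3)

for service, members in plan.items():
    print(service, members)
```

With three services and a group size of 3, the plan needs 9 nodes, which matches the "6 more nodes" estimate for a cluster that currently has 3.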

Actually, I would like to hear thoughts on this matter: to what extent would such a use of RMQ be reasonable?

Let's assume this is a working solution, then a question arises based on this discussion: #7209

What pitfalls may arise in a scenario where the nodes serving a service fail: will the queues located on them automatically be promoted to other nodes of the cluster? Will it be possible to disable promotion per queue (if not, then there can be no talk of any anti-affinity)?

mkuratczyk · 2023-10-11T12:43:56Z

mkuratczyk
Oct 11, 2023
Maintainer

Please start by explaining what you mean by "If a service that processes large amounts of data stops reading messages, this can lead to an overflow of the entire RMQ and a crash of the entire system." Do you simply mean that if a queue keeps growing, the system will run out of memory (because QQs keep an in-memory index of messages)? If so, then setting the maximum queue length is a much simpler solution.
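As a rough illustration of the queue length suggestion, the sketch below builds a policy body of the kind the RabbitMQ management HTTP API accepts at `PUT /api/policies/{vhost}/{name}`. The helper names, the queue name pattern, and the numeric limit are assumptions made up for this example, and nothing is sent to a broker. The `max-length`, `max-length-bytes`, and `overflow` keys are standard policy keys.

```python
import json
from typing import Optional
from urllib.parse import quote


def queue_limit_policy(pattern: str,
                       max_length: Optional[int] = None,
                       max_length_bytes: Optional[int] = None,
                       overflow: str = "reject-publish") -> dict:
    """Build a policy body that caps matching queues by message count and/or bytes."""
    if max_length is None and max_length_bytes is None:
        raise ValueError("set max_length, max_length_bytes, or both")
    definition = {"overflow": overflow}
    if max_length is not None:
        definition["max-length"] = max_length
    if max_length_bytes is not None:
        definition["max-length-bytes"] = max_length_bytes
    return {
        "pattern": pattern,
        "definition": definition,
        "apply-to": "queues",
        "priority": 0,
    }


def policy_path(vhost: str, name: str) -> str:
    """Management API path for a policy; the vhost must be URL-encoded."""
    return f"/api/policies/{quote(vhost, safe='')}/{quote(name, safe='')}"


# Hypothetical naming scheme: all queues of service A start with "service-a.".
body = queue_limit_policy(r"^service-a\.", max_length=100000)
path = policy_path("/", "service-a-limit")
print(path)
print(json.dumps(body, indent=2))
```

Using `reject-publish` pushes back on publishers once the limit is reached, while `drop-head` discards the oldest messages instead. Which one fits depends on whether losing old messages is acceptable.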

4 replies

0UserName Oct 11, 2023
Author

Please start by explaining what you mean by If a service that processes large amounts of data stops reading messages, this can lead to an overflow of the entire RMQ and a crash of the entire system.

I mean the situation where RMQ crashes due to a lack of RAM or disk space.

If so, then setting the maximum queue length is a much simpler solution.

The difficulty is that both the size and the number of messages vary over time, depending on external factors. Additionally, even with per-queue limits, the cumulative number of messages across multiple services, whose number also varies, can still lead to a problem. In this case the limits would need to be revised relatively often.

I was probably not entirely clear about the number of services. There are more services than A, B, C. We have groups of services working with N queues. It seems to me that it makes sense to isolate groups from each other.

But I'm not sure that what I'm suggesting uses the mentioned tools (RMQ itself, QQ with the mentioned options, the mentioned API) in the right way.

mkuratczyk Oct 11, 2023
Maintainer

Thanks. Still, there are additional safeguards that already exist:

1. Memory/disk limits should block the publishers if the server is running out of memory/disk.
2. Queue length limits can also be expressed in bytes (that's still per-queue, but the total is covered by a disk limit).
3. Prometheus metrics combined with PromQL's predict_linear function can warn about the problem before it even happens.
4. Not to mention that your consumer app should also be monitored and trigger an alert if it's not consuming messages.
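To show the idea behind the predict_linear suggestion, here is a small Python sketch that fits a least-squares line through recent samples and extrapolates it. It approximates what the PromQL function does and is not a reimplementation: Prometheus extrapolates from the evaluation time, while this sketch extrapolates from the last sample. The sample values are invented. In PromQL the alert expression would be along the lines of `predict_linear(rabbitmq_disk_space_available_bytes[1h], 4 * 3600) < 0`, assuming that metric name is what your rabbitmq_prometheus version exposes.

```python
from typing import List, Tuple


def predict_linear(samples: List[Tuple[float, float]], seconds_ahead: float) -> float:
    """Fit a least-squares line through (timestamp, value) samples and
    return its value seconds_ahead after the last sample."""
    n = len(samples)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_t = sum(t for t, _ in samples) / n
    mean_v = sum(v for _, v in samples) / n
    var_t = sum((t - mean_t) ** 2 for t, _ in samples)
    if var_t == 0:
        raise ValueError("samples must have distinct timestamps")
    slope = sum((t - mean_t) * (v - mean_v) for t, v in samples) / var_t
    intercept = mean_v - slope * mean_t
    last_t = samples[-1][0]
    return slope * (last_t + seconds_ahead) + intercept


# Invented data: free disk space shrinking by about 1 GB per minute.
free_disk = [(0.0, 10e9), (60.0, 9e9), (120.0, 8e9), (180.0, 7e9)]
predicted = predict_linear(free_disk, seconds_ahead=600)

if predicted < 0:
    print("warning: disk is predicted to fill up within 10 minutes")
```

The point of alerting on the prediction rather than on the current value is that the warning fires while there is still free space left to react.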

0UserName Oct 11, 2023
Author

Well, points 3 and 4 don’t make much sense in the case of a serious outage on the part of the applications.

In other words, you don’t see the point in the approach I described?

mkuratczyk Oct 11, 2023
Maintainer

They don't make sense in case of the outage but they could prevent the outage in the first place. :)

Let's see what others think. Personally I feel like it's a complex feature with many corner cases. For example, if you actually lose, say, 3 out of 6 nodes, wouldn't you want the queue to be restarted on the remaining nodes? If so, that means ignoring affinity rules in some cases. It also raises the question of what to do when the nodes come back. You can have a look at Kubernetes pod affinity rules to see how many options there are, and that still doesn't satisfy all users' needs.

I'd personally focus on how to prevent the crash in the first place. There are multiple mechanisms I already mentioned. And if you really want to isolate queues by nodes, you can run multiple clusters for complete isolation.
