-
Thanks for the report. I used your steps to try out some scenarios.

Scenario 1
During publishing, I can see memory use of each node grow, as expected, because there is sufficient RAM to keep the messages in memory without triggering paging to disk. After clearing the policy, …

Scenario 2
Same as scenario 1, with … During publishing I can see memory stay well below the high watermark as RabbitMQ pages messages to disk. When publishing completes, almost all of the messages are paged to disk (only about 10-20 MiB remain in RAM) and memory usage is between 300 and 400 MiB per node. When I clear the policy, nothing exciting happens at all, and I can see that the queues are fully "lazy", i.e. only 2K messages in RAM.

So it seems that I can't immediately reproduce this issue. I will return to it tomorrow using RabbitMQ 3.9.20. What version of Erlang are you using?
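For anyone following along, the per-queue RAM figures above can be checked with commands along these lines (the node name matches the commands used elsewhere in this thread; adjust -n for each node):

# How many messages each queue holds in total, how many are kept in RAM,
# and how much memory the queue process itself uses.
rabbitmqctl -n node1@localhost list_queues name messages messages_ram memory

# Per-category memory breakdown for a node (queue processes, binaries, ETS, ...).
rabbitmq-diagnostics -n node1@localhost memory_breakdown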
-
Thanks for looking into this. To reproduce, I used:
…

I'll also add that I'm reproducing this using …

Edit: it may also be worth increasing the number of messages. I found it was fairly reliably reproducible at 1 million messages, but the fewer the messages, the less reliable it becomes.
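For a rough sense of the load involved, a PerfTest invocation along these lines publishes one million persistent messages to a single queue; the URI, queue name, and message size are assumptions here, not the exact repro scripts:

# Publish 1,000,000 persistent 1 KiB messages to queue "q1", with no consumers attached.
bin/runjava com.rabbitmq.perf.PerfTest \
  --uri amqp://localhost:5672 \
  --queue q1 --producers 1 --consumers 0 \
  --pmessages 1000000 --size 1024 --flag persistent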
-
Thanks for the repro and scripts. I was able to reproduce the issue again using your repo (with some adjustments). I reproduced using only 1 million messages and 3 nodes. The adjustments are:
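For anyone who wants to follow along, a three-node cluster on a single machine can be stood up roughly as follows; the ports and the detached start are assumptions rather than the exact setup from the repro repo:

# Start three local nodes on distinct AMQP ports (distribution ports are derived from these).
RABBITMQ_NODENAME=node1@localhost RABBITMQ_NODE_PORT=5672 rabbitmq-server -detached
RABBITMQ_NODENAME=node2@localhost RABBITMQ_NODE_PORT=5673 rabbitmq-server -detached
RABBITMQ_NODENAME=node3@localhost RABBITMQ_NODE_PORT=5674 rabbitmq-server -detached

# Join node2 and node3 to node1 to form the cluster.
for n in node2 node3; do
  rabbitmqctl -n "$n@localhost" stop_app
  rabbitmqctl -n "$n@localhost" join_cluster node1@localhost
  rabbitmqctl -n "$n@localhost" start_app
done

If the management plugin is enabled, each node also needs a distinct management listener port.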
-
Hi there,
I am experiencing a stuck classic mirrored queue on my test RabbitMQ 3.9.20 cluster due to a change in the queue's effective policy. While stuck, the queue blocks both publishing and consumption of messages and is listed as unresponsive by rabbitmqctl list_unresponsive_queues. Otherwise, I can still query the queue metadata from the management API, which reports the queue as being in the "running" state.

When this happens, scheduler CPU utilization increases and stays elevated. Observing rabbitmq_top, it appears that RabbitMQ is doing a high number of reductions on the queue's mirror_queue_slave process and is unable to complete rabbit_variable_queue:convert_to_lazy.

Reproduction
It can be reproduced by deleting the policy in use by a queue, causing the queue to fall back to a lower-priority policy that has the lazy queue mode enabled:
rabbitmqctl -n node1@localhost set_policy policy-0 ".*" '{"ha-mode":"all", "ha-sync-mode": "automatic", "queue-mode": "lazy"}' --priority 0
rabbitmqctl -n node1@localhost set_policy policy-1 ".*" '{"ha-mode":"all", "ha-sync-mode": "automatic"}' --priority 1
rabbitmqctl -n node1@localhost clear_policy policy-1
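After the clear_policy step, the stuck state described above can be observed with checks along these lines (illustrative commands matching the symptoms in this report):

# Queues the node considers unresponsive; the stuck queue shows up here.
rabbitmqctl -n node1@localhost list_unresponsive_queues

# Per-queue backlog and state as seen by rabbitmqctl; the management API may
# still report the same queue as "running".
rabbitmqctl -n node1@localhost list_queues name messages state

# A top-like view of Erlang processes by reductions (a CLI alternative to the
# rabbitmq_top plugin), useful for spotting the busy mirror_queue_slave process.
rabbitmq-diagnostics -n node1@localhost observer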
Node1 configuration:
Node2 configuration: