HTTP API get '/queues' response delay up to 40+seconds when 1 of 3 cluster nodes goes down (quorum queues) #6048

fare1990 · 2022-10-07T16:11:57Z

RabbitMQ 3.10.6, Erlang 23.3.4.7. 3-node cluster, quorum queues.

I'm facing an issue with a RabbitMQ cluster (3 nodes) that uses quorum queues and the management plugin's REST API.

When all nodes are up and running, a GET request for the queue list (GET /api/queues) takes at most 300 ms.

If we shut down one of the cluster nodes completely (including the OS), the same request starts to hang for up to 40 seconds.
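
For anyone who wants to put numbers on the delay, here is a minimal timing sketch using only the Python standard library. The `guest`/`guest` credentials, the helper names and the 120-second client timeout are assumptions for illustration, not values from this report; port 15672 matches the config below.

```python
import base64
import time
import urllib.request


def build_queues_request(host, port=15672, user="guest", password="guest"):
    """Build (but do not send) an authenticated GET /api/queues request."""
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req = urllib.request.Request(f"http://{host}:{port}/api/queues", method="GET")
    req.add_header("Authorization", f"Basic {token}")
    return req


def timed_call(fn):
    """Run fn() and return (result, elapsed_seconds)."""
    start = time.monotonic()
    result = fn()
    return result, time.monotonic() - start


def measure_queue_listing(host, timeout=120):
    """Send the request and return (response_size_in_bytes, elapsed_seconds)."""
    req = build_queues_request(host)

    def fetch():
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.read()

    body, elapsed = timed_call(fetch)
    return len(body), elapsed
```

Calling `measure_queue_listing("127.0.0.1")` on a healthy node and again after taking a peer down should make the difference between the two cases visible.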

After a short investigation, I figured out the following:

Only the HTTP 'api/queues' endpoint is affected; AMQP operations continue to work well. It looks like other HTTP endpoints that do not perform queue metrics collection work well too.
The issue happens when EPMD port 4369 on a specific node becomes unreachable for the other cluster members.
The issue does not reproduce with classic mirrored queues, only with quorum queues.
Playing with configuration flags to disable metrics gathering and with the tick time had no effect.

Steps to reproduce:

1) Configure a cluster with 3 nodes and enable the management plugin on all of them.
Config example:

cluster_name                = somename
cluster_partition_handling  = pause_minority
cluster_formation.node_type = disc

cluster_formation.classic_config.nodes.1 = [email protected]
cluster_formation.classic_config.nodes.2 = [email protected]
cluster_formation.classic_config.nodes.3 = [email protected]

ssl_options.cacertfile           = /somefolder/ssl/ca.crt
ssl_options.certfile             =  /somefolder/ssl/server.crt
ssl_options.keyfile              =  /somefolder/ssl/server.key
ssl_options.verify               = verify_peer
ssl_options.fail_if_no_peer_cert = true

ssl_options.versions.1 = tlsv1.3
ssl_options.ciphers.1  = TLS_AES_256_GCM_SHA384
ssl_options.ciphers.2  = TLS_AES_128_GCM_SHA256
ssl_options.ciphers.3  = TLS_CHACHA20_POLY1305_SHA256
ssl_options.ciphers.4  = TLS_AES_128_CCM_SHA256
ssl_options.ciphers.5  = TLS_AES_128_CCM_8_SHA256

vm_memory_high_watermark.absolute = 2048MiB

net_ticktime = 60
consumer_timeout = 43200000

management.tcp.port = 15672
management.tcp.ip   = 127.0.0.1

RABBITMQ_USE_LONGNAME was set to true.
2) Create some durable quorum queues, for example by setting the x-queue-type=quorum argument.

3) Stop the OS on one of the nodes (zmpha-wfan-10024-master1.wflab.io, for example) and perform an HTTP GET request against any other node.
Alternative: stop the RabbitMQ server on the node (everything still works well at this point), then block traffic to port 4369 on this node (iptables -A INPUT -p tcp -s zmpha-wfan-10024-master2.wflab.io --dport 4369 -j DROP && iptables -A INPUT -p tcp -s zmpha-wfan-10024-master3.wflab.io --dport 4369 -j DROP). After this, REST API calls start to hang.
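
Step 2 can also be scripted through the management HTTP API. This is a rough sketch that only builds the PUT request for a durable quorum queue. The queue name, the `guest`/`guest` credentials and the helper name are made up for illustration; the default vhost `/` has to be percent-encoded as `%2F` in the path.

```python
import base64
import json
import urllib.parse
import urllib.request


def build_declare_quorum_queue(host, name, vhost="/", port=15672,
                               user="guest", password="guest"):
    """Build (but do not send) a PUT /api/queues/{vhost}/{name} request."""
    url = "http://{}:{}/api/queues/{}/{}".format(
        host, port,
        urllib.parse.quote(vhost, safe=""),  # "/" becomes "%2F"
        urllib.parse.quote(name, safe=""),
    )
    payload = {
        "durable": True,       # quorum queues must be durable
        "auto_delete": False,
        "arguments": {"x-queue-type": "quorum"},
    }
    req = urllib.request.Request(url, data=json.dumps(payload).encode(), method="PUT")
    req.add_header("Content-Type", "application/json")
    token = base64.b64encode(f"{user}:{password}".encode()).decode()
    req.add_header("Authorization", f"Basic {token}")
    return req
```

Passing the returned object to `urllib.request.urlopen` sends it. With the config above, the management listener is bound to 127.0.0.1, so the script would have to run on the node itself.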

This behavior blocks some pretty standard maintenance operations, because in our scenario we need to periodically gather queue statistics.

Answered by michaelklishin

michaelklishin · 2022-10-07T16:22:22Z

Maintainer

This works as expected.

When you query all queues or all connections or other "all things" endpoints, the node that handles the request aggregates the results from its peers and returns them to the client. There is a certain timeout involved. If one node is disconnected without being shut down, all requests to it from its peers will block until they time out.
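
To illustrate the pattern (this is not RabbitMQ's actual code, which is written in Erlang), here is a small Python simulation of a node fanning a request out to its peers and waiting up to a deadline. The node names, the `fake_fetch` helper and the timings are invented for the example.

```python
import concurrent.futures
import time


def aggregate(peers, fetch, timeout):
    """Query all peers in parallel; return (results, peers_that_timed_out)."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=len(peers))
    futures = {pool.submit(fetch, peer): peer for peer in peers}
    # The caller is blocked here until every peer answers or the deadline passes.
    done, not_done = concurrent.futures.wait(futures, timeout=timeout)
    results = {futures[f]: f.result() for f in done}
    missing = sorted(futures[f] for f in not_done)
    pool.shutdown(wait=False)  # do not wait for the hung call to finish
    return results, missing


def fake_fetch(peer):
    """Stand-in for a per-node stats call; node3 simulates an unreachable peer."""
    if peer == "node3":
        time.sleep(0.5)  # simulated hang, kept short for the demo
    return {"queues": 2}


start = time.monotonic()
results, missing = aggregate(["node1", "node2", "node3"], fake_fetch, timeout=0.2)
elapsed = time.monotonic() - start
print(sorted(results), missing, round(elapsed, 1))
```

The healthy peers answer immediately, yet the caller still waits for the full deadline because of the one peer that never replies. That is the shape of the delay reported above, just with a much shorter timeout.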

Monitoring using GET /api/queues, GET /api/connections and, in late 2022, using HTTP API queries at all, is wrong. It is very common to see people use those endpoints to get a single field from a single object. That's really wasteful, as most of the metrics returned are not used at all.

This problem is not present in the Prometheus endpoint because every node returns only its own stats, and the aggregation is done at a later point by tools such as Grafana. Prometheus endpoint scraping is recommended for other reasons which are documented in the RabbitMQ guide on monitoring.
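
A rough sketch of that "aggregate later" model: each node is scraped independently and the numbers are combined afterwards, so an unreachable node only leaves a gap instead of stalling the others. The scrape texts and values below are made up, the metric name is used for illustration, and the tiny parser ignores optional sample timestamps.

```python
def parse_metric(exposition, metric):
    """Sum all samples of `metric` in one Prometheus text exposition."""
    total = 0.0
    for line in exposition.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        name_and_labels, _, value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]  # drop any {label="..."} part
        if name == metric:
            total += float(value)
    return total


def aggregate(scrapes, metric):
    """Combine per-node scrapes; nodes whose scrape failed (None) are skipped."""
    return sum(parse_metric(text, metric)
               for text in scrapes.values() if text is not None)


# Hypothetical scrape results: node3 is down, so its scrape simply failed.
NODE_SCRAPES = {
    "node1": "# TYPE rabbitmq_queue_messages_ready gauge\n"
             "rabbitmq_queue_messages_ready 12\n",
    "node2": "rabbitmq_queue_messages_ready 3\n",
    "node3": None,
}

print(aggregate(NODE_SCRAPES, "rabbitmq_queue_messages_ready"))
```

In a real setup Prometheus does the scraping and Grafana (or a PromQL query) does the summing; the point here is only that no node has to wait for its peers.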

24 replies

fare1990 Oct 18, 2022
Author

Even with only name column requested - same delays
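
For reference, restricting the response to the name column looks roughly like this with the `columns` query parameter of the HTTP API. The helper name and host are placeholders.

```python
import urllib.parse


def queues_url(host, columns, port=15672):
    """Build a GET /api/queues URL limited to the given columns."""
    query = urllib.parse.urlencode({"columns": ",".join(columns)}, safe=",")
    return f"http://{host}:{port}/api/queues?{query}"


print(queues_url("127.0.0.1", ["name"]))
```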

kjnilsson Oct 19, 2022
Maintainer

Ok I can see now that there are some paths in the code that do fanout stat collection even if the fields aren't included.

I've started some tweaks in this PR: #6183

If you're at all able to test it in your environment it would of course be helpful and speed eventual resolution up.

fare1990 Nov 2, 2022
Author

@kjnilsson Great, thanks!

I'm able to test it with rpm available

fare1990 Nov 15, 2022
Author

@kjnilsson @michaelklishin
I've tested 3.10.11 with the improvement and it works perfectly. Thanks!

kjnilsson Nov 15, 2022
Maintainer

Great, thanks for letting us know.
