Rabbit MQ intermittent broker unreachable until service is restarted. #7406

kwilla822 · 2023-02-23T21:59:44Z

RabbitMQ v3.11.8 (default setup, single node) x64
Erlang 25.1.2 (13.1.2) - x64
Windows Server 2016 - x64
FortiEDR Protection
Certero Platform
Installed with admin network user account
DFS / Network File Share
Erlang cookies match between C:\Users\{account} and C:\Windows\System32\config\systemprofile

All client Windows services that use RabbitMQ have been temporarily disabled to avoid flooding the Event Logs with "Broker Unreachable" errors.

Issue: After the RabbitMQ service is restarted, it runs for up to several hours, accepting connections and serving the management plugin. Then, at some random point (generally after 8-10 hours), the node seems to become unreachable even though the service is still running, and I can no longer access the management plugin in the browser (it just spins). If I restart the service, the behavior repeats.

One of our production VMs has been running RabbitMQ without issues. It has the same configuration, except that its OS is still Windows Server 2012 x64. Before mid-January 2023, all VMs running RabbitMQ were on Server 2012 x64 and worked without issues. Then the lower environment VMs began showing the issue I'm reporting.

I've worked with my DevOps team on real-time monitoring to try to capture any resource issues or errors, but we didn't see any resource problems, and RabbitMQ never logs any warnings or errors to the server log or the Event Logs.

After RabbitMQ seemingly freezes and I try to access the management plugin in the browser (which hangs), netstat reports connections in CLOSE_WAIT on port 15672. From our troubleshooting, though, this seems to occur only once the node has become unreachable.

I agree that there seems to be something "special" about these lower environment VMs that is now causing RabbitMQ to be unstable, but I haven't found it.
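For anyone who wants to track the CLOSE_WAIT observation over time, here is a minimal sketch that counts TCP states per local port from `netstat -ano` text. CLOSE_WAIT means the remote side has closed the connection but the local process has not yet closed its socket, which would fit a process that has stopped servicing its sockets. The function name `count_states` is made up for illustration, and the parser assumes the usual five-column Windows `netstat -ano` layout.

```python
from collections import Counter


def count_states(netstat_output: str, port: int) -> Counter:
    """Count TCP connection states for sockets whose local port is `port`.

    Assumes the Windows `netstat -ano` layout:
    Proto  Local Address  Foreign Address  State  PID
    """
    states = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # TCP rows have exactly five columns; headers and UDP rows do not.
        if len(fields) != 5 or fields[0].upper() != "TCP":
            continue
        local_address, state = fields[1], fields[3]
        # rpartition also copes with IPv6 forms such as [::]:15672.
        _, _, local_port = local_address.rpartition(":")
        if local_port == str(port):
            states[state] += 1
    return states
```

One way to feed it is `subprocess.run(["netstat", "-ano"], capture_output=True, text=True).stdout`, sampled every few minutes, so that a growing CLOSE_WAIT count on 15672 can be lined up against the time the node stops responding.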

I'd like to get this resolved, so let's keep the conversation going.

lukebakken · 2023-02-23T22:03:34Z

Maintainer

If Windows is running in a VM, what virtualization environment is being used?
What other services are running? Security and virus scanners do weird things.

Could you explain these in more detail:

FortiEDR Protection
Certero Platform
DFS / Network File Share - is this for the RabbitMQ data directory or for the entire disk?

2 replies

kwilla822 Feb 23, 2023
Author

Lower Environment VMs are virtualized with Nutanix (RabbitMQ not working here)
Production Environment VM is virtualized with VMware (but moving soon to Nutanix) (RabbitMQ working here)

I've discussed this virtualization difference with my DevOps team. It doesn't seem to be what is causing the issue with RabbitMQ (as far as we know), but no one will commit 100%. In other words, nothing obvious has been identified yet.

FortiEDR Protection - This is a service that monitors running processes on our network for potential threats. If it is not familiar with a particular process, that process is blocked until an exception is created for it. It keeps a history on each VM showing which executables have been blocked. As best I can tell, it is not interfering with RabbitMQ at the moment (i.e. erl.exe, erlsrv.exe, epmd.exe).

Certero Platform - This platform is used by our Infrastructure team to push updates to workstations and VMs across the network.

DFS / Network File Share - The share covers the entire disk, not just the RabbitMQ data directory.

lukebakken Feb 24, 2023
Maintainer

Lower Environment VMs are virtualized with Nutanix (RabbitMQ not working here)
Production Environment VM is virtualized with VMware (but moving soon to Nutanix) (RabbitMQ working here)

This seems to be the big difference and most likely explanation, eh?

I'm assuming that "Certero" and "FortiEDR" are running and configured the same in both environments.

michaelklishin · 2023-02-24T06:30:02Z

Maintainer

We cannot suggest anything without node logs. A "frozen" node can be waiting for a known peer to become reachable, which happens mid-boot. Nodes in this state won't accept any client connections.

Newly booted nodes can also be performing peer discovery; again, they won't finish booting, accept client connections, or start plugins until that completes.

This sounds to me like a case for a health check and a readiness probe (the idea is not limited to Kubernetes).
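As a rough illustration of the health check idea, the sketch below runs a few `rabbitmq-diagnostics` health-check commands in order and reports the first one that fails. The commands `ping`, `check_running` and `check_port_connectivity` are existing `rabbitmq-diagnostics` subcommands; the function names, the 30-second timeout and the choice of these three checks are assumptions of this sketch, not a recommendation from the thread.

```python
import subprocess
from typing import Callable, Optional, Sequence

# Health-check subcommands, ordered from cheapest to most thorough.
CHECKS = ("ping", "check_running", "check_port_connectivity")


def run_cli(args: Sequence[str], timeout: float) -> bool:
    """Return True only if the command exits with status 0 within `timeout`."""
    try:
        completed = subprocess.run(
            args,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
            timeout=timeout,
        )
    except (subprocess.TimeoutExpired, OSError):
        # A hung CLI call or a missing executable both count as a failure.
        return False
    return completed.returncode == 0


def first_failing_check(
    cli: str = "rabbitmq-diagnostics",
    timeout: float = 30.0,
    runner: Callable[[Sequence[str], float], bool] = run_cli,
) -> Optional[str]:
    """Run each check in order; return the first failing one, or None."""
    for check in CHECKS:
        if not runner([cli, "-q", check], timeout):
            return check
    return None
```

On Windows the `cli` argument would likely need to be the full path to `rabbitmq-diagnostics.bat` under the server's `sbin` directory. Scheduling this every few minutes and logging the result would give a timestamp for when the node stops answering, even when the node itself logs nothing.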

0 replies

kwilla822 · 2023-02-24T13:53:00Z

Author

Here is a copy of the log from the past few days. Keep in mind that we have restarted the service manually at times as part of the troubleshooting. Also, I restarted the service sometime yesterday; at the moment it is still running, and rabbitmq-diagnostics status outputs metrics successfully.

rabbit.log

1 reply

lukebakken Feb 24, 2023
Maintainer

Thanks, nothing stands out. If you want to get rid of the handle.exe warnings, download it from here and install it somewhere on the PATH. I usually recommend C:\Windows.

kwilla822 · 2023-02-24T14:04:23Z

Author

Health check as of now

rabbitmq-diagnostics.txt

0 replies

kwilla822 · 2023-02-24T14:07:56Z

Author

Readiness probe results:

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.10.1\sbin>rabbitmq-plugins -q list --enabled --minimal
rabbitmq_management

C:\Program Files\RabbitMQ Server\rabbitmq_server-3.10.1\sbin>rabbitmq-plugins -q is_enabled rabbitmq_shovel
Error:
Plugin rabbitmq_shovel is not enabled on node rabbit@--**. Enabled plugins and dependencies: amqp_client, cowboy, cowlib, rabbitmq_management, rabbitmq_management_agent, rabbitmq_web_dispatch

0 replies

kwilla822 · 2023-02-24T16:09:01Z

Author

Sometime within the last 1.5-2 hours, the node appears to have shut down again.

Here is the updated diagnostics status:
diagnostics.txt

Updated rabbit logs:
updated-rabbit.log

Erlang cookies still match between C:\Users\{network user account} and C:\Windows\System32\config\systemprofile.
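The cookie comparison can be scripted rather than checked by eye. This is a minimal sketch, assuming each of the two directories holds a `.erlang.cookie` file and that surrounding whitespace should be ignored; the helper name `cookies_match` is hypothetical.

```python
from pathlib import Path


def cookies_match(first: Path, second: Path) -> bool:
    """Return True if both cookie files hold the same non-empty value.

    Leading and trailing whitespace is ignored (an assumption of this sketch).
    """
    a = first.read_bytes().strip()
    b = second.read_bytes().strip()
    return len(a) > 0 and a == b
```

It would be called with the two `.erlang.cookie` paths, for example `Path(r"C:\Windows\System32\config\systemprofile") / ".erlang.cookie"` and the equivalent file under the service account's profile directory.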

8 replies

lukebakken Feb 24, 2023
Maintainer

Thanks. What happens when you run that command from another node in the cluster?

kwilla822 Feb 25, 2023
Author

Thanks. What happens when you run that command from another node in the cluster?

I'm not running any other nodes, only a single node. Also, unless it's part of the default installation/configuration, I didn't intentionally set up a cluster here.

add0212 Feb 25, 2023

Hi,

I'm facing a similar issue with RabbitMQ 3.9.14 and Erlang 24.2.1. After some time, the management console dashboard stops responding and other machines are unable to connect and consume messages from the queue. Restarting the RabbitMQ service makes it work again.

I ran the PowerShell commands and got the same result as already mentioned in the comments, and there is nothing in the RabbitMQ logs.

These logs are from the other machine, which tries to connect to RabbitMQ to consume messages. Both machines are Windows servers.

2023-Feb-25 09:13:28: ERROR:connection_workflow.py:334: Timeout while setting up AMQP to 'servername.domain.com'/(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('IP address', 5671)); ssl=True
2023-Feb-25 09:13:28: ERROR:connection_workflow.py:291: AMQPConnector - reporting failure: AMQPConnectorAMQPHandshakeError: AMQPConnectorStackTimeout("Timeout during AMQP handshake'servername.domain.com'/(<AddressFamily.AF_INET: 2>, <SocketKind.SOCK_STREAM: 1>, 6, '', ('IP address', 5671)); ssl=True")

lukebakken Feb 25, 2023
Maintainer

So far all evidence points to something in the environment that is suspending RabbitMQ or blocking connections to RabbitMQ. We have seen this before with products like VMware ESXi and vMotion, for instance. @add0212 please read the conversation up until now, especially the initial post and my response.
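One way to tell "connections are blocked" apart from "the process is suspended" from a client machine is a small probe against the plain AMQP listener. The sketch below opens a TCP connection, sends the AMQP 0-9-1 protocol header, and reports whether anything comes back. The function name `probe_amqp`, the result labels and the 5-second timeout are made up for illustration. It assumes the non-TLS listener on port 5672; a TLS listener such as 5671 would need the socket wrapped with `ssl` first.

```python
import socket

# Protocol header a client sends first when opening an AMQP 0-9-1 connection.
AMQP_HEADER = b"AMQP\x00\x00\x09\x01"


def probe_amqp(host: str, port: int = 5672, timeout: float = 5.0) -> str:
    """Classify how a plain (non-TLS) AMQP listener reacts to a connection."""
    try:
        sock = socket.create_connection((host, port), timeout=timeout)
    except ConnectionRefusedError:
        return "refused"  # nothing is listening on the port
    except OSError:
        return "unreachable"  # connect timed out or was otherwise blocked
    with sock:
        try:
            sock.sendall(AMQP_HEADER)
            reply = sock.recv(1)
        except socket.timeout:
            # The OS accepted the connection but the broker never answered.
            return "accepted-but-silent"
        except OSError:
            return "reset"
    return "responding" if reply else "closed"
```

A result of "accepted-but-silent" would suggest the process is alive at the OS level but not servicing sockets, while "unreachable" or "reset" would point more toward a firewall or endpoint-protection product in the path.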

add0212 Feb 25, 2023

Thanks, I will read the whole discussion and get back to you.
