Replies: 6 comments 11 replies
-
Could you explain these in more detail:
-
We cannot suggest anything without node logs. A "frozen" node may be waiting for a known peer to become reachable, which happens mid-boot. Nodes in this state won't accept any client connections. Newly booted nodes can also be performing peer discovery; again, they won't finish booting, accept client connections, or start any plugins until that completes. This sounds to me like a case for a health check and a readiness probe (the idea is not limited to Kubernetes).
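As a rough illustration of what such a probe could look like outside Kubernetes, here is a minimal Python sketch that runs a few `rabbitmq-diagnostics` checks in order and stops at the first failure. It assumes the RabbitMQ `sbin` directory is on `PATH`. The function names `run_cli` and `probe`, the 30-second timeout, and the choice of checks are illustrative assumptions, not a recommended set. A command that hangs past the timeout is treated as a failed check, which is the case of interest for a node that appears frozen.

```python
import shutil
import subprocess

# Subcommands of rabbitmq-diagnostics, from cheapest to most thorough.
# Which checks to include is an assumption made for this sketch.
CHECKS = ("ping", "check_running", "check_port_connectivity")


def run_cli(argv, timeout=30.0):
    """Run a CLI command and return its exit code; a timeout or a missing
    executable counts as failure (exit code 1)."""
    try:
        return subprocess.run(argv, capture_output=True, timeout=timeout).returncode
    except (OSError, subprocess.TimeoutExpired):
        return 1


def probe(run=run_cli, exe=None):
    """Run each check in order and stop at the first failure.

    Returns a dict mapping check name to True/False for the checks that ran.
    `run` is injectable so the logic can be exercised without a broker.
    """
    # shutil.which honours PATHEXT on Windows, so it can locate the .bat wrapper.
    exe = exe or shutil.which("rabbitmq-diagnostics") or "rabbitmq-diagnostics"
    results = {}
    for name in CHECKS:
        results[name] = run([exe, "-q", name]) == 0
        if not results[name]:
            break
    return results


if __name__ == "__main__":
    outcome = probe()
    for name, ok in outcome.items():
        print(f"{name}: {'ok' if ok else 'FAILED'}")
    raise SystemExit(0 if all(outcome.values()) else 1)
```

Run on a schedule (for example from Task Scheduler), the timestamp of the first non-zero exit would help narrow down when the node stops responding.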
-
Here is a copy of the log over the past few days. Keep in mind that we have restarted the service manually at times as part of troubleshooting. I also restarted the service sometime yesterday; at the moment it is still running, and rabbitmq-diagnostics status outputs metrics successfully.
-
Health check as of now:
-
Readiness probe results:
C:\Program Files\RabbitMQ Server\rabbitmq_server-3.10.1\sbin>rabbitmq-plugins -q list --enabled --minimal
C:\Program Files\RabbitMQ Server\rabbitmq_server-3.10.1\sbin>rabbitmq-plugins -q is_enabled rabbitmq_shovel
-
Sometime within the last 1.5-2 hours, the node appears to have shut down again.
Here is the updated diagnostics status:
Updated rabbit logs:
Erlang cookies still match between C:\Users{network user account} and C:\Windows\System32\config\systemprofile.
-
RabbitMQ v3.11.8 (default setup, single node) x64
Erlang 25.1.2 (13.1.2) - x64
Windows Server 2016 - x64
FortiEDR Protection
Certero Platform
Installed with admin network user account
DFS / Network File Share
Erlang cookies match between C:\Users{account} and C:\Windows\System32\config\systemprofile
All client Windows services that use RabbitMQ have been temporarily disabled to avoid flooding the Event Logs with Broker Unreachable errors.
Issue: After the RabbitMQ service is restarted, it runs for several hours, accepting connections and serving the management plugin. Then, at some random point (generally after 8-10 hours), the node seems to become unreachable (the service is still running) and I can no longer access the management plugin in the browser (it just spins). If I restart the service, the behavior repeats.
One of our production VMs has been running RabbitMQ without issues. It has the same configuration, except that its OS is still Windows Server 2012 x64. However, prior to mid-January 2023, all VMs running RabbitMQ were on Server 2012 x64 and worked without issues. Then the lower environment VMs began showing the issue I'm reporting.
I've worked with my DevOps team on real-time monitoring to try to capture any resource issues or errors, but we didn't notice any resource problems, and RabbitMQ never logs any warnings or errors to the server log or the Event Logs.
After RabbitMQ seemingly freezes and I try to access the management plugin in the browser (which hangs), netstat reports connections in CLOSE_WAIT on port 15672. From our troubleshooting, though, this seems specific to the issue of the node becoming unreachable.
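For context, CLOSE_WAIT means the remote side has closed the connection but the local process has not yet closed its socket, so a growing count on 15672 would be consistent with the management listener no longer servicing its connections. To track this over time rather than eyeballing it, a small Python sketch along these lines could tally TCP states for a port from saved `netstat -ano` output. It assumes the Windows column layout (Proto, Local Address, Foreign Address, State, PID). The function name `count_states` and the sample rows are made up for illustration.

```python
from collections import Counter


def count_states(netstat_output, port=15672):
    """Tally TCP connection states for sockets whose local port is `port`.

    Assumes Windows-style `netstat -an` / `netstat -ano` rows:
    Proto, Local Address, Foreign Address, State[, PID].
    """
    counts = Counter()
    for line in netstat_output.splitlines():
        fields = line.split()
        # Skip headers, blank lines, and UDP rows (which have no state column).
        if len(fields) < 4 or fields[0].upper() != "TCP":
            continue
        local_addr, state = fields[1], fields[3]
        # rsplit handles both 10.0.0.5:15672 and [::]:15672.
        if local_addr.rsplit(":", 1)[-1] == str(port):
            counts[state] += 1
    return counts


if __name__ == "__main__":
    # Illustrative rows only, not taken from the affected machine.
    sample = """
      Proto  Local Address          Foreign Address        State           PID
      TCP    0.0.0.0:15672          0.0.0.0:0              LISTENING       1234
      TCP    10.0.0.5:15672         10.0.0.9:51234         CLOSE_WAIT      1234
    """
    print(dict(count_states(sample)))
```

Feeding it periodic captures (for example `netstat -ano > capture.txt` every few minutes) would show whether the CLOSE_WAIT count starts climbing before or only after the node becomes unreachable.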
I agree that there seems to be something "special" about these lower environment VMs that is now causing RabbitMQ to be unstable, but I haven't found it.
I'd like to get this resolved, so let's keep the conversation going.