(Exchange federation) Intermittent shutdown of federation links #15352
Describe the bug
Hi, we use Celery in a distributed environment with federated RMQ exchanges. We recently upgraded RabbitMQ from 3.10.25 to 4.2.1 (Erlang 27.3.4.6), and almost every day one particular federated exchange link gets shut down:
The only way to fix it is to restart the downstream RMQ instance. When I look at the RMQ logs I see this error:
Heartbeat timeout
Our code is mostly the same and the federation configuration has not changed.
Replies: 5 comments 3 replies
Please familiarize yourself with GitHub's features for formatting comments when you have to provide a lot of text: Pasting a wall of text, as you did, is lazy. I re-formatted your text for you. With regard to the federation error, it's pretty clear: You've provided only this information for us to work with:
Heartbeat timeouts are usually due to network devices interfering between an AMQP client (your downstream broker) and AMQP server (the upstream). You should take a look here at our general guidelines for how to report RabbitMQ issues: Hopefully those questions will lead you to a root cause.
@selim1965 those links run into missed heartbeats. Our team cannot help you with those. I don't see why that failure scenario would be handled differently from the rest. If you can reproduce this behavior with ToxiProxy, we'd be interested in learning more. We cannot tell you what may be preventing that link from restarting and reconnecting, see Troubleshooting Network Connectivity. False positives from heartbeats are very rare assuming a reasonably high value is used (e.g. not 1-2 seconds, those are guaranteed to produce false positives). Restarting nodes should not be necessary. To restart federation links, you can (pick one or more options, they do not depend on one another):
Thank you @michaelklishin, we'll continue troubleshooting and take your suggestions into account. The errors indicate a
We have our heartbeat set to 10 seconds currently:
Just wanted to give an update. We have not seen this happen for the last week or so. Maybe it was an isolated incident, since we do have remote studios where we lose connection once in a while. Do you recommend that we increase the heartbeat timeout from 10 to 30? Thanks

@selim1965 my earlier recommendation quite explicitly referred to Heartbeats and TCP Proxies.
Per your own words, the hypothesis was correct. If so, what exactly would lead you to believe that going back to higher heartbeat values is a good idea? I won't take "I just want to clarify" for an answer, we are not your "free RabbitMQ DevOps on the Internet".
A heartbeat frame takes less than 20 bytes and, with a 10 second timeout, is sent every 5 seconds (with correctly implemented clients, or every 10 seconds with others). Both sound acceptable to me.
I cannot rule out other changes or factors but the doc section describes a very specific scenario.
Please take it from here.
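
To put the heartbeat numbers from the reply above in perspective, here is a rough back-of-the-envelope sketch in Python. It only restates the figures quoted in the thread: a frame of at most 20 bytes, sent at half the timeout by correctly implemented clients or once per timeout by others. The function name `heartbeat_traffic` and its parameters are hypothetical, introduced purely for illustration, and the result is an upper bound for one direction of one connection, not a measurement.

```python
SECONDS_PER_DAY = 86_400


def heartbeat_traffic(timeout_s, frame_bytes=20, correct_client=True):
    """Estimate heartbeat interval and daily heartbeat bytes (upper bound).

    Assumptions taken from the thread, not from measurement:
    - a heartbeat frame is at most `frame_bytes` (quoted as "less than 20 bytes");
    - a correctly implemented client sends one frame every timeout / 2 seconds,
      other clients send one frame every `timeout_s` seconds.
    """
    interval_s = timeout_s / 2 if correct_client else float(timeout_s)
    frames_per_day = SECONDS_PER_DAY / interval_s
    return interval_s, frames_per_day * frame_bytes


if __name__ == "__main__":
    for timeout in (10, 30):
        for correct in (True, False):
            interval, daily_bytes = heartbeat_traffic(timeout, correct_client=correct)
            print(
                f"timeout={timeout}s correct_client={correct}: "
                f"one frame every {interval:g}s, "
                f"at most {daily_bytes / 1024:.1f} KiB/day"
            )
```

Under these assumptions a 10 second timeout costs at most a few hundred KiB per day per connection, which is why the bandwidth cost of keeping the lower value is negligible.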