Describe the bug
TL;DR: We found a mismatch between the rabbit_db_topic_exchange trie records and the rabbit_db_binding records after a rolling RabbitMQ server update (we saw some flow-controlled channels, so we decided to move our nodes to a slightly larger instance size). As a result, rabbitmqctl listed the correct bindings and the web UI displayed all the correct bindings, but messages got routed in an unexpected way.
RabbitMQ server version used: RabbitMQ 3.13.7, Erlang 26.2.5.2
We have the following exchange setup for handling incoming requests:
1: Exchange eve-requests-v2
a: This has bindings to queues keyed on the request 'type'. Each of our services starts its own queue and binds it to eve-requests-v2 with all the request 'types' it can handle.
b: This exchange also has a '#' wildcard binding to the eve-v3-requests exchange.
2: Exchange eve-v3-requests
a: This has an alternate exchange, directing unmatched traffic to a third exchange, 'unroutable-requests'.
b: Like eve-requests-v2, this has bindings to queues keyed on the request 'type'; each service binds its queue to eve-v3-requests with all the request 'types' it can handle.
3: Exchange unroutable-requests
a: A queue has a '#' binding on this exchange; its purpose is to alert us when incoming messages arrive that no service expects.
So each one of our services creates its own queue and binds it to both eve-requests-v2 and eve-v3-requests.
We're relying on RabbitMQ to collapse and de-duplicate the queue destinations when routing messages. (A sketch of this topology follows.)
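For reference, a minimal sketch of this topology using the Erlang amqp_client. Only the exchange names and the alternate-exchange argument come from our real setup; the queue name, the routing key, and the 'topic' type of unroutable-requests are illustrative assumptions.
```erlang
-module(topology_sketch).
-include_lib("amqp_client/include/amqp_client.hrl").
-export([declare/0]).

declare() ->
    {ok, Conn} = amqp_connection:start(#amqp_params_network{virtual_host = <<"eve-live">>}),
    {ok, Ch} = amqp_connection:open_channel(Conn),
    %% Exchange receiving traffic that eve-v3-requests could not route.
    #'exchange.declare_ok'{} =
        amqp_channel:call(Ch, #'exchange.declare'{exchange = <<"unroutable-requests">>,
                                                  type = <<"topic">>, durable = true}),
    %% eve-v3-requests hands unmatched messages to its alternate exchange.
    #'exchange.declare_ok'{} =
        amqp_channel:call(Ch, #'exchange.declare'{
            exchange = <<"eve-v3-requests">>, type = <<"topic">>, durable = true,
            arguments = [{<<"alternate-exchange">>, longstr, <<"unroutable-requests">>}]}),
    #'exchange.declare_ok'{} =
        amqp_channel:call(Ch, #'exchange.declare'{exchange = <<"eve-requests-v2">>,
                                                  type = <<"topic">>, durable = true}),
    %% '#' exchange-to-exchange binding: everything published to
    %% eve-requests-v2 also flows through eve-v3-requests.
    #'exchange.bind_ok'{} =
        amqp_channel:call(Ch, #'exchange.bind'{destination = <<"eve-v3-requests">>,
                                               source = <<"eve-requests-v2">>,
                                               routing_key = <<"#">>}),
    %% Each service queue binds the same request-'type' keys to both exchanges.
    Q = <<"example-service-queue">>,     %% hypothetical queue name
    RK = <<"example_api_getrequest.#">>, %% hypothetical request 'type' pattern
    #'queue.declare_ok'{} = amqp_channel:call(Ch, #'queue.declare'{queue = Q, durable = true}),
    [#'queue.bind_ok'{} = amqp_channel:call(Ch, #'queue.bind'{queue = Q, exchange = X,
                                                              routing_key = RK})
     || X <- [<<"eve-requests-v2">>, <<"eve-v3-requests">>]],
    ok.
```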
We started to observe some flow control as our traffic was growing (namely channels and connections being flow-controlled, but not queues themselves). Our guess was that some queue processes couldn't keep up with traffic, so we wanted to change instance types in AWS to give the nodes more CPU resources (specifically more single-core performance).
We then did a rolling update on our RabbitMQ cluster (of 3 nodes): we brought a new node up, joined it to the cluster, let things stabilize for a few minutes, then took an old, smaller node out, rinsing and repeating until only new nodes were left in the cluster.
When verifying the update, we checked with rabbitmqctl that all the queues and bindings we expected were in place, and everything seemed good.
Then soon after this, our system alerted us about some unroutable messages that should have been routed correctly. So we started digging.
Our expected case, with a service that was working after the rolling update:
```
rabbit_db_topic_exchange:match({resource,<<"eve-live">>,exchange,<<"eve-requests-v2">>}, <<"eve_assetholding_asset_api_spawnrequest.tranquility">>, Opts).
[{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
 {resource,<<"eve-live">>,queue,
  <<"esg-eve-requests-eve-asset-holding-esgc_erh">>}]

rabbit_db_topic_exchange:match({resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>}, <<"eve_assetholding_asset_api_spawnrequest.tranquility">>, Opts).
[{resource,<<"eve-live">>,queue,
  <<"esg-eve-requests-eve-asset-holding-esgc_erh">>}]
```
So the same queue has the same binding on both eve-requests-v2 and eve-v3-requests exchanges.
A scenario that caused an unroutable-message alert in our system:
```
rabbit_db_topic_exchange:match({resource,<<"eve-live">>,exchange,<<"eve-requests-v2">>}, <<"eve_cosmetic_market_skin_listing_api_getallrequest.tranquility">>, Opts).
[{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
 {resource,<<"eve-live">>,queue,
  <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>}]

rabbit_db_topic_exchange:match({resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>}, <<"eve_cosmetic_market_skin_listing_api_getallrequest.tranquility">>, Opts).
[]
```
Since we had no match on eve-v3-requests, the message got handed off to the alternate exchange, which caused our unroutable-message alerts to be emitted.
But at the same time, this is what we saw via the web UI and rabbitmqctl:
```
root@:~# rabbitmqctl list_bindings --vhost eve-live | grep eve-v3-requests | grep esg-eve-requests-eve-cosmetic-market
eve-v3-requests exchange esg-eve-requests-eve-cosmetic-market-esgc_erh queue eve_cosmetic_market_component_listing_api_getallrequest.# []
eve-v3-requests exchange esg-eve-requests-eve-cosmetic-market-esgc_erh queue eve_cosmetic_market_skin_listing_api_admin_getalllistedbycharacterrequest.# []
eve-v3-requests exchange esg-eve-requests-eve-cosmetic-market-esgc_erh queue eve_cosmetic_market_skin_listing_api_admin_setexpirationrequest.# []
eve-v3-requests exchange esg-eve-requests-eve-cosmetic-market-esgc_erh queue eve_cosmetic_market_skin_listing_api_getallrequest.# []
eve-v3-requests exchange esg-eve-requests-eve-cosmetic-market-esgc_erh queue eve_cosmetic_market_transaction_api_getallrequest.# []
eve-v3-requests exchange esg-eve-requests-eve-cosmetic-market-esgc_erh queue eve_cosmetic_market_transaction_api_getrequest.# []
root@:~# rabbitmqctl list_bindings --vhost eve-live | grep eve-v3-requests | grep esg-eve-requests-eve-cosmetic-market | wc -l
6
```
If I queried the binding information via rabbit_db_binding from a shell attached to the RabbitMQ cluster, I got the same information:
```
io:format("~p", [
    lists:filter(
        fun({binding, {resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>}, _,
             {resource,<<"eve-live">>,queue,<<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>}, _}) -> true;
           (_) -> false
        end,
        rabbit_db_binding:get_all_for_source({resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>}))
]).
[{binding,{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
  <<"eve_cosmetic_market_component_listing_api_getallrequest.#">>,
  {resource,<<"eve-live">>,queue,
   <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>},
  []},
 {binding,{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
  <<"eve_cosmetic_market_skin_listing_api_admin_getalllistedbycharacterrequest.#">>,
  {resource,<<"eve-live">>,queue,
   <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>},
  []},
 {binding,{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
  <<"eve_cosmetic_market_skin_listing_api_admin_setexpirationrequest.#">>,
  {resource,<<"eve-live">>,queue,
   <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>},
  []},
 {binding,{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
  <<"eve_cosmetic_market_skin_listing_api_getallrequest.#">>,
  {resource,<<"eve-live">>,queue,
   <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>},
  []},
 {binding,{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
  <<"eve_cosmetic_market_transaction_api_getallrequest.#">>,
  {resource,<<"eve-live">>,queue,
   <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>},
  []},
 {binding,{resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>},
  <<"eve_cosmetic_market_transaction_api_getrequest.#">>,
  {resource,<<"eve-live">>,queue,
   <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>},
  []}]
ok
```
If I queried the trie records though:
```
lists:map(
    fun(BK) ->
        rabbit_db_topic_exchange:trie_records_to_key(
            mnesia:dirty_read(rabbit_topic_trie_binding, BK))
    end,
    lists:filter(
        fun({trie_binding, {resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>}, _,
             {resource,<<"eve-live">>,queue,<<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>}, _}) -> true;
           (_) -> false
        end,
        mnesia:dirty_all_keys(rabbit_topic_trie_binding))).
[[{#trie_binding{exchange_name = #resource{virtual_host = <<"eve-live">>,
                                           kind = exchange,name = <<"eve-v3-requests">>},
                 node_id = <<202,152,109,239,157,189,40,151,231,221,58,199,
                             62,85,232,60>>,
                 destination = #resource{virtual_host = <<"eve-live">>,
                                         kind = queue,
                                         name = <<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>},
                 arguments = []},
   ["eve_cosmetic_market_skin_listing_api_admin_getalllistedbycharacterrequest",
    "#"]}]]
```
I only found trie records for one of the six bindings.
So it seems the binding information stored in these two places got desynced during the rolling update process.
(The trie information for the working service completely matches up with its other binding information; I can provide it if needed.)
Reproduction steps
We cannot consistently reproduce this issue. But when it happened:
- Do a rolling update on a cluster of 3 nodes: bring a new node up, wait a few minutes, bring an old node down.
- Compare the rabbit_db_binding and rabbit_db_topic_exchange records; they sometimes diverge (a sketch of one way to compare them follows this list).
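A hedged sketch of such a comparison, run from a shell attached to a cluster node. It assumes the same 3.13.x internals used elsewhere in this report; the vhost and exchange names are from our setup:
```erlang
%% Compare routing keys known to rabbit_db_binding with routing keys
%% recoverable from the rabbit_topic_trie_binding Mnesia table.
X = {resource, <<"eve-live">>, exchange, <<"eve-v3-requests">>},
FromBindings =
    lists:sort([Key || {binding, _Src, Key, _Dst, _Args}
                           <- rabbit_db_binding:get_all_for_source(X)]),
FromTrie =
    lists:sort(
      [list_to_binary(lists:join(".", Words))
       || TB = {trie_binding, XName, _, _, _}
              <- mnesia:dirty_all_keys(rabbit_topic_trie_binding),
          XName =:= X,
          {_, Words} <- rabbit_db_topic_exchange:trie_records_to_key(
                            mnesia:dirty_read(rabbit_topic_trie_binding, TB))]),
%% A non-empty result lists binding keys whose trie records are missing.
FromBindings -- FromTrie.
```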
Expected behavior
We expected bindings not just to be displayed correctly, but also to route incoming messages correctly.
Additional context
We solved the immediate issue by executing this code snippet:
```
lists:map(
    fun(B) -> rabbit_db_topic_exchange:set(B) end,
    lists:filter(
        fun({binding, {resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>}, _,
             {resource,<<"eve-live">>,queue,<<"esg-eve-requests-eve-cosmetic-market-esgc_erh">>}, _}) -> true;
           (_) -> false
        end,
        rabbit_db_binding:get_all_for_source({resource,<<"eve-live">>,exchange,<<"eve-v3-requests">>}))).
```
This reconstructed the trie records from the stored bindings and solved the issue we were having.
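For completeness, a hedged generalization of the same repair that re-sets every binding of a given topic exchange. It assumes rabbit_db_topic_exchange:set/1 is safe to call for bindings whose trie records already exist, which matched our experience:
```erlang
%% Rebuild all trie records for one topic exchange from rabbit_db_binding.
X = {resource, <<"eve-live">>, exchange, <<"eve-v3-requests">>},
[rabbit_db_topic_exchange:set(B) || B <- rabbit_db_binding:get_all_for_source(X)].
```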
However, we're no closer to understanding why this happened in the first place (apart from it likely being a Mnesia synchronization error).
Some of our services still rely on classic mirrored queues, so we can't update to 4.x and use Khepri just yet.