Fix regression with Khepri binding args #14543
Conversation
This commit adds a test case for a regression/bug that occurs in Khepri.

```
make -C deps/rabbit ct-bindings t=cluster:binding_args RABBITMQ_METADATA_STORE=mnesia
```

succeeds, but

```
make -C deps/rabbit ct-bindings t=cluster:binding_args RABBITMQ_METADATA_STORE=khepri
```

fails.

The problem is that the ETS table `rabbit_khepri_index_route` cannot differentiate between two bindings that differ only in their binding arguments, and therefore deletes entries too early, leading to wrong routing decisions.

The fix is to include the binding arguments in the `rabbit_khepri_index_route` projection, mirroring how the binding args are included in the `rabbit_index_route` Mnesia table.

This bug/regression is an edge case: it is triggered only if the source exchange type is `direct` or `fanout` and client apps use different binding arguments. Note that such binding arguments are entirely ignored when RabbitMQ performs routing decisions for the `direct` or `fanout` exchange. However, client apps may use binding arguments to attach metadata to a binding, for example `app-id`, `user`, or `purpose`, and may use this metadata as a form of reference counting when deciding to delete `auto-delete` exchanges, or simply for informational/operational purposes.
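The failure mode can be illustrated with a small sketch (in Python rather than RabbitMQ's Erlang, with a hypothetical in-memory set standing in for the ETS projection): when binding arguments are left out of the index key, two distinct bindings collapse into one entry, and deleting either binding removes the route that the other should still provide.

```python
# Minimal model of the bug: an index keyed without binding arguments
# cannot tell two bindings apart, so deleting one binding removes the
# route entry that the surviving binding should still provide.

def make_index(include_args):
    index = set()  # stands in for the rabbit_khepri_index_route ETS table

    def key(exchange, routing_key, args):
        return (exchange, routing_key, args) if include_args \
            else (exchange, routing_key)

    def add_binding(exchange, routing_key, args):
        index.add(key(exchange, routing_key, args))

    def delete_binding(exchange, routing_key, args):
        index.discard(key(exchange, routing_key, args))

    def route(exchange, routing_key):
        # Routing ignores binding args, as for direct/fanout exchanges.
        return any(k[0] == exchange and k[1] == routing_key for k in index)

    return add_binding, delete_binding, route

# Buggy behaviour: args excluded from the key (pre-fix Khepri projection).
add, delete, route = make_index(include_args=False)
add("ex", "rk", ("app-id", "a"))
add("ex", "rk", ("app-id", "b"))   # collapses into the same set entry
delete("ex", "rk", ("app-id", "a"))
print(route("ex", "rk"))           # False: route lost, binding b remains

# Fixed behaviour: args included in the key (as in the Mnesia table).
add, delete, route = make_index(include_args=True)
add("ex", "rk", ("app-id", "a"))
add("ex", "rk", ("app-id", "b"))
delete("ex", "rk", ("app-id", "a"))
print(route("ex", "rk"))           # True: binding b still routes
```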
Prior to this commit, this test case failed in 4.2 <-> 4.1 mixed version mode because the different nodes register different projections; specifically, the old projection of the 4.1 node was registered. Independent of the test case, even after a rolling upgrade from 4.1 to this commit's branch completes, the old projection remains registered. What seems to be missing is a Khepri machine version migration that migrates the projection from old to new, but that is outside the scope of this bug fix. We can add this mechanism separately.
@ansd says in an internal chat:

> Another option is to create a new projection; in other words, rename the current one from

@the-mikedavis: Do you have an opinion?
Adding a v2 and eventually unregistering the v1 in
Thank you! Just so I understand correctly: should the v2 projection be registered upon node boot? That means it becomes available on the old nodes too in a mixed version cluster, but won't cause any harm there, right?
Yes, I confirm that the newer node will initialise the new projection on boot, and the ETS table will appear on all nodes, including old ones. That's ok because an old node won't use the new projection; the new node just needs to support it.
Closing this PR in favour of #14546
Fixes #14533
This PR fixes the bug for clusters created on 4.2 or later.
For clusters being upgraded from < 4.2 to >= 4.2, the bug still exists, which may be acceptable given how rarely the edge case is triggered. To fix the bug for such upgrades, RabbitMQ needs a way to migrate projections. One option is to unregister the old projection and register the new one in the `rabbitmq_4.2.0` feature flag migration callback. This would require blocking routing while that migration function runs.
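The "block routing while the migration runs" idea can be sketched conceptually (in Python, with hypothetical names; this is not RabbitMQ's actual feature-flag or Khepri API): routing reads and the projection swap share a lock, so lookups wait until the index has been rebuilt with the new key shape.

```python
import threading

class RoutingIndex:
    """Conceptual sketch only: a lock shared by routing reads and a
    projection migration, so routing blocks while the index is rebuilt
    with binding args included in the key."""

    def __init__(self, entries):
        self._lock = threading.Lock()
        # Old key shape: (exchange, routing_key)
        self._index = set(entries)

    def route(self, exchange, routing_key):
        with self._lock:  # blocks while migrate() holds the lock
            return any(k[:2] == (exchange, routing_key) for k in self._index)

    def migrate(self, bindings):
        # Analogous to unregistering the old projection and registering
        # the new one inside a feature-flag migration callback.
        with self._lock:
            # New key shape: (exchange, routing_key, args)
            self._index = {(ex, rk, args) for (ex, rk, args) in bindings}

idx = RoutingIndex({("ex", "rk")})
idx.migrate([("ex", "rk", ("app-id", "a"))])
print(idx.route("ex", "rk"))  # True: routing works with the new key shape
```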