Fix regression with Khepri binding args #14543

ansd · 2025-09-15T07:51:18Z

This PR fixes the bug for clusters created in >=4.2.
For clusters being upgraded from <4.2 to >=4.2, the bug still exists, which may be acceptable given how rarely the conditions that trigger it occur. To fix the bug for clusters upgraded from <4.2, RabbitMQ needs a way to migrate projections. One option is to unregister and re-register the projection in the rabbitmq_4.2.0 feature flag migration callback. This would require blocking routing while the migration function runs.

This commit adds a test case for a regression/bug that occurs in Khepri.

```
make -C deps/rabbit ct-bindings t=cluster:binding_args RABBITMQ_METADATA_STORE=mnesia
```

succeeds, but

```
make -C deps/rabbit ct-bindings t=cluster:binding_args RABBITMQ_METADATA_STORE=khepri
```

fails.

The problem is that ETS table `rabbit_khepri_index_route` cannot differentiate between two bindings with different binding arguments, and therefore deletes entries too early, leading to wrong routing decisions.

The solution to this bug is to include the binding arguments in the `rabbit_khepri_index_route` projection, similar to how the binding args are also included in the `rabbit_index_route` Mnesia table.

This bug/regression is an edge case and exists if the source exchange type is `direct` or `fanout` and if different binding arguments are used by client apps. Note that such binding arguments are entirely ignored when RabbitMQ performs routing decisions for the `direct` or `fanout` exchange. However, there might be client apps that use binding arguments to add some metadata to the binding, for example `app-id` or `user` or `purpose`, and might use this metadata as a form of reference counting in deciding when to delete `auto-delete` exchanges, or just for informational/operational purposes.
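To make the failure mode concrete, here is a minimal Python sketch of the idea, not RabbitMQ code. It assumes the index can be modelled as a set of rows; `Binding`, `RouteIndex` and `include_args` are hypothetical names invented for this illustration. With the binding arguments left out of the row, two bindings that differ only in their arguments collapse into one row, so deleting either binding removes the route for both.

```python
# Simplified model only: a Python set stands in for the ETS table, and all
# class and field names here are hypothetical, not RabbitMQ's.
from typing import NamedTuple


class Binding(NamedTuple):
    source: str
    key: str
    destination: str
    args: tuple  # (name, value) pairs, e.g. (("app-id", "a"),)


class RouteIndex:
    def __init__(self, include_args: bool):
        self.include_args = include_args
        self.rows = set()

    def _row(self, b: Binding) -> tuple:
        base = (b.source, b.key, b.destination)
        # Without args, bindings that differ only in args map to the same row.
        return base + (b.args,) if self.include_args else base

    def add(self, b: Binding) -> None:
        self.rows.add(self._row(b))

    def delete(self, b: Binding) -> None:
        self.rows.discard(self._row(b))

    def route(self, source: str, key: str) -> set:
        # Args never influence routing; they only keep the rows distinct.
        return {r[2] for r in self.rows if r[0] == source and r[1] == key}


b1 = Binding("ex", "k", "q1", (("app-id", "a"),))
b2 = Binding("ex", "k", "q1", (("app-id", "b"),))

for include_args in (False, True):
    index = RouteIndex(include_args)
    index.add(b1)
    index.add(b2)
    index.delete(b1)  # b2 still exists, so "q1" should remain routable
    print(include_args, sorted(index.route("ex", "k")))
# False []
# True ['q1']
```

In the model, the row without args loses the route to `q1` as soon as the first binding is deleted, while the row with args keeps it until the second binding is deleted as well.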

Fix #14533

Prior to this commit, this test case failed in 4.2 <-> 4.1 mixed version mode because the different nodes register different projections. Specifically, the old projection of the 4.1 node was registered. Independent of the test case, even when a rolling upgrade from 4.1 to this commit's branch completes, the old projection is still registered. It seems that what's missing is a Khepri machine version migration in which the projection is migrated from old to new. But that's outside the scope of this bug fix. We can add this mechanism separately.

dumbbell · 2025-09-15T13:54:30Z

@ansd says in an internal chat:

The PR fixes this issue for clusters created in 4.2+. However, if you upgrade from 4.1 with Khepri enabled previously, the bug still occurs. That's because the old projection table is still being used. The question is how do you upgrade projections?

One option I can think of: deregister the projection and then register again in the rabbitmq_4.2.0 feature flag migration function. That means we need to block routing though while the migration function runs.

Another option I can think of: run a migration in the machine_version upgrade. But it seems this API is meant to be private to the Khepri app and not meant to be hooked into by the rabbit app.

Yet another option for this specific PR: Provide the fix as is for new clusters created in 4.2+. Given this is a rare edge case, we can leave it unfixed for clusters created in 4.1.

Another option is to create a new projection; in other words, rename the current one from rabbit_khepri_index_route to rabbit_khepri_index_route_v2. The node running RabbitMQ 4.2 will get the new fixed projection and an old node will continue to use the old projection. We could then unregister the old projection in the migration function of the rabbitmq_4.2.0 feature flag. I think this is a good middle ground: the old node continues to suffer from the bug, but as you said, that’s probably ok because it is a rare use case.

@the-mikedavis: Do you have an opinion?

the-mikedavis · 2025-09-15T15:38:46Z

Adding a v2 and eventually unregistering the v1 in rabbitmq_4.2.0 also sounds like a good middle-ground to me 👍

ansd · 2025-09-15T15:42:02Z

Thank you! Just so that I understand correctly: should the v2 projection be registered upon node boot? This means it becomes available on the old nodes too in a mixed-version cluster, but won't cause any harm there, right?

dumbbell · 2025-09-15T15:52:19Z

Yes, I confirm that the newer node will initialise the new projection on boot and the ETS table will appear on all nodes, including an old one. That’s ok because the old node won’t use the new projection. The new node just needs to support the new projection.
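The rollout discussed in this thread can be sketched as a toy Python model. It does not use the Khepri or RabbitMQ APIs; the `Cluster` class and its methods are hypothetical, and the rule that the feature flag can only be enabled once every node runs 4.2+ is an assumption of the model.

```python
# Toy model of the v1 -> v2 projection rollout; every name except the two
# projection names is hypothetical.
V1 = "rabbit_khepri_index_route"
V2 = "rabbit_khepri_index_route_v2"


class Cluster:
    def __init__(self):
        self.projections = {V1}  # the old projection is already registered
        self.node_versions = {}

    def boot(self, node: str, version: tuple) -> None:
        self.node_versions[node] = version
        if version >= (4, 2):
            # A new node registers v2 on boot; the table appears on all nodes.
            self.projections.add(V2)

    def table_used_by(self, node: str) -> str:
        # Old nodes keep reading v1 and simply ignore the v2 table.
        return V2 if self.node_versions[node] >= (4, 2) else V1

    def enable_feature_flag_4_2_0(self) -> None:
        # Assumption of this model: all nodes must run 4.2+ first.
        if any(v < (4, 2) for v in self.node_versions.values()):
            raise RuntimeError("all nodes must run 4.2+ first")
        # The migration function unregisters the old projection.
        self.projections.discard(V1)


cluster = Cluster()
cluster.boot("old", (4, 1))
cluster.boot("new", (4, 2))
print(sorted(cluster.projections))
# ['rabbit_khepri_index_route', 'rabbit_khepri_index_route_v2']
print(cluster.table_used_by("old") == V1, cluster.table_used_by("new") == V2)
# True True

cluster.boot("old", (4, 2))  # rolling upgrade completes
cluster.enable_feature_flag_4_2_0()
print(sorted(cluster.projections))
# ['rabbit_khepri_index_route_v2']
```

In this model the old node never stops routing: it keeps reading v1 until it is upgraded, and v1 is only removed once no node can still depend on it.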

ansd · 2025-09-16T11:05:09Z

Closing this PR in favour of #14546

ansd added 2 commits September 15, 2025 08:59

Fix regression with Khepri binding args

6786e31

Fix #14533

ansd added the backport-v4.2.x label Sep 15, 2025

ansd marked this pull request as ready for review September 15, 2025 12:10

ansd requested review from dcorbacho, dumbbell and the-mikedavis September 15, 2025 12:10

ansd mentioned this pull request Sep 15, 2025

Speed up fanout exchange #14546

Merged

ansd marked this pull request as draft September 15, 2025 15:59

ansd closed this Sep 16, 2025

mergify bot mentioned this pull request Sep 17, 2025

Speed up fanout exchange (backport #14546) #14563

Merged
