@ansd
Member

@ansd ansd commented Sep 15, 2025

Fixes #14533

This PR fixes the bug for clusters created in 4.2 or later.
For clusters upgraded from <4.2 to >=4.2, the bug still exists, which may be acceptable given how rarely the conditions that trigger it arise. To fix the bug for clusters upgraded from <4.2, RabbitMQ needs a way to migrate projections. One option is to unregister and re-register the projection in the `rabbitmq_4.2.0` feature flag migration callback. This would require blocking routing while the migration function runs.

This commit adds a test case for a regression/bug that occurs in Khepri.
```
make -C deps/rabbit ct-bindings t=cluster:binding_args RABBITMQ_METADATA_STORE=mnesia
```
succeeds, but
```
make -C deps/rabbit ct-bindings t=cluster:binding_args RABBITMQ_METADATA_STORE=khepri
```
fails.

The problem is that the ETS table `rabbit_khepri_index_route` cannot
differentiate between two bindings with different binding arguments, and
therefore deletes entries too early, leading to wrong routing decisions.

The solution to this bug is to include the binding arguments in the
`rabbit_khepri_index_route` projection, similar to how the binding args
are also included in the `rabbit_index_route` Mnesia table.
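
To make the failure mode concrete, here is a minimal toy model in Python. It is not RabbitMQ code and does not reflect the actual ETS record layout, which this thread does not show. The class `RouteIndex` and the function `demo` are hypothetical names, and the model assumes that an index without binding arguments collapses two bindings that differ only in their arguments into a single entry.

```python
from collections import defaultdict


class RouteIndex:
    """Toy stand-in for a route index table (hypothetical, not RabbitMQ code)."""

    def __init__(self, include_args):
        # When False, entries ignore binding arguments (models the buggy index).
        # When True, binding arguments are part of each entry (models the fix).
        self.include_args = include_args
        # Maps (exchange, routing_key) -> set of index entries.
        self.table = defaultdict(set)

    def _entry(self, destination, args):
        # Freeze the args dict so that the entry is hashable.
        frozen = tuple(sorted(args.items()))
        return (destination, frozen) if self.include_args else (destination,)

    def add_binding(self, exchange, routing_key, destination, args):
        self.table[(exchange, routing_key)].add(self._entry(destination, args))

    def delete_binding(self, exchange, routing_key, destination, args):
        self.table[(exchange, routing_key)].discard(self._entry(destination, args))

    def route(self, exchange, routing_key):
        # Binding arguments never influence the routing result in this model;
        # they only keep otherwise identical entries apart.
        entries = self.table.get((exchange, routing_key), set())
        return sorted({entry[0] for entry in entries})


def demo(include_args):
    index = RouteIndex(include_args)
    # Two bindings to the same queue that differ only in their arguments.
    index.add_binding("amq.direct", "k", "q1", {"app-id": "a"})
    index.add_binding("amq.direct", "k", "q1", {"app-id": "b"})
    # Deleting one binding should leave the other one routable.
    index.delete_binding("amq.direct", "k", "q1", {"app-id": "a"})
    return index.route("amq.direct", "k")


print(demo(include_args=False))  # [] : the surviving binding is no longer routable
print(demo(include_args=True))   # ['q1'] : the surviving binding still routes
```

In this sketch the arguments act purely as a discriminator between entries, which matches the note below that `direct` and `fanout` exchanges ignore them when routing.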

This bug/regression is an edge case. It occurs if the source exchange
type is `direct` or `fanout` and client apps use different binding
arguments. Note that such binding arguments are entirely
ignored when RabbitMQ performs routing decisions for the `direct` or
`fanout` exchange. However, there might be client apps that use binding
arguments to attach metadata to the binding, for example `app-id`,
`user`, or `purpose`, and that use this metadata as a form of reference
counting when deciding when to delete `auto-delete` exchanges, or just for
informational/operational purposes.

Prior to this commit, this test case failed in 4.2 <-> 4.1 mixed-version
mode because the different nodes register different projections.
Specifically, the old projection of the 4.1 node was registered.

Independent of the test case, even when a rolling upgrade from 4.1 to
this commit's branch completes, the old projection is still registered.

It seems that what's missing is a Khepri machine version migration in which the
projection is migrated from old to new. But that's outside the
scope of this bug fix. We can add this mechanism separately.

@ansd ansd marked this pull request as ready for review September 15, 2025 12:10
@dumbbell
Collaborator

@ansd says in an internal chat:

> The PR fixes this issue for clusters created in 4.2+. However, if you upgrade from 4.1 with Khepri enabled previously, the bug still occurs. That's because the old projection table is still being used. The question is how do you upgrade projections?
>
> One option I can think of: deregister the projection and then register again in the `rabbitmq_4.2.0` feature flag migration function. That means we need to block routing though while the migration function runs.
>
> Another option I can think of: run a migration in the `machine_version` upgrade. But it seems this API is meant to be private to the Khepri app and not meant to be hooked into by the rabbit app.
>
> Yet another option for this specific PR: Provide the fix as is for new clusters created in 4.2+. Given this is a rare edge case, we can leave it unfixed for clusters created in 4.1.

Another option is to create a new projection; in other words, rename the current one from `rabbit_khepri_index_route` to `rabbit_khepri_index_route_v2`. The node running RabbitMQ 4.2 will get the new fixed projection and an old node will continue to use the old projection. We could then unregister the old projection in the migration function of the `rabbitmq_4.2.0` feature flag. I think this is a good middle ground: the old node continues to suffer from the bug, but as you said, that’s probably ok because it is a rare use case.

@the-mikedavis: Do you have an opinion?

@ansd ansd mentioned this pull request Sep 15, 2025
@the-mikedavis
Collaborator

Adding a v2 and eventually unregistering the v1 in rabbitmq_4.2.0 also sounds like a good middle-ground to me 👍

@ansd
Member Author

ansd commented Sep 15, 2025

Thank you! Just so that I understand correctly: Should the v2 projection be registered upon node boot? This means it becomes available on the old nodes too in a mixed-version cluster, but won't cause any harm there, right?

@dumbbell
Collaborator

Yes, I confirm that the newer node will initialise the new projection on boot and the ETS table will appear on all nodes, including an old one. That’s ok because the old node won’t use the new projection. The new node just needs to support the new projection.
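
For readers following along, the v2 approach discussed above can be sketched as a small Python model. This is only an illustration under stated assumptions, not Khepri or RabbitMQ code: the `Cluster` class and its methods are hypothetical, each node is assumed to register only the projection its version knows about, and the feature flag is assumed to be enabled only once every node runs 4.2 or later.

```python
V1 = "rabbit_khepri_index_route"
V2 = "rabbit_khepri_index_route_v2"


class Cluster:
    """Toy model of projection registration in a mixed-version cluster."""

    def __init__(self):
        self.registered = set()   # projections registered cluster-wide
        self.node_versions = {}   # node name -> (major, minor)

    def boot(self, node, version):
        # Assumption: a booting node registers the projection its version knows.
        self.node_versions[node] = version
        self.registered.add(V2 if version >= (4, 2) else V1)

    def tables_on(self, node):
        # A registered projection shows up as a table on every node,
        # including nodes that never read from it.
        return sorted(self.registered)

    def table_used_by(self, node):
        # Old nodes keep reading the old table; new nodes read the fixed one.
        return V2 if self.node_versions[node] >= (4, 2) else V1

    def enable_rabbitmq_4_2_0(self):
        # Assumption: the flag is only enabled once all nodes run 4.2+.
        if any(v < (4, 2) for v in self.node_versions.values()):
            raise RuntimeError("all nodes must run 4.2+ first")
        # The migration function drops the old projection.
        self.registered.discard(V1)


cluster = Cluster()
cluster.boot("old", (4, 1))
cluster.boot("new", (4, 2))
print(cluster.tables_on("old"))      # both tables exist on the old node
print(cluster.table_used_by("old"))  # rabbit_khepri_index_route
print(cluster.table_used_by("new"))  # rabbit_khepri_index_route_v2

cluster.boot("old", (4, 2))          # the old node is upgraded and reboots
cluster.enable_rabbitmq_4_2_0()
print(cluster.tables_on("old"))      # only the v2 table remains
```

The point of the sketch is that the old table stays in place until the feature flag migration runs, so no node ever reads from a table that is being replaced underneath it.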

@ansd ansd marked this pull request as draft September 15, 2025 15:59
@ansd
Member Author

ansd commented Sep 16, 2025

Closing this PR in favour of #14546

Linked issue that merging this pull request may close: Regression with Khepri and binding arguments

4 participants