
Conversation


mergify bot commented on Aug 13, 2024

Transient queue deletion previously caused a crash if Khepri was enabled and a node with a transient queue went down while its cluster was in a minority. We need to handle the `{error,timeout}` return possible from `rabbit_db_queue:delete_transient/1`. In the `rabbit_amqqueue:on_node_down/1` callback we log a warning when we see this return.
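A minimal sketch of the shape of that handling (the filter argument, the success-tuple shape, and the helper names are assumptions, not the actual `rabbit_amqqueue` code):

```erlang
%% Sketch only: illustrates the {error, timeout} handling described above.
on_node_down(Node) ->
    Filter = fun(Q) -> transient_queue_on_node(Q, Node) end, %% hypothetical
    case rabbit_db_queue:delete_transient(Filter) of
        {error, timeout} ->
            %% In a minority the Khepri command cannot commit. Warn and
            %% move on; the deletion is retried when the down node boots.
            rabbit_log:warning(
              "Could not delete transient queues for down node ~ts: "
              "the Khepri command timed out (cluster likely in minority)",
              [Node]),
            ok;
        {ok, Deleted} ->
            notify_queue_deletions(Deleted) %% hypothetical follow-up
    end.
```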

We then try this deletion again during that node's `rabbit_khepri:init/0`, which is called from a boot step after `rabbit_khepri:setup/0`. At that point we can return an error and halt the node's boot if the command times out. The cluster is very likely to be in a majority at that point, since `rabbit_khepri:setup/0` waits for a leader to be elected (which requires a majority).
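A sketch of that retry; the helper name is invented, and only the behavior of halting boot on an error return comes from the description above:

```erlang
%% Sketch only. init/0 runs in a boot step after rabbit_khepri:setup/0
%% has waited for a Khepri leader, so a majority normally exists here.
init() ->
    %% ... existing Khepri init work elided ...
    case delete_transient_queues_on_this_node() of %% hypothetical helper
        ok ->
            ok;
        {error, timeout} = Err ->
            %% Returning an error from a boot step halts the node's boot:
            %% better to fail loudly than to leave stale transient queues.
            Err
    end.
```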

This fixes a crash report found in the `cluster_minority_SUITE`'s `end_per_group`.


This is an automatic backport of pull request #11979 done by [Mergify](https://mergify.com).

The prior code skirted transactions because the filter function might
cause Khepri to call itself. We want to use the same idea as the old
code - get all queues, filter them, then delete them - but we want to
perform the deletion in a transaction and fail the transaction if any
queues changed since we read them.
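A sketch of that pattern against the `khepri_tx` transaction API (the reader, path helper, and transaction wrapper here are assumptions, not the actual `rabbit_db_queue` code):

```erlang
%% Sketch only. Read and filter outside the transaction, since the filter
%% fun may itself call into Khepri; then delete inside a single
%% transaction, aborting if any queue changed since it was read.
delete_transient(FilterFun) ->
    {ok, AllQs} = rabbit_db_queue:get_all(),      %% assumed reader
    ToDelete = [Q || Q <- AllQs, FilterFun(Q)],
    rabbit_khepri:transaction(                    %% assumed wrapper
      fun() ->
              lists:foreach(
                fun(Q) ->
                        Path = queue_path(amqqueue:get_name(Q)), %% assumed
                        case khepri_tx:get(Path) of
                            {ok, Q} ->
                                %% Unchanged since our read: safe to delete.
                                ok = khepri_tx:delete(Path);
                            _Changed ->
                                %% Changed or gone: fail the whole txn.
                                khepri_tx:abort(queues_changed)
                        end
                end, ToDelete)
      end).
```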

This fixes a bug - the call to `delete_in_khepri/2` could return an
error tuple that would be improperly recognized as `Deletions` - and it
should also make deleting transient queues atomic and fast. Previously,
each call to `delete_in_khepri/2` had to wait on Ra to replicate,
because each deletion was an individual command sent from one process.
Performing all deletions at once means we only need to wait for one
command to be replicated across the cluster.
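To illustrate the bug class with invented names (not the actual code), compare an unconditional bind with an explicit match:

```erlang
%% Before (sketch): whatever delete_in_khepri/2 returns is bound as if it
%% were a deletions structure, so {error, timeout} flows into later code.
delete_before(Q) ->
    Deletions = delete_in_khepri(Q, false),
    process_deletions(Deletions). %% hypothetical consumer

%% After (sketch): the error tuple is matched explicitly and bubbled up.
delete_after(Q) ->
    case delete_in_khepri(Q, false) of
        {error, timeout} = Err -> Err;
        Deletions              -> process_deletions(Deletions)
    end.
```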

We also bubble up any deletion errors now rather than storing them as
deletions. This fixes a crash that occurs on node down when Khepri is
in a minority.

(cherry picked from commit 0dd26f0)

(cherry picked from commit 3f734ef)
the-mikedavis merged commit 3a26277 into v4.0.x on Aug 13, 2024
the-mikedavis deleted the mergify/bp/v4.0.x/pr-11979 branch on August 13, 2024 at 19:57
michaelklishin added a commit that referenced this pull request Aug 14, 2024
Handle transient queue deletion in Khepri minority (backport #11979) (backport #11990)