@acogoluegnes
Contributor

[Why]
When single active consumer (SAC) groups are re-evaluated after a connection `DOWN` message, the stream SAC coordinator may register `mod_call` Ra effects that send messages to other connections. These messages aim to activate or deactivate consumers.

This works when connections die one at a time. When a node goes down, however, several connections may go down together, and the stream SAC coordinator may send messages to connections that are already dead. SAC groups can then get stuck with only inactive consumers, because the coordinator considers only one connection during a group evaluation.

[How]
While evaluating the consumers of a SAC group after a `DOWN` message, the stream SAC coordinator not only removes the consumers of the "current" dead connection, but also checks whether the other consumer connections in the group are still alive and removes the consumers of those that are not.

These consumers are removed preemptively, so they are not considered when the new active consumer is chosen.
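
The difference between the two evaluation strategies can be sketched with a small model. This is a simplified Python illustration, not the coordinator's Erlang code: `Consumer`, `evaluate_naive`, `evaluate_pruned`, and the `is_alive` callback are hypothetical names, and the real coordinator tracks more state than an `active` flag.

```python
from typing import Callable, List, NamedTuple, Tuple


class Consumer(NamedTuple):
    connection: str  # hypothetical connection identifier
    active: bool


Effect = Tuple[str, str]  # (message kind, target connection)


def _elect(remaining: List[Consumer]) -> Tuple[List[Consumer], List[Effect]]:
    """Pick the first consumer as active if the group has no active one."""
    if not remaining or any(c.active for c in remaining):
        return remaining, []
    first = remaining[0]._replace(active=True)
    return [first] + remaining[1:], [("activate", first.connection)]


def evaluate_naive(group: List[Consumer], down_conn: str):
    """Remove only the consumers of the connection named in the DOWN message."""
    remaining = [c for c in group if c.connection != down_conn]
    return _elect(remaining)


def evaluate_pruned(group: List[Consumer], down_conn: str,
                    is_alive: Callable[[str], bool]):
    """Also remove consumers whose connection is reported dead, then elect."""
    remaining = [
        c for c in group
        if c.connection != down_conn and is_alive(c.connection)
    ]
    return _elect(remaining)


# Connections c1 and c2 live on the node that went down; c3 is elsewhere.
group = [Consumer("c1", True), Consumer("c2", False), Consumer("c3", False)]
alive = {"c3"}

_, naive_effects = evaluate_naive(group, "c1")
_, pruned_effects = evaluate_pruned(group, "c1", lambda conn: conn in alive)
```

In this model, the naive evaluation addresses its activation message to `c2`, a connection that is already dead, while the pruned evaluation addresses it to `c3`.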

@acogoluegnes added this to the 4.1.0 milestone on Apr 2, 2025
@acogoluegnes
Contributor Author

Checking whether a process is alive is non-deterministic, so it cannot be used inside the Ra state machine. Closing.
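
To illustrate the concern, here is a toy Python model of two replicas applying the same `DOWN` command. It assumes that every replica must reach the same state from the same command log; `apply_down_with_liveness` and `apply_down_deterministic` are hypothetical names, not functions from the coordinator.

```python
from typing import Callable, List


def apply_down_with_liveness(group: List[str], down_conn: str,
                             is_alive: Callable[[str], bool]) -> List[str]:
    """Result depends on a local liveness check, not only on the command."""
    return [c for c in group if c != down_conn and is_alive(c)]


def apply_down_deterministic(group: List[str], down_conn: str) -> List[str]:
    """Result depends only on the previous state and the command."""
    return [c for c in group if c != down_conn]


group = ["c1", "c2", "c3"]

# Replica A already sees c2 as dead; replica B has not noticed yet.
state_a = apply_down_with_liveness(group, "c1", lambda c: c != "c2")
state_b = apply_down_with_liveness(group, "c1", lambda c: True)
diverged = state_a != state_b  # the replicas now disagree on the group

# Without the liveness check, both replicas compute the same state.
same_a = apply_down_deterministic(group, "c1")
same_b = apply_down_deterministic(group, "c1")
```

The liveness-based version gives different group contents depending on when and where the command is applied, which is why such a check cannot sit in the state transition itself.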
