activate() and deactivate() need to be done in 2 phases #5

@belaban

Description

When activate() is called in {A,B,C}, then all members connect to the UpgradeServer, register their view and set active=true. This may cause the following issue:

  • A and B are done, active is true
  • C is delayed, active is still false and registerView() has not yet been called
  • C sends a message to B. This succeeds because the message is sent via the JGroups stack (not via UPGRADE, as active==false on C), and B receives it via the JGroups stack. However, B cannot get a response back to C: B is already active and therefore sends the response via UPGRADE, but C has not yet called registerView(), so the UpgradeServer cannot forward B's response to C.

We therefore need to ensure that everyone is registered with the UpgradeServer, before switching to use of UPGRADE:

  • In a first phase, registerView() in all members makes sure that everyone can send/receive messages to/from the UpgradeServer.
  • Only when everyone has successfully registered can we switch to using UPGRADE by setting active=true. If the first phase doesn't complete successfully, an exception is thrown and the second phase is not started, which means that the switch to UPGRADE is not made.

The second phase does not need to be synchronous: since everyone is connected with the UpgradeServer and JGroups, messages can be sent via JGroups or UPGRADE and will be received all the same! For example, a member might not yet be active, therefore a message is sent via JGroups. The recipient receives it via JGroups, but might send the response via UPGRADE, as it is already active. The original sender will then receive the response via UPGRADE, as it registered with the UpgradeServer in the first phase.
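The routing argument above can be checked with a small model. This is only a sketch: RoutingMember, RoutingModel and their fields are hypothetical names invented for illustration, not classes from the UPGRADE protocol, and the model assumes the JGroups path is always available.

```java
// Hypothetical model of a cluster member; not the actual UPGRADE protocol class.
final class RoutingMember {
    final String name;
    final boolean registered; // true once registerView() has completed
    final boolean active;     // true once the member sends via UPGRADE

    RoutingMember(String name, boolean registered, boolean active) {
        this.name = name;
        this.registered = registered;
        this.active = active;
    }
}

final class RoutingModel {
    // A message reaches its destination if it travels via JGroups (sender not
    // active), or via the UpgradeServer to a destination that has registered.
    static boolean delivered(RoutingMember from, RoutingMember to) {
        if (!from.active)
            return true;       // plain JGroups path (assumed always available)
        return to.registered;  // UpgradeServer can only forward to registered members
    }

    // A request/response round trip succeeds only if both legs are delivered.
    static boolean roundTrip(RoutingMember requester, RoutingMember responder) {
        return delivered(requester, responder) && delivered(responder, requester);
    }

    public static void main(String[] args) {
        RoutingMember b = new RoutingMember("B", true, true);

        // Single-phase activate: C is delayed and has not registered yet.
        RoutingMember lateC = new RoutingMember("C", false, false);
        System.out.println(roundTrip(lateC, b)); // prints "false"

        // Two-phase activate: C has registered but is not yet active.
        RoutingMember registeredC = new RoutingMember("C", true, false);
        System.out.println(roundTrip(registeredC, b)); // prints "true"
    }
}
```

In this model the round trip fails only when the responder is active and the requester is unregistered, which is exactly the state that the first phase rules out.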

This issue would not cause incorrect behavior in Infinispan, as an RPC would simply time out (e.g. in the above example). However, fixing it reduces the number of failures, which is important when we do a rolling upgrade during heavy traffic.

The same is true for deactivate(): because it is not synchronous (i.e., not received by all members at the same time), the following can happen:

  • All members (A,B,C) are active
  • deactivate() is called
  • A is the coordinator of the global view and would install the new MergeView locally
  • However, C receives deactivate() first and disconnects
  • Because A is still active, it gets a new view {A,B} from the UpgradeServer!

-> We therefore have to activate/deactivate in 2 phases:

Solution for activate():

  • All members register with UpgradeServer. Now, they can receive messages either via UpgradeServer or still locally
  • When this is done (and confirmed): all members switch active to true
    • Because this phase is not synchronous, some members might activate before others. However, this is not an issue, as members can temporarily still receive messages through the local channel before switching to UpgradeServer
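A minimal sketch of the two activate() phases follows. ActivatingMember, TwoPhaseActivate and the registrationFails flag are hypothetical and exist only for this illustration; a real implementation would issue these calls as RPCs to remote members rather than looping over local objects.

```java
import java.util.List;

// Hypothetical stand-in for a member running UPGRADE; names are illustrative.
final class ActivatingMember {
    final String name;
    final boolean registrationFails; // simulates a member that cannot reach the UpgradeServer
    boolean registered;
    boolean active;

    ActivatingMember(String name, boolean registrationFails) {
        this.name = name;
        this.registrationFails = registrationFails;
    }

    // Phase 1: connect to the UpgradeServer and register the local view.
    void registerView() {
        if (registrationFails)
            throw new IllegalStateException(name + " could not register with the UpgradeServer");
        registered = true;
    }

    // Phase 2: start sending via UPGRADE.
    void setActive() {
        active = true;
    }
}

final class TwoPhaseActivate {
    static void activate(List<ActivatingMember> members) {
        // Phase 1 must complete on every member; an exception aborts before phase 2.
        for (ActivatingMember m : members)
            m.registerView();
        // Phase 2 may reach members at different times; the order does not matter
        // because every member is already registered.
        for (ActivatingMember m : members)
            m.setActive();
    }
}
```

If any registerView() call throws, activate() exits before the second loop runs, so no member has active set to true and the cluster keeps using the plain JGroups stack.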

Solution for deactivate():

  • All members set active to false. This means that members send messages via the local channel, but are still able to receive messages via UpgradeServer. However, view changes from UpgradeServer are ignored.
  • When this is done, members disconnect from UpgradeServer. TBD: we need to make sure that a member has no pending messages sent via UpgradeServer. TBD: perhaps don't disconnect a member at all; when restarted without UPGRADE, the connection to UpgradeServer will be torn down anyway
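The two deactivate() phases could be sketched as below. All names (DeactivatingMember, TwoPhaseDeactivate, onServerView) are hypothetical, and the sketch does not address either TBD: it neither waits for pending messages nor decides whether the disconnect should be skipped.

```java
import java.util.List;

// Hypothetical stand-in for a member running UPGRADE; names are illustrative.
final class DeactivatingMember {
    final String name;
    boolean active = true;     // currently sending via UPGRADE
    boolean connected = true;  // currently connected to the UpgradeServer
    List<String> view;

    DeactivatingMember(String name, List<String> view) {
        this.name = name;
        this.view = view;
    }

    // Phase 1: go back to sending via the local channel.
    void setInactive() {
        active = false;
    }

    // Phase 2: drop the connection to the UpgradeServer.
    void disconnect() {
        connected = false;
    }

    // Called when the UpgradeServer pushes a view; ignored once the member is
    // inactive. Returns true if the view was installed.
    boolean onServerView(List<String> newView) {
        if (!active)
            return false;
        view = newView;
        return true;
    }
}

final class TwoPhaseDeactivate {
    static void phaseOne(List<DeactivatingMember> members) {
        for (DeactivatingMember m : members)
            m.setInactive();
    }

    static void phaseTwo(List<DeactivatingMember> members) {
        for (DeactivatingMember m : members)
            m.disconnect();
    }
}
```

After phaseOne() a member is still connected, so it can receive late messages from the UpgradeServer, but a view such as {A,B} caused by another member disconnecting early is no longer installed.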
