Description
When activate() is called in {A,B,C}, then all members connect to the UpgradeServer, register their view and set active=true. This may cause the following issue:
- A and B are done, `active` is `true`
- C is delayed, `active` is still `false` and `registerView()` has not yet been called
- C sends a message to B. This succeeds because the message is sent via the JGroups stack (not via `UPGRADE`, as `active==false`), and B receives it via the JGroups stack. However, B cannot get a response back to C: B sends it via `UPGRADE`, but C hasn't yet called `registerView()`, which is what enables the UpgradeServer to forward B's response to C. C therefore never receives the response.
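The failure can be reproduced with a small model. This is only a sketch: `Node`, `Transport` and `delivered()` are hypothetical names made up for illustration, and the only assumptions taken from the description are that the sender picks the transport from its own `active` flag and that a member can receive via `UPGRADE` only after it has registered.

```java
// Hypothetical transports: the regular JGroups stack or the UpgradeServer.
enum Transport { JGROUPS, UPGRADE }

// Illustrative model of a member, not the real UPGRADE protocol class.
class Node {
    boolean registered; // true once registerView() has been called
    boolean active;     // true once the member sends via UPGRADE

    // The sender decides the transport based on its own 'active' flag.
    Transport transport() {
        return active ? Transport.UPGRADE : Transport.JGROUPS;
    }

    // JGroups delivery always works; UPGRADE delivery needs registration.
    boolean canReceive(Transport t) {
        return t == Transport.JGROUPS || registered;
    }

    // Would a message from 'sender' reach 'receiver'?
    static boolean delivered(Node sender, Node receiver) {
        return receiver.canReceive(sender.transport());
    }
}
```

With B registered and active, and C neither, `delivered(c, b)` is true but `delivered(b, c)` is false, which is the lost response described above. Once C is registered, the response from B gets through even though C is not yet active.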
We therefore need to ensure that everyone is registered with the UpgradeServer, before switching to use of UPGRADE:
- In a first phase, `registerView()` in all members makes sure that everyone can send/receive messages to/from the UpgradeServer.
- Only when everyone has successfully registered can we switch to using `UPGRADE` by setting `active=true`. If the first phase doesn't complete successfully, an exception will be thrown and the second phase will not be started, which means that the switch to `UPGRADE` will not be made.
The second phase does not need to be synchronous: since everyone is connected with the UpgradeServer and JGroups, messages can be sent via JGroups or UPGRADE and will be received all the same! For example, a member might not yet be active, therefore a message is sent via JGroups. The recipient receives it via JGroups, but might send the response via UPGRADE, as it is already active. The original sender will then receive the response via UPGRADE, as it registered with the UpgradeServer in the first phase.
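As a rough illustration of the two phases, the following sketch runs both loops in a single JVM. `Member`, `TwoPhaseActivation` and the `failRegistration` flag are hypothetical and exist only to make the example runnable; in the real system each phase would be a cluster-wide operation rather than a local loop.

```java
import java.util.List;

// Hypothetical stand-in for a cluster member. Only registerView() and the
// 'active' flag come from the description; everything else is illustrative.
class Member {
    final String name;
    final boolean failRegistration; // simulates a failing first phase
    boolean registered;             // true once registerView() succeeded
    boolean active;                 // true once messages go via UPGRADE

    Member(String name, boolean failRegistration) {
        this.name = name;
        this.failRegistration = failRegistration;
    }

    void registerView() {
        if (failRegistration)
            throw new IllegalStateException(name + " could not register with the UpgradeServer");
        registered = true;
    }

    void setActive(boolean flag) {
        active = flag;
    }
}

class TwoPhaseActivation {
    static void activate(List<Member> members) {
        // Phase 1: everyone registers. An exception here aborts the method,
        // so phase 2 is never started and nobody switches to UPGRADE.
        for (Member m : members)
            m.registerView();

        // Phase 2: everyone switches to UPGRADE. This phase does not need to
        // be synchronous, since all members can already receive on both paths.
        for (Member m : members)
            m.setActive(true);
    }
}
```

If any `registerView()` call throws, no member ends up with `active=true`, which matches the requirement that the switch to UPGRADE is not made after a failed first phase.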
This issue does not cause incorrect behavior in Infinispan, as an RPC would simply time out (e.g. in the above example). However, fixing it reduces the number of failures, which is important when we do a rolling upgrade during heavy traffic.
The same is true for deactivate(): because it is not synchronous (i.e., not received by all members at the same time), the following can happen:
- All members (A,B,C) are active
- `deactivate()` is called
- `A` is the coordinator of the global view and would install the new MergeView locally
- However, `C` receives `deactivate()` first and disconnects
- Because `A` is still active, it gets a new view `{A,B}` from the UpgradeServer!
-> We therefore have to activate/deactivate in 2 phases:
Solution for `activate()`:
- All members register with UpgradeServer. Now, they can receive messages either via UpgradeServer or still locally
- When this is done (and confirmed): all members switch `active` to `true`
  - Because this phase is not synchronous, some members might activate before others. However, this is not an issue, as members can temporarily receive messages through the local channel before switching to UpgradeServer
Solution for `deactivate()`:
- All members set `active` to `false`. This means that members send messages via the local channel, but are still able to receive messages via UpgradeServer. However, view changes from UpgradeServer are ignored.
- When this is done, members disconnect from UpgradeServer.
  - TBD: we need to make sure that a member has no pending messages sent via UpgradeServer.
  - TBD: perhaps don't disconnect a member at all; when restarted without UPGRADE, the connection to UpgradeServer will be torn down anyway
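The deactivation order can be sketched the same way. `UpgradeMember`, `TwoPhaseDeactivation` and `handleServerView()` are hypothetical names, and views are modelled as plain strings; the point is only that a member which is already inactive ignores view changes from the UpgradeServer while it is still connected.

```java
import java.util.List;

// Hypothetical member model for the deactivation path; not the real API.
class UpgradeMember {
    final String name;
    boolean connected = true; // still connected to the UpgradeServer
    boolean active = true;    // still sending via UPGRADE
    String view;              // last installed view, as a plain string

    UpgradeMember(String name, String initialView) {
        this.name = name;
        this.view = initialView;
    }

    // View change pushed by the UpgradeServer: ignored once inactive.
    void handleServerView(String newView) {
        if (active)
            view = newView;
    }
}

class TwoPhaseDeactivation {
    // Phase 1: stop sending via UPGRADE, but stay connected so that
    // messages already in flight can still be received.
    static void phase1(List<UpgradeMember> members) {
        for (UpgradeMember m : members)
            m.active = false;
    }

    // Phase 2: only after phase 1 is done everywhere, disconnect.
    static void phase2(List<UpgradeMember> members) {
        for (UpgradeMember m : members)
            m.connected = false;
    }
}
```

After `phase1`, a call such as `a.handleServerView("{A,B}")` leaves A's view untouched, so a member disconnecting early in `phase2` no longer causes a spurious `{A,B}` view on A. The two TBD items above (pending messages, and whether to disconnect at all) are not addressed by this sketch.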