In the HPE Swarm Learning sync stage (defined by the sync frequency), when it is time to share the learning from the individual models, one of the Swarm Learning (SL) nodes is designated as the “leader” node. The leader node collects the individual models from each peer node and merges them into a single model by combining the parameters of all the individual models. The **Leader Failure Detection and Recovery (LFDR)** feature enables SL nodes to continue Swarm training when the SL leader node fails during the merging process: a new SL leader node is selected to continue the merge. If the failed SL leader node comes back after the new SL leader node has taken over, it is treated as a normal SL node and contributes its learning to the swarm global model.
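
The following Python sketch illustrates the idea of leader-based merging with failover. It is not the HPE Swarm Learning implementation or API. The names `merge_parameters`, `elect_leader`, and `sync_round` are hypothetical, the merge is assumed to be a plain element-wise average, and the election rule (lowest node name wins) is an assumption made only to keep the example deterministic.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Params = List[float]  # a model is reduced to a flat parameter vector here


def merge_parameters(models: Dict[str, Params]) -> Params:
    """Merge peer models by element-wise averaging (assumed merge rule)."""
    if not models:
        raise ValueError("no models to merge")
    count = len(models)
    return [sum(column) / count for column in zip(*models.values())]


def elect_leader(candidates: Set[str]) -> str:
    """Hypothetical election rule: the lowest node name becomes leader."""
    return min(candidates)


def sync_round(
    models: Dict[str, Params],
    failing_leaders: Set[str],
    recovered: FrozenSet[str] = frozenset(),
) -> Tuple[str, Params]:
    """Run one simulated sync round and return (leader, merged model).

    failing_leaders: nodes that fail while acting as leader (simulated).
    recovered: failed leaders that come back as normal nodes and
               still contribute their model to the merge.
    """
    candidates = set(models)    # nodes that may still become leader
    contributors = set(models)  # nodes whose models enter the merge
    while candidates:
        leader = elect_leader(candidates)
        if leader in failing_leaders:
            # Failure detected: this node can no longer lead the merge.
            candidates.discard(leader)
            if leader not in recovered:
                contributors.discard(leader)
            continue  # select a new leader and retry the merge
        merged = merge_parameters({n: models[n] for n in contributors})
        return leader, merged
    raise RuntimeError("no live SL node left to lead the merge")


models = {"sl-1": [1.0, 2.0], "sl-2": [3.0, 4.0], "sl-3": [5.0, 6.0]}
print(sync_round(models, failing_leaders={"sl-1"}))
# ('sl-2', [4.0, 5.0])
print(sync_round(models, failing_leaders={"sl-1"}, recovered=frozenset({"sl-1"})))
# ('sl-2', [3.0, 4.0])
```

In the first call, `sl-1` fails as leader and does not return, so `sl-2` takes over and merges only the remaining models. In the second call, `sl-1` comes back as a normal node, so its parameters are included in the merged model even though `sl-2` remains the leader.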