```bash
sudo pcs property set maintenance-mode=false
```

## Scenario 5: Fenced node fails to rejoin the cluster

### Symptom for scenario 5

After the fencing operation completes, the affected node typically doesn't rejoin the Pacemaker cluster. Both the pacemaker and corosync services remain stopped until they're started manually to bring the node back online.
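
In that state, to bring the fenced node back online, you typically have to start the cluster services on it manually, for example:

```bash
# Start the corosync and pacemaker services on the local (fenced) node.
sudo pcs cluster start
```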

### Cause for scenario 5

After the node was fenced and rebooted and its cluster services were restarted, it received a "We were allegedly just fenced" message. This message caused the node to shut down its pacemaker and corosync services and prevented it from rejoining the cluster. In this example, node1 initiated a STONITH action against node2. After the network issue was resolved, node2 rejoined the corosync membership at `03:27:25`, and a new two-node membership was formed, as shown in `/var/log/messages` on node1:

```bash
Feb 20 03:26:56 node1 corosync[1722]: [TOTEM ] A processor failed, forming new configuration.
Feb 20 03:27:23 node1 corosync[1722]: [TOTEM ] A new membership (1.116f4) was formed. Members left: 2
Feb 20 03:27:24 node1 corosync[1722]: [QUORUM] Members[1]: 1
...
Feb 20 03:27:24 node1 pacemaker-schedulerd[1739]: warning: Cluster node node2 will be fenced: peer is no longer part of the cluster
...
Feb 20 03:27:24 node1 pacemaker-fenced[1736]: notice: Delaying 'reboot' action targeting node2 using for 20s
Feb 20 03:27:25 node1 corosync[1722]: [TOTEM ] A new membership (1.116f8) was formed. Members joined: 2
Feb 20 03:27:25 node1 corosync[1722]: [QUORUM] Members[2]: 1 2
Feb 20 03:27:25 node1 corosync[1722]: [MAIN ] Completed service synchronization, ready to provide service.
```
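
To inspect the membership and quorum state that these `[TOTEM]` and `[QUORUM]` messages describe, you can run the corosync quorum tool on a surviving node. This is a general diagnostic, not part of the original trace:

```bash
# Display quorum status, vote counts, and the current corosync membership.
sudo corosync-quorumtool -s
```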

node1 received confirmation that node2 had been successfully rebooted, as shown in `/var/log/messages` on node1:

```bash
Feb 20 03:27:46 node1 pacemaker-fenced[1736]: notice: Operation 'reboot' [43895] (call 28 from pacemaker-controld.1740) targeting node2 using xvm2 returned 0 (OK)
```
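
On recent pcs versions, you can also review past fencing actions such as this one. This fencing history check is optional, and availability depends on your pcs version:

```bash
# List fencing actions that targeted node2, as recorded by the fencer.
sudo pcs stonith history show node2
```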

To fully complete the STONITH action, the system needed to deliver the confirmation message to every node. Because node2 rejoined the group at `03:27:25`, before the token and consensus timeouts expired and before a new membership that excluded node2 could form, the confirmation message was delayed until node2 restarted its cluster services after the boot. Upon receiving the message, node2 recognized that it had been fenced and consequently shut down its services, as shown in the following logs.

`/var/log/messages` on node1:

```bash
Feb 20 03:29:02 node1 corosync[1722]: [TOTEM ] A processor failed, forming new configuration.
Feb 20 03:29:10 node1 corosync[1722]: [TOTEM ] A new membership (1.116fc) was formed. Members joined: 2 left: 2
Feb 20 03:29:10 node1 corosync[1722]: [QUORUM] Members[2]: 1 2
Feb 20 03:29:10 node1 pacemaker-fenced[1736]: notice: Operation 'reboot' targeting node2 by node1 for pacemaker-controld.1740@node1: OK
Feb 20 03:29:10 node1 pacemaker-controld[1740]: notice: Peer node2 was terminated (reboot) by node1 on behalf of pacemaker-controld.1740: OK
...
Feb 20 03:29:11 node1 corosync[1722]: [CFG ] Node 2 was shut down by sysadmin
Feb 20 03:29:11 node1 corosync[1722]: [TOTEM ] A new membership (1.11700) was formed. Members left: 2
Feb 20 03:29:11 node1 corosync[1722]: [QUORUM] Members[1]: 1
Feb 20 03:29:11 node1 corosync[1722]: [MAIN ] Completed service synchronization, ready to provide service.
```

`/var/log/messages` on node2:

```bash
Feb 20 03:29:11 [1155] node2 corosync notice [TOTEM ] A new membership (1.116fc) was formed. Members joined: 1
Feb 20 03:29:09 node2 pacemaker-controld [1323] (tengine_stonith_notify) crit: We were allegedly just fenced by node1 for node1!
```
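
The window in which the fenced node can rejoin before a new membership forms is governed by the corosync token and consensus timeouts mentioned earlier. As a quick check (key names can vary between corosync versions), you can read the effective runtime values:

```bash
# Show the effective totem token and consensus timeouts (values in milliseconds).
sudo corosync-cmapctl | grep -E 'totem\.(token|consensus)'
```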

### Resolution for scenario 5

Configure a startup delay for the corosync service. This pause provides sufficient time for a new closed process group (CPG) membership to form that excludes the fenced node, so that the STONITH reboot action can complete and the confirmation message can reach all nodes in the membership.
717
+
718
+
To achieve this, executing the following commands:
719
+
720
+
1. Put the cluster into maintenance mode:
721
+
722
+
```bash
723
+
sudo pcs property set maintenance-mode=true
724
+
```
725
+
2. Create a systemd drop-in file on all the nodes in the cluster:
726
+
727
+
- Edit the corosync file:
728
+
```bash
729
+
sudo systemctl edit corosync.service
730
+
```
731
+
- Add the following lines:
732
+
```config
733
+
[Service]
734
+
ExecStartPre=/bin/sleep 60
735
+
```
736
+
- After saving and exiting the text editor, reload the systemd manager configuration with:
737
+
```bash
738
+
sudo systemctl daemon-reload
739
+
```
740
+
3. Remove the cluster out of maintenance mode:
741
+
```bash
742
+
sudo pcs property set maintenance-mode=false
743
+
```
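
As a quick check (optional, and not part of the Red Hat procedure), you can confirm on each node that systemd picked up the drop-in:

```bash
# Print the unit file together with any drop-in overrides; the sleep should appear under [Service].
sudo systemctl cat corosync.service

# Show the effective ExecStartPre entries for the unit.
sudo systemctl show corosync.service -p ExecStartPre
```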

For more information, see [Fenced Node Fails to Rejoin Cluster Without Manual Intervention](https://access.redhat.com/solutions/5644441).

## Next steps
For additional help, open a support request by using the following instructions. When you submit your request, attach the [SOS report](https://access.redhat.com/solutions/3592) from all the nodes in the cluster for troubleshooting.