@liangxin1300 liangxin1300 commented Dec 24, 2025

Problems

This PR addresses the following issues introduced by #1932:

  • Cluster members are unreachable over SSH
    Commands abort when a node is unreachable over SSH
# crm cluster health sbd
ERROR: cluster.health: Failed on root@sle16-2: Cannot create SSH connection to root@sle16-2: ssh: connect to host sle16-2 port 22: Connection refused
Connection closed
  • Failed to configure crashdump-watchdog-timeout
# crm sbd configure crashdump-watchdog-timeout=30
INFO: Set crashdump option for fence_sbd resource
INFO: Set msgwait-timeout to 2*watchdog-timeout + crashdump-watchdog-timeout: 60
INFO: Update SBD_TIMEOUT_ACTION in /etc/sysconfig/sbd: flush,crashdump
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_OPTS in /etc/sysconfig/sbd: -C 30
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Adjusting sbd msgwait to 90, watchdog timeout to 15
ERROR: sbd.configure: Failed to run 'crm sbd configure msgwait-timeout=90 watchdog-timeout=15': ERROR: sbd.configure: Failed to run 'crm sbd configure msgwait-timeout=120 watchdog-timeout=15':
  • The minimum value for watchdog-timeout is always 5
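
For reference, the timeout arithmetic quoted in the logs can be written down as a small sketch. The first function restates the rule printed in the INFO line above (msgwait-timeout = 2*watchdog-timeout + crashdump-watchdog-timeout). The second only mirrors a pattern visible in the sample output later in this description (minimum msgwait 30 with minimum watchdog 15, and 10 with 5); it is an assumption, not a confirmed crmsh rule. Both function names are hypothetical and are not crmsh APIs.

```python
def expected_msgwait(watchdog_timeout: int, crashdump_watchdog_timeout: int = 0) -> int:
    """Rule quoted in the log: 2*watchdog-timeout + crashdump-watchdog-timeout."""
    return 2 * watchdog_timeout + crashdump_watchdog_timeout


def minimum_msgwait(minimum_watchdog_timeout: int) -> int:
    """Assumed pattern from the sample output: minimum msgwait is twice the minimum watchdog."""
    return 2 * minimum_watchdog_timeout


if __name__ == "__main__":
    # watchdog-timeout=15 and crashdump-watchdog-timeout=30, as in the logs
    print(expected_msgwait(15, 30))
    # minimum watchdog timeout of 15 (from profiles.yml) and the fallback of 5
    print(minimum_msgwait(15), minimum_msgwait(5))
```

With watchdog-timeout 15 and crashdump-watchdog-timeout 30, the first function gives 60, which matches the value in the INFO line.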

Changes

  • Cluster members are unreachable over SSH
    • Warn that the configuration consistency check is skipped
    • Still run the remaining checks, and raise a warning or error when a check fails
    • Report an error that the issue cannot be fixed because nodes are unreachable over SSH
# crm cluster health sbd
WARNING: Skip configuration consistency check due to unreachable nodes
ERROR: It's recommended that SBD_DELAY_START is set to 71, now is no
ERROR: Cannot fix SBD_DELAY_START issue: There are nodes whose SSH ports are unreachable: sle16-2.
Please check the network connectivity before check and fix SBD timeout configurations.
ERROR: SBD: Check sbd timeout configuration: FAIL.
  • sbd: Calculate expected msgwait timeout correctly with crashdump timeout
# crm sbd configure crashdump-watchdog-timeout=30
INFO: Set crashdump option for fence_sbd resource
INFO: Set msgwait-timeout to 2*watchdog-timeout + crashdump-watchdog-timeout: 60
INFO: Update SBD_TIMEOUT_ACTION in /etc/sysconfig/sbd: flush,crashdump
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_OPTS in /etc/sysconfig/sbd: -C 30
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 101
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Adjusting systemd start timeout for sbd.service to 121s
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
INFO: Adjusting stonith-timeout to 83
WARNING: "stonith-timeout" in crm_config is set to 83, it was 71
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
.......
INFO: END Waiting for cluster
  • ui_sbd: Get minimum timeout value dynamically
# minimum watchdog timeout is from profiles.yml
# crm sbd configure watchdog-timeout=1
ERROR: sbd.configure: The minimum value for watchdog-timeout is 15
# crm sbd configure msgwait-timeout=2
ERROR: sbd.configure: The minimum value for msgwait-timeout is 30

# when no watchdog timeout is defined in profiles.yml
# crm sbd configure watchdog-timeout=1
ERROR: sbd.configure: The minimum value for watchdog-timeout is 5
# crm sbd configure msgwait-timeout=2
ERROR: sbd.configure: The minimum value for msgwait-timeout is 10
  • Check for the runtime drop-in file that unsets SBD_DELAY_START, and create it with --fix
# crm cluster health sbd
WARNING: Runtime drop-in file /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf to unset SBD_DELAY_START is missing
INFO: Please run "crm cluster health sbd --fix" to fix the above warning
INFO: SBD: Check sbd timeout configuration: OK.

# crm cluster health sbd --fix
INFO: Createing runtime drop-in file /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf to unset SBD_DELAY_START
INFO: SBD: Check sbd timeout configuration: OK.
  • Show multiple errors or warnings at once when detected
# crm sbd configure show
...
ERROR: It's recommended that SBD_DELAY_START is set to 131, now is 31
ERROR: It's recommended that stonith-timeout is set to 119, now is 19
WARNING: Runtime drop-in file /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf to unset SBD_DELAY_START is missing
INFO: Please run "crm cluster health sbd --fix" to fix the above error
  • The checks now include:
    • /etc/corosync/corosync.conf consistency
    • /etc/sysconfig/sbd consistency
    • SBD disk metadata
    • SBD device metadata consistency
    • SBD_WATCHDOG_TIMEOUT
    • SBD_DELAY_START
    • systemd start timeout for sbd.service
    • stonith-watchdog-timeout property
    • stonith-timeout property
    • unset SBD_DELAY_START in drop-in file
    • sbd.service should be enabled
    • fence_sbd agent
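
The reporting behaviour described above (skip the consistency check with a warning when nodes are unreachable, keep running the other checks, and show every finding at once) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in crmsh/sbd.py: `Report`, `run_checks`, and the check callables are hypothetical names, and the message text is simplified.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional


@dataclass
class Report:
    """Collects findings so they can all be printed together (hypothetical type)."""
    warnings: List[str] = field(default_factory=list)
    errors: List[str] = field(default_factory=list)

    @property
    def ok(self) -> bool:
        return not self.errors


def run_checks(
    local_checks: List[Callable[[], Optional[str]]],
    unreachable_nodes: List[str],
) -> Report:
    """Run every local check and collect all problems instead of stopping at the first."""
    report = Report()
    if unreachable_nodes:
        # Cross-node consistency cannot be verified, so it is skipped with a warning.
        report.warnings.append(
            "Skip configuration consistency check due to unreachable nodes"
        )
    for check in local_checks:
        problem = check()  # each check returns a message, or None when it passes
        if problem is not None:
            report.errors.append(problem)
    return report


if __name__ == "__main__":
    checks = [
        lambda: "SBD_DELAY_START is not at the recommended value",
        lambda: None,
        lambda: "stonith-timeout is not at the recommended value",
    ]
    result = run_checks(checks, unreachable_nodes=["sle16-2"])
    for message in result.warnings:
        print("WARNING:", message)
    for message in result.errors:
        print("ERROR:", message)
```

Collecting findings in one object, rather than raising on the first failure, is what lets a single run list several errors and warnings together.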

@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 3 times, most recently from 513e243 to 36b95c9 on December 24, 2025 13:54
codecov bot commented Dec 24, 2025

Codecov Report

❌ Patch coverage is 72.36025% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.33%. Comparing base (0de9df8) to head (040914e).

Files with missing lines    Patch %    Lines
crmsh/sbd.py                71.49%     65 Missing ⚠️
crmsh/utils.py              64.58%     17 Missing ⚠️
crmsh/ui_cluster.py         0.00%      3 Missing ⚠️
crmsh/ui_sbd.py             92.59%     2 Missing ⚠️
crmsh/xmlutil.py            50.00%     2 Missing ⚠️
Additional details and impacted files
Flag           Coverage Δ
integration    55.20% <24.22%> (-0.22%) ⬇️
unit           52.57% <68.63%> (+0.10%) ⬆️

Flags with carried forward coverage won't be shown.

Files with missing lines    Coverage Δ
crmsh/bootstrap.py          87.93% <100.00%> (+0.05%) ⬆️
crmsh/ui_sbd.py             83.72% <92.59%> (+0.35%) ⬆️
crmsh/xmlutil.py            70.20% <50.00%> (-0.09%) ⬇️
crmsh/ui_cluster.py         71.32% <0.00%> (+0.49%) ⬆️
crmsh/utils.py              66.94% <64.58%> (+0.04%) ⬆️
crmsh/sbd.py                80.12% <71.49%> (-2.98%) ⬇️

... and 1 file with indirect coverage changes

@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from 0e8b920 to f37d1df on December 27, 2025 12:27
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 4 times, most recently from 3b67ead to a4d8674 on December 29, 2025 08:53
@liangxin1300 liangxin1300 marked this pull request as ready for review December 29, 2025 14:01
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 5 times, most recently from d47ea36 to e38d239 on January 5, 2026 11:24
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 6 times, most recently from 18bab8a to 52d46c9 on January 6, 2026 14:51
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from ef95154 to bae3596 on January 8, 2026 13:32
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from 4329847 to bc5d9b2 on January 11, 2026 14:19
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from 040914e to e03dccb on January 12, 2026 02:00
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch from e03dccb to 1add080 on January 12, 2026 03:14
@liangxin1300 liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch from 1add080 to f19f855 on January 12, 2026 06:39
@zzhou1 zzhou1 left a comment

Very good UX improvement!

@liangxin1300 liangxin1300 merged commit 268a7f4 into ClusterLabs:master Jan 12, 2026
63 of 70 checks passed

Development

Successfully merging this pull request may close these issues.

crmsh could aggressively populate all nodes with /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf
