Improve 'cluster health sbd' command #2003

liangxin1300 · 2025-12-24T09:13:22Z

Problems

This PR addresses the following issues introduced by #1932:

Cluster members are unreachable over SSH
The command aborts as soon as a node cannot be reached over SSH

# crm cluster health sbd
ERROR: cluster.health: Failed on root@sle16-2: Cannot create SSH connection to root@sle16-2: ssh: connect to host sle16-2 port 22: Connection refused
Connection closed

Failed to configure crashdump-watchdog-timeout

# crm sbd configure crashdump-watchdog-timeout=30
INFO: Set crashdump option for fence_sbd resource
INFO: Set msgwait-timeout to 2*watchdog-timeout + crashdump-watchdog-timeout: 60
INFO: Update SBD_TIMEOUT_ACTION in /etc/sysconfig/sbd: flush,crashdump
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_OPTS in /etc/sysconfig/sbd: -C 30
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Adjusting sbd msgwait to 90, watchdog timeout to 15
ERROR: sbd.configure: Failed to run 'crm sbd configure msgwait-timeout=90 watchdog-timeout=15': ERROR: sbd.configure: Failed to run 'crm sbd configure msgwait-timeout=120 watchdog-timeout=15':

The minimum value of watchdog-timeout is always 5, regardless of what profiles.yml defines

Changes

Cluster members are unreachable over SSH
- Warn that the configuration consistency check is skipped
- Still run the remaining checks, and raise a warning or error when a check fails
- Report an error that the issue cannot be fixed because some nodes are unreachable over SSH

# crm cluster health sbd
WARNING: Skip configuration consistency check due to unreachable nodes
ERROR: It's recommended that SBD_DELAY_START is set to 71, now is no
ERROR: Cannot fix SBD_DELAY_START issue: There are nodes whose SSH ports are unreachable: sle16-2.
Please check the network connectivity before check and fix SBD timeout configurations.
ERROR: SBD: Check sbd timeout configuration: FAIL.
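The behaviour shown above can be sketched roughly as follows. This is a minimal illustration, not the crmsh implementation: `ssh_port_open`, `partition_nodes` and `plan_checks` are hypothetical names, and the probe is injectable so the logic can be exercised without a network.

```python
import socket
from typing import Callable, Dict, Iterable, List, Tuple


def ssh_port_open(host: str, port: int = 22, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts and name resolution failures.
        return False


def partition_nodes(
    nodes: Iterable[str], probe: Callable[[str], bool] = ssh_port_open
) -> Tuple[List[str], List[str]]:
    """Split nodes into (reachable, unreachable) according to the probe."""
    reachable: List[str] = []
    unreachable: List[str] = []
    for node in nodes:
        (reachable if probe(node) else unreachable).append(node)
    return reachable, unreachable


def plan_checks(
    nodes: Iterable[str], probe: Callable[[str], bool] = ssh_port_open
) -> Dict[str, object]:
    """Decide what may run: skip the consistency check and any fix when
    at least one node is unreachable, but keep the other checks going."""
    _, unreachable = partition_nodes(nodes, probe)
    messages: List[str] = []
    if unreachable:
        messages.append(
            "WARNING: Skip configuration consistency check due to unreachable nodes"
        )
    return {
        "run_consistency_check": not unreachable,
        "can_fix": not unreachable,
        "unreachable": unreachable,
        "messages": messages,
    }
```

With a probe that reports `sle16-2` as down, `plan_checks` disables both the consistency check and the fix step while leaving the local checks free to run.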

sbd: Calculate expected msgwait timeout correctly with crashdump timeout

# crm sbd configure crashdump-watchdog-timeout=30
INFO: Set crashdump option for fence_sbd resource
INFO: Set msgwait-timeout to 2*watchdog-timeout + crashdump-watchdog-timeout: 60
INFO: Update SBD_TIMEOUT_ACTION in /etc/sysconfig/sbd: flush,crashdump
INFO: Configuring disk-based SBD
INFO: Initializing SBD device /dev/sda5
INFO: Update SBD_OPTS in /etc/sysconfig/sbd: -C 30
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Update SBD_DELAY_START in /etc/sysconfig/sbd: 101
INFO: Sync file /etc/sysconfig/sbd to sle16-2
INFO: Adjusting systemd start timeout for sbd.service to 121s
INFO: Sync directory /etc/systemd/system/sbd.service.d to sle16-2
INFO: Adjusting stonith-timeout to 83
WARNING: "stonith-timeout" in crm_config is set to 83, it was 71
INFO: Restarting cluster service
INFO: BEGIN Waiting for cluster
.......
INFO: END Waiting for cluster
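The msgwait arithmetic from the log line above ("2*watchdog-timeout + crashdump-watchdog-timeout") can be written out as a small sketch. The function name `expected_msgwait` is hypothetical, and the fallback of `2 * watchdog_timeout` when no crashdump timeout is set is an assumption, not something this PR states.

```python
from typing import Optional


def expected_msgwait(
    watchdog_timeout: int, crashdump_watchdog_timeout: Optional[int] = None
) -> int:
    """Expected msgwait timeout in seconds.

    Formula taken from the log output:
        2 * watchdog-timeout + crashdump-watchdog-timeout
    Assumption: without a crashdump timeout, only the first term applies.
    """
    msgwait = 2 * watchdog_timeout
    if crashdump_watchdog_timeout:
        msgwait += crashdump_watchdog_timeout
    return msgwait


# Values from the example session: watchdog-timeout=15, crashdump-watchdog-timeout=30
print(expected_msgwait(15, 30))  # 60
```

With a watchdog timeout of 15 and a crashdump watchdog timeout of 30, this gives the 60 seconds reported in the log.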

ui_sbd: Get minimum timeout value dynamically

# minimum watchdog timeout is from profiles.yml
# crm sbd configure watchdog-timeout=1
ERROR: sbd.configure: The minimum value for watchdog-timeout is 15
# crm sbd configure msgwait-timeout=2
ERROR: sbd.configure: The minimum value for msgwait-timeout is 30

# when no watchdog timeout is defined in profiles.yml
# crm sbd configure watchdog-timeout=1
ERROR: sbd.configure: The minimum value for watchdog-timeout is 5
# crm sbd configure msgwait-timeout=2
ERROR: sbd.configure: The minimum value for msgwait-timeout is 10
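A minimal sketch of the lookup behind the two sessions above. The dictionary key `watchdog_timeout`, the function names, and the rule that the msgwait minimum is twice the watchdog minimum are assumptions inferred from the printed values (15 and 30, 5 and 10); the actual profiles.yml layout and crmsh code may differ.

```python
from typing import Dict, Mapping

# Fallback used when the profile does not define a watchdog timeout
# (matches the value 5 shown in the second session).
DEFAULT_MIN_WATCHDOG_TIMEOUT = 5


def minimum_timeouts(profile: Mapping[str, int]) -> Dict[str, int]:
    """Derive minimum values from an already-parsed profile mapping.

    'watchdog_timeout' is a hypothetical key name.
    """
    watchdog_min = profile.get("watchdog_timeout", DEFAULT_MIN_WATCHDOG_TIMEOUT)
    return {
        "watchdog-timeout": watchdog_min,
        # Assumption: the msgwait minimum is twice the watchdog minimum.
        "msgwait-timeout": 2 * watchdog_min,
    }


def validate(option: str, value: int, profile: Mapping[str, int]) -> None:
    """Raise ValueError if value is below the minimum for option."""
    minimum = minimum_timeouts(profile)[option]
    if value < minimum:
        raise ValueError(f"The minimum value for {option} is {minimum}")
```

Calling `validate("watchdog-timeout", 1, {"watchdog_timeout": 15})` raises an error with the same wording as the first session, and an empty profile falls back to 5 and 10.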

sbd: Check and fix the drop-in file that unsets SBD_DELAY_START
Fix: crmsh could aggressively populate all nodes with /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf #2001

# crm cluster health sbd
WARNING: Runtime drop-in file /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf to unset SBD_DELAY_START is missing
INFO: Please run "crm cluster health sbd --fix" to fix the above warning
INFO: SBD: Check sbd timeout configuration: OK.

# crm cluster health sbd --fix
INFO: Createing runtime drop-in file /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf to unset SBD_DELAY_START
INFO: SBD: Check sbd timeout configuration: OK.

Show multiple errors or warnings at once if detected

# crm sbd configure show
...
ERROR: It's recommended that SBD_DELAY_START is set to 131, now is 31
ERROR: It's recommended that stonith-timeout is set to 119, now is 19
WARNING: Runtime drop-in file /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf to unset SBD_DELAY_START is missing
INFO: Please run "crm cluster health sbd --fix" to fix the above error

The checks now cover:
- /etc/corosync/corosync.conf consistency
- /etc/sysconfig/sbd consistency
- SBD disk metadata
- SBD devices metadata consistency
- SBD_WATCHDOG_TIMEOUT
- SBD_DELAY_START
- systemd start timeout for sbd.service
- stonith-watchdog-timeout property
- stonith-timeout property
- drop-in file that unsets SBD_DELAY_START
- sbd.service should be enabled
- fence_sbd agent
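Reporting every finding at once, instead of stopping at the first failure, could look like the sketch below. `Finding`, `run_all_checks` and `render` are hypothetical names and the structure is an illustration only; the message texts are taken from the output shown earlier.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Finding:
    level: str    # "ERROR" or "WARNING"
    message: str


# A check returns a Finding when something is wrong, or None when it passes.
Check = Callable[[], Optional[Finding]]


def run_all_checks(checks: List[Check]) -> List[Finding]:
    """Run every check and collect all findings rather than failing fast."""
    findings: List[Finding] = []
    for check in checks:
        finding = check()
        if finding is not None:
            findings.append(finding)
    return findings


def render(findings: List[Finding]) -> List[str]:
    """Format findings as log lines, followed by a single fix hint."""
    lines = [f"{f.level}: {f.message}" for f in findings]
    if findings:
        worst = "error" if any(f.level == "ERROR" for f in findings) else "warning"
        lines.append(
            f'INFO: Please run "crm cluster health sbd --fix" to fix the above {worst}'
        )
    return lines
```

Each item in the list above would map to one `Check` callable, so adding a new check does not change how the results are collected or printed.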

codecov · 2025-12-24T15:02:33Z

Codecov Report

❌ Patch coverage is 72.36025% with 89 lines in your changes missing coverage. Please review.
✅ Project coverage is 70.33%. Comparing base (0de9df8) to head (040914e).

Files with missing lines	Patch %	Lines
crmsh/sbd.py	71.49%	65 Missing ⚠️
crmsh/utils.py	64.58%	17 Missing ⚠️
crmsh/ui_cluster.py	0.00%	3 Missing ⚠️
crmsh/ui_sbd.py	92.59%	2 Missing ⚠️
crmsh/xmlutil.py	50.00%	2 Missing ⚠️

Additional details and impacted files

Flag	Coverage Δ
integration	`55.20% <24.22%> (-0.22%)`	⬇️
unit	`52.57% <68.63%> (+0.10%)`	⬆️


Files with missing lines	Coverage Δ
crmsh/bootstrap.py	`87.93% <100.00%> (+0.05%)`	⬆️
crmsh/ui_sbd.py	`83.72% <92.59%> (+0.35%)`	⬆️
crmsh/xmlutil.py	`70.20% <50.00%> (-0.09%)`	⬇️
crmsh/ui_cluster.py	`71.32% <0.00%> (+0.49%)`	⬆️
crmsh/utils.py	`66.94% <64.58%> (+0.04%)`	⬆️
crmsh/sbd.py	`80.12% <71.49%> (-2.98%)`	⬇️

... and 1 file with indirect coverage changes



zzhou1

Very good UX improvement!

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 3 times, most recently from 513e243 to 36b95c9 Compare December 24, 2025 13:54

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from 0e8b920 to f37d1df Compare December 27, 2025 12:27

liangxin1300 mentioned this pull request Dec 27, 2025

crmsh could aggressively populate all nodes with /run/systemd/system/sbd.service.d/sbd_delay_start_disabled.conf #2001

Closed

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 4 times, most recently from 3b67ead to a4d8674 Compare December 29, 2025 08:53

liangxin1300 marked this pull request as ready for review December 29, 2025 14:01

liangxin1300 requested review from nicholasyang2022 and zzhou1 December 29, 2025 14:01

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 5 times, most recently from d47ea36 to e38d239 Compare January 5, 2026 11:24

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 6 times, most recently from 18bab8a to 52d46c9 Compare January 6, 2026 14:51

liangxin1300 added 5 commits January 8, 2026 10:46

Dev: sbd: Check all nodes are reachable when checking SBD-related timeouts (jsc#PED-14995)

eede80f

- Check configurations' consistency only when all nodes are reachable
- For those SSH required checkings, raise FixAbortedDueToUnreachableNode when some nodes are unreachable

Dev: sbd: check SBD_DELAY_START for non integer case

cb0f055

Dev: sbd: Calculate expected msgwait timeout correctly with crashdump timeout

c94bb1d

Dev: ui_sbd: Get minimum timeout value dynamically

f148c5b

Dev: behave: Adjust functional test for previous commit

7e824a8

liangxin1300 added 11 commits January 8, 2026 10:46

Dev: sbd: Check and fix drop-in file which to unset SBD_DELAY_START

6560b8e

Dev: utils: Refactor utils.check_all_nodes_reachable

ec8e941

Collect dead nodes, unreachable nodes, need password nodes and reachable nodes at once when doing the check.

Dev: sbd: Check configurations consistency only if there are reachable nodes

3523529

And move SBD devices metadata consistency check as a separate check step

Dev: sbd: Show diff output after error output when checking consistency

8e7a0f0

Dev: sbd: Enable to fix devices metadata consistency issue

d36cbcd

Dev: sbd: Show multi errors or warnings at once if detected

e3dc3a7

Dev: sbd: Check if the drop-in file which to unset SBD_DELAY_START exists on all nodes

b6145ef

Dev: sbd: Check sbd systemd start timeout on all nodes

6563036

Dev: sbd: Ignore comment line and blank line when checking consistency

7e1e249

Dev: ui_sbd: Do sbd timeout-related configurations check on 'crm sbd status' as well

4812922

Dev: sbd: Check if sbd.service is enabled on all nodes

fae0ef7

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from ef95154 to bae3596 Compare January 8, 2026 13:32

Dev: sbd: Check and fix fence_sbd agent

21ef350

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from 4329847 to bc5d9b2 Compare January 11, 2026 14:19

Dev: sbd: Add debug log for sbd checking results

cb7ee9d

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch 2 times, most recently from 040914e to e03dccb Compare January 12, 2026 02:00

Dev: sbd: Refactor to enable checking when cluster is down

f497c52

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch from e03dccb to 1add080 Compare January 12, 2026 03:14

Dev: unittests: Adjust unit test for previous commits

f19f855

liangxin1300 force-pushed the 20251223_cluster_health_sbd_improve branch from 1add080 to f19f855 Compare January 12, 2026 06:39

zzhou1 approved these changes Jan 12, 2026


liangxin1300 merged commit 268a7f4 into ClusterLabs:master Jan 12, 2026
63 of 70 checks passed

This was referenced Jan 13, 2026

DOCTEAM-1985 diskless SBD SUSE/doc-modular#629

Merged

Don't set stonith-enabled=true after adding sbd via stage #2011

Closed
