Conversation

@rebtoor (Contributor) commented Nov 4, 2025

Add a task in post.yaml to trigger Ceph manager failover when the cluster health status is not HEALTH_OK. This helps recover from degraded cluster states after migration operations.

The task installs the cephadm package on all ComputeHCI nodes and then executes cephadm shell -- ceph mgr fail on the first compute node. This approach avoids container-based CLI complexity and uses the native cephadm tool available on compute nodes where Ceph daemons are running.

The task only runs when ComputeHCI nodes are available and the cluster health is degraded (HEALTH_WARN or HEALTH_ERR).
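
For reference, a minimal sketch of what such tasks could look like in post.yaml. The task names, the ComputeHCI inventory group lookup, and the health-check wiring shown here are illustrative assumptions, not the exact contents of the patch:

```yaml
# Illustrative sketch only: task names and the health-check wiring are
# assumptions; the actual tasks in post.yaml may differ.
- name: Install cephadm on all ComputeHCI nodes
  ansible.builtin.package:
    name: cephadm
    state: present
  become: true
  delegate_to: "{{ item }}"
  loop: "{{ groups['ComputeHCI'] | default([]) }}"
  when: groups['ComputeHCI'] | default([]) | length > 0

- name: Get Ceph cluster health from the first ComputeHCI node
  ansible.builtin.command: cephadm shell -- ceph health
  register: ceph_health
  become: true
  changed_when: false
  delegate_to: "{{ groups['ComputeHCI'][0] }}"
  when: groups['ComputeHCI'] | default([]) | length > 0

- name: Trigger Ceph manager failover when the cluster is degraded
  ansible.builtin.command: cephadm shell -- ceph mgr fail
  become: true
  delegate_to: "{{ groups['ComputeHCI'][0] }}"
  when:
    - groups['ComputeHCI'] | default([]) | length > 0
    - ceph_health.stdout | default('') is search('HEALTH_WARN|HEALTH_ERR')
```

Running `ceph mgr fail` without an argument fails over the currently active manager so that a standby takes over, which is often enough to clear stale health state left behind after the daemons have been moved.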

@rebtoor rebtoor requested review from a team, fmount and katarimanojk November 4, 2025 15:06
@katarimanojk (Contributor):

/lgtm

@jistr (Contributor) commented Nov 7, 2025

/retest-required
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Nov 7, 2025
@fmount (Contributor) commented Nov 7, 2025

  • We might need to rebase this patch to pick up the linting fix.
  • The upstream test is happening here, and early results look good.

After we rebase and have a green downstream unigamma run, I think we can land and backport this patch to unblock CI.

@openshift-ci openshift-ci bot removed the lgtm label Nov 10, 2025
@openshift-ci bot commented Nov 10, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci bot commented Nov 10, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jistr. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rebtoor rebtoor marked this pull request as draft November 10, 2025 17:01
@rebtoor (Contributor, Author) commented Nov 10, 2025

I marked the PR as draft because there are points that I need to discuss with @fmount first to ensure the robustness of the procedure.

@rebtoor rebtoor force-pushed the ceph-migrate-fix branch 6 times, most recently from b8e070b to c44b232 Compare November 11, 2025 17:04
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a6bc2713a27e4392ad35a38b052d75ff

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph FAILURE in 2h 19m 13s
adoption-standalone-to-crc-no-ceph FAILURE in 2h 24m 24s

@rebtoor rebtoor changed the title ceph_migrate refactor restart mgr handler as task to fix role usage [ceph_migrate] trigger mgr failover when cluster health is degraded Nov 13, 2025
@rebtoor rebtoor marked this pull request as ready for review November 13, 2025 09:41
@rebtoor rebtoor force-pushed the ceph-migrate-fix branch 2 times, most recently from f7c69e3 to 3321964 Compare November 14, 2025 09:10
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5e74b248cbf849bcb51544e4395ba3fd

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 3h 03m 21s
adoption-standalone-to-crc-no-ceph NODE_FAILURE Node request 100-0008086801 failed in 0s

@rebtoor (Contributor, Author) commented Nov 14, 2025

recheck

Add a task in post.yaml to trigger Ceph manager failover when the
cluster health status is not HEALTH_OK. This helps recover from
degraded cluster states after migration operations.

The task installs the cephadm package on all ComputeHCI nodes and
then executes 'cephadm shell -- ceph mgr fail' on the first compute
node. This approach avoids container-based CLI complexity and uses
the native cephadm tool available on compute nodes where Ceph
daemons are running.

The task only runs when ComputeHCI nodes are available and the
cluster health is degraded (HEALTH_WARN or HEALTH_ERR).

Signed-off-by: Roberto Alfieri <[email protected]>