Conversation

@rebtoor (Contributor) commented Nov 4, 2025

Add a task in post.yaml to trigger Ceph manager failover when the cluster health status is not HEALTH_OK. This helps recover from degraded cluster states after migration operations.

The task installs the cephadm package on all ComputeHCI nodes and then executes cephadm shell -- ceph mgr fail on the first compute node. This approach avoids container-based CLI complexity and uses the native cephadm tool available on compute nodes where Ceph daemons are running.

The task only runs when ComputeHCI nodes are available and the cluster health is degraded (HEALTH_WARN or HEALTH_ERR).
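
For reference, a minimal sketch of what such tasks could look like in post.yaml. The task names, the ComputeHCI inventory group lookup, and the health-check wiring shown here are illustrative assumptions, not the exact contents of the patch:

```yaml
# Illustrative sketch only: task names and the health-check wiring are
# assumptions; the actual tasks in post.yaml may differ.
- name: Install cephadm on all ComputeHCI nodes
  ansible.builtin.package:
    name: cephadm
    state: present
  become: true
  delegate_to: "{{ item }}"
  loop: "{{ groups['ComputeHCI'] | default([]) }}"
  when: groups['ComputeHCI'] | default([]) | length > 0

- name: Get Ceph cluster health from the first ComputeHCI node
  ansible.builtin.command: cephadm shell -- ceph health
  register: ceph_health
  become: true
  changed_when: false
  delegate_to: "{{ groups['ComputeHCI'][0] }}"
  when: groups['ComputeHCI'] | default([]) | length > 0

- name: Trigger Ceph manager failover when the cluster is degraded
  ansible.builtin.command: cephadm shell -- ceph mgr fail
  become: true
  delegate_to: "{{ groups['ComputeHCI'][0] }}"
  when:
    - groups['ComputeHCI'] | default([]) | length > 0
    - ceph_health.stdout | default('') is search('HEALTH_WARN|HEALTH_ERR')
```

Running `ceph mgr fail` without an argument fails over the currently active manager so that a standby takes over, which is often enough to clear stale health state left behind after the daemons have been moved.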

@rebtoor rebtoor requested review from a team, fmount and katarimanojk November 4, 2025 15:06
@katarimanojk (Contributor):

/lgtm

@jistr (Contributor) commented Nov 7, 2025

/retest-required
/lgtm

@openshift-ci openshift-ci bot added the lgtm label Nov 7, 2025
@fmount (Contributor) commented Nov 7, 2025

  • We might need to rebase this patch to pick up the linting fix.
  • The upstream test is happening here, and early results look good.

After we rebase and have a green downstream unigamma run, I think we can land and backport this patch to unblock CI.

@openshift-ci openshift-ci bot removed the lgtm label Nov 10, 2025
@openshift-ci bot commented Nov 10, 2025

New changes are detected. LGTM label has been removed.

@openshift-ci bot commented Nov 10, 2025

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by:
Once this PR has been reviewed and has the lgtm label, please ask for approval from jistr. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@rebtoor rebtoor marked this pull request as draft November 10, 2025 17:01
@rebtoor (Contributor, Author) commented Nov 10, 2025

I marked the PR as draft because there are points that I need to discuss with @fmount first to ensure the robustness of the procedure.

@rebtoor rebtoor force-pushed the ceph-migrate-fix branch 6 times, most recently from b8e070b to c44b232 Compare November 11, 2025 17:04
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/a6bc2713a27e4392ad35a38b052d75ff

✔️ noop SUCCESS in 0s
adoption-standalone-to-crc-ceph FAILURE in 2h 19m 13s
adoption-standalone-to-crc-no-ceph FAILURE in 2h 24m 24s

@rebtoor rebtoor changed the title ceph_migrate refactor restart mgr handler as task to fix role usage [ceph_migrate] trigger mgr failover when cluster health is degraded Nov 13, 2025
@rebtoor rebtoor marked this pull request as ready for review November 13, 2025 09:41
@rebtoor rebtoor force-pushed the ceph-migrate-fix branch 2 times, most recently from f7c69e3 to 3321964 Compare November 14, 2025 09:10
@softwarefactory-project-zuul

Build failed (check pipeline). Post recheck (without leading slash)
to rerun all jobs. Make sure the failure cause has been resolved before
you rerun jobs.

https://softwarefactory-project.io/zuul/t/rdoproject.org/buildset/5e74b248cbf849bcb51544e4395ba3fd

✔️ noop SUCCESS in 0s
✔️ adoption-standalone-to-crc-ceph SUCCESS in 3h 03m 21s
adoption-standalone-to-crc-no-ceph NODE_FAILURE Node request 100-0008086801 failed in 0s

@rebtoor (Contributor, Author) commented Nov 14, 2025

recheck

Add a task in post.yaml to trigger Ceph manager failover when the
cluster health status is not HEALTH_OK. This helps recover from
degraded cluster states after migration operations.

The task installs the cephadm package on all ComputeHCI nodes and
then executes 'cephadm shell -- ceph mgr fail' on the first compute
node. This approach avoids container-based CLI complexity and uses
the native cephadm tool available on compute nodes where Ceph
daemons are running.

The task only runs when ComputeHCI nodes are available and the
cluster health is degraded (HEALTH_WARN or HEALTH_ERR).

Signed-off-by: Roberto Alfieri <[email protected]>