Skip to content

Conversation

tjungblu
Copy link
Contributor

This PR will actively try to move the leadership away from the bootstrap member to another healthy member.

/hold

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 19, 2024
@openshift-ci-robot openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 19, 2024
@openshift-ci-robot
Copy link

@tjungblu: This pull request references Jira Issue OCPBUGS-42810, which is invalid:

  • expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR will actively try to move the leadership away from the bootstrap member to another healthy member.

/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci openshift-ci bot requested review from Elbehery and dusk125 November 19, 2024 13:22
@openshift-ci openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2024
@tjungblu tjungblu force-pushed the OCPBUGS-42810_moveleader branch 2 times, most recently from 5720e20 to a747960 Compare November 20, 2024 13:08
@tjungblu
Copy link
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Nov 20, 2024

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ce14d7c0-a741-11ef-9f14-d84ba44f8a80-0

@tjungblu tjungblu force-pushed the OCPBUGS-42810_moveleader branch from a747960 to 0c031aa Compare November 20, 2024 13:25
@tjungblu
Copy link
Contributor Author

/retest

1 similar comment
@tjungblu
Copy link
Contributor Author

/retest

@tjungblu tjungblu force-pushed the OCPBUGS-42810_moveleader branch from 0c031aa to 369b1cd Compare November 28, 2024 13:06
@tjungblu
Copy link
Contributor Author

/cherry-pick release-4.18

@openshift-cherrypick-robot

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.18 in a new PR and assign it to you.

In response to this:

/cherry-pick release-4.18

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@tjungblu
Copy link
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Nov 28, 2024

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/01c7f930-ada0-11ef-8674-28fc71df2147-0

@tjungblu
Copy link
Contributor Author

/retest-required

@tjungblu
Copy link
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Nov 28, 2024

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e6fa4930-adab-11ef-8910-6373b3a73b44-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Dec 2, 2024

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Dec 2, 2024

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/37cd8410-b0a6-11ef-8f86-550f1a2ac5ba-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Dec 3, 2024

/retest

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Dec 3, 2024

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/63892fc0-b180-11ef-99d0-e245fa792cd0-0

err = c.removeBootstrap(timeoutCtx, safeToRemoveBootstrap, hasBootstrap, bootstrapID)
if hasBootstrap {
if err := c.ensureBootstrapIsNotLeader(ctx, bootstrapMember); err != nil {
klog.Errorf("error while ensuring bootstrap is not leader: %v", err)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

should we return err here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not really, I wouldn't want the controller to degrade when that operation fails once. Mind you, this happens very very rarely and affects our CI pass rates in very few cases of hundreds.

@vrutkovs
Copy link
Member

vrutkovs commented Jan 8, 2025

/retest

@vrutkovs
Copy link
Member

vrutkovs commented Jan 8, 2025

/test e2e-metal-ovn-ha-cert-rotation-shutdown

@vrutkovs
Copy link
Member

/test e2e-metal-ovn-ha-cert-rotation-shutdown e2e-metal-ovn-sno-cert-rotation-shutdown

@tjungblu
Copy link
Contributor Author

tjungblu commented Sep 2, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Sep 2, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9aae7ad0-87e2-11f0-8e36-9df58d5d7d07-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Sep 3, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Sep 3, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d13fd050-888a-11f0-8776-036a5a213b71-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Sep 3, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Sep 3, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4b6983f0-88c0-11f0-9736-1b2d461ebfab-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Sep 4, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

Copy link
Contributor

openshift-ci bot commented Sep 4, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/69083890-8954-11f0-9223-cf9f15e3c9ba-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Sep 4, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 20

Copy link
Contributor

openshift-ci bot commented Sep 4, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7e345990-897e-11f0-8451-b91fd76131ee-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Sep 5, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade 20

Copy link
Contributor

openshift-ci bot commented Sep 5, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1a7dde20-8a3d-11f0-9f46-8138e528642f-0

@tjungblu
Copy link
Contributor Author

tjungblu commented Sep 9, 2025

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade 20

Copy link
Contributor

openshift-ci bot commented Sep 9, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/94d60770-8d75-11f0-953c-decb559445d5-0

@tjungblu
Copy link
Contributor Author

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 20

Copy link
Contributor

openshift-ci bot commented Sep 10, 2025

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

  • periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9c92dad0-8e3b-11f0-99df-a62f065d382a-0

Copy link
Contributor

@hasbro17 hasbro17 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some nits and questions for my own sake but seems good overall!

return false, err
}

klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Super pedantic nit since you're already fixing spelling mistakes in this PR

Suggested change
klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also while we're here, does MoveLeader() == nil mean the leadership was actually transferred, or we've only made a request to the leader, and it may still be in the process of stepping down or we're in the midst of an election.

Asking since the warning here says Moving instead of Moved.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay so it seems like it waits for the actual transfer to finish and not just a shallow request.

s.r.TransferLeadership(ctx, lead, transferee)
for s.Lead() != transferee {
select {
case <-ctx.Done(): // time out
return ErrTimeoutLeaderTransfer
case <-time.After(interval):
}
}

In that case wdyt about the past tense:

Suggested change
klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
klog.Warningf("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yep, sounds good! thanks for proof reading this. I don't really understand why my spell checker in Goland picks up on the bOOstrap but not on the success 👍

}

if otherMember == nil {
return false, fmt.Errorf("no follower member found for the members: %v", memberList)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just some more context on the operation in case this ever does happen.

Suggested change
return false, fmt.Errorf("no follower member found for the members: %v", memberList)
return false, fmt.Errorf("no follower member found for leadership transfer: %v", memberList)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thank you, applied!

Comment on lines +159 to +160
// if we have just moved it, we will skip this sync iteration to backoff the controller - the next resync will happen after a minute anyway
if moved {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This seems okay so not disagreeing here, but for my knowledge what is the potential negative impact of removing the member right after we transfer?
At first I thought we're waiting for the transfer to happen but as I found out below MoveLeader() seems to be synchronous.

If we are indeed waiting for something to stabilize then just thinking if it's possible that may not happen by the next resync.

Or just so we get a chance to update the status first.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are indeed waiting for something to stabilize then just thinking if it's possible that may not happen by the next resync.

great question: what are we waiting on here. My intuition was to avoid disrupting the API too much, the clients will get the LeaderChangedErr (Error from server: etcdserver: leader changed) on the move. Then potentially removing the member could also disrupt it further, e.g. during a static pod rollout.

I know this is highly hypothetical...
Unfortunately, I do not have any data yet on whether this would be really necessary. It is just too damn hard to reproduce this case.

Let me quickly put together a controller that will always make the bootstrap node the leader during the bootstrapping phase.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

here's a run with this controller, you can grep the logs with TEST_ONLY

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1369/pull-ci-openshift-cluster-etcd-operator-main-e2e-azure/1967558360303669248

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fpull-ci-openshift-cluster-etcd-operator-main-e2e-azure%2F1967558360303669248%5C%22%7D%20%7C~%20%5C%22TEST_ONLY%7Cremoved%20bootstrap%5C%22%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221757931212276%22,%22to%22:%221757967212278%22%7D%7D

it also seems there was no actual move being done before removing, even though we have moved the leader to the bootstrap node a minute earlier.

it still kinda shows what I was worried about:

I0915 13:00:26.367586 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"6e338218-3237-4f73-9c33-f83a5f40b8a0", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Bootstrap member removed' successfully removed bootstrap member [32326af0e011bf0a]

E0915 13:00:26.377135 1 event.go:359] "Server rejected event (will not retry!)" err="rpc error: code = Unknown desc = raft: stopped" event="&Event{ObjectMeta:{etcd-operator.1865763b3da4743d openshift-etcd-operator 0 0001-01-01 00:00:00 +0000 UTC map[] map[] [] []

it seems that this is the case of scaling from 4->3, but during a static pod rollout, so that breaks quorum and causes raft to stop.

Anyway, I need a bit more time to test this, so I'll pick this up again when I have some time left with RIT :)

Copy link
Contributor

openshift-ci bot commented Sep 11, 2025

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, hasbro17, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:
  • OWNERS [dusk125,hasbro17,tjungblu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@hasbro17
Copy link
Contributor

/retest-required

This PR will actively try to move the leadership away from the bootstrap member to another member.

Signed-off-by: Thomas Jungblut <[email protected]>
@tjungblu tjungblu force-pushed the OCPBUGS-42810_moveleader branch from 994ce3f to 42628ec Compare September 15, 2025 11:57
Copy link

coderabbitai bot commented Sep 15, 2025

Walkthrough

Adds leader movement capability and exposes etcd member status responses. Updates etcd client interfaces and fake implementations, introduces leadership helper utilities and tests, refactors bootstrap teardown to avoid removing a leader, adds a test-only bootstrap leadership controller, and wires it into operator startup.

Changes

Cohort / File(s) Change summary
Etcd client API and impl
pkg/etcdcli/etcdcli.go, pkg/etcdcli/helpers.go, pkg/etcdcli/interfaces.go
Introduced LeaderMover interface and MoveLeader method; updated EtcdClient to embed LeaderMover. Changed MemberStatus to return *clientv3.StatusResponse, error in interfaces, real client, and fake client. Added fake options to set/get leader and updated fake status handling.
Bootstrap teardown controller and tests
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go, pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go
Refactored bootstrap handling to use *etcdserverpb.Member instead of ID. Added leadership check/move before removal via ceohelpers. Introduced new condition/event for leader movement. Updated tests to new signatures, status responses, leader simulation, and assertions.
Leadership helpers and tests
pkg/operator/ceohelpers/leader.go, pkg/operator/ceohelpers/leader_test.go
Added FindLeader to aggregate leader from member statuses and validate consistency. Added MoveLeaderToAnotherMember to move leadership to a follower. Created unit tests with mocks covering success and error scenarios.
Bootstrap test controller and startup wiring
pkg/operator/bootstraptest/bootstrap_lead_test_controller.go, pkg/operator/starter.go
Added a test-only controller to attempt moving leadership to the bootstrap member; logs/events outcomes. Wired controller into operator startup with a 2-minute resync.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)
Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. You can run @coderabbitai generate docstrings to improve docstring coverage.
✅ Passed checks (2 passed)
Check name Status Explanation
Title Check ✅ Passed The title "OCPBUGS-42810: actively move bootstrap member lead" concisely names the main change (actively moving leadership away from the bootstrap member) and references the tracking bug, which matches the implemented MoveLeader and related bootstrap logic. It is specific, focused, and suitable for a teammate scanning PR history. The phrasing is clear enough to convey the primary intent of the changeset.
Description Check ✅ Passed The description states that the PR will actively attempt to move leadership away from the bootstrap member, which directly matches the code changes (MoveLeader, leader detection, and bootstrap teardown updates). It is concise and on-topic rather than off-base or misleading. The "/hold" marker correctly indicates the author does not want this merged immediately.
✨ Finishing touches
  • 📝 Generate Docstrings
🧪 Generate unit tests
  • Create PR with unit tests
  • Post copyable unit tests in a comment

Tip

👮 Agentic pre-merge checks are now available in preview!

Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.

  • Built-in checks – Quickly apply ready-made checks to enforce title conventions, require pull request descriptions that follow templates, validate linked issues for compliance, and more.
  • Custom agentic checks – Define your own rules using CodeRabbit’s advanced agentic capabilities to enforce organization-specific policies and workflows. For example, you can instruct CodeRabbit’s agent to verify that API documentation is updated whenever API schema files are modified in a PR. Note: Upto 5 custom checks are currently allowed during the preview period. Pricing for this feature will be announced in a few weeks.

Please see the documentation for more information.

Example:

reviews:
  pre_merge_checks:
    custom_checks:
      - name: "Undocumented Breaking Changes"
        mode: "warning"
        instructions: |
          Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal).

Please share your feedback with us on this Discord post.


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@tjungblu
Copy link
Contributor Author

/jira refresh

@openshift-ci-robot openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Sep 15, 2025
@openshift-ci-robot
Copy link

@tjungblu: This pull request references Jira Issue OCPBUGS-42810, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug
  • bug is open, matching expected state (open)
  • bug target version (4.21.0) matches configured target version for branch (4.21.0)
  • bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

In response to this:

/jira refresh

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

@openshift-ci-robot openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Sep 15, 2025
@openshift-ci openshift-ci bot requested a review from geliu2016 September 15, 2025 11:58
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (1)
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (1)

154-174: Leadership move and backoff strategy looks good

The implementation correctly:

  1. Ensures bootstrap is not the leader before removal
  2. Backs off for a resync cycle after moving leadership
  3. Records appropriate events and conditions

This aligns with the discussion about avoiding API disruption during leadership changes.

🧹 Nitpick comments (8)
pkg/etcdcli/helpers.go (3)

108-114: Consider adding member existence validation in MoveLeader.

The method updates the leader for all statuses without verifying if toMember actually exists in the cluster. This could lead to an inconsistent state in tests.

Add validation to ensure the target member exists:

 func (f *fakeEtcdClient) MoveLeader(ctx context.Context, toMember uint64) error {
+	// Validate that the target member exists
+	memberExists := false
+	for _, m := range f.members {
+		if m.ID == toMember {
+			memberExists = true
+			break
+		}
+	}
+	if !memberExists {
+		return fmt.Errorf("target member %d does not exist", toMember)
+	}
+
 	for _, status := range f.opts.status {
 		status.Leader = toMember
 	}
 
 	return nil
 }

173-185: Potential inconsistency in default leader assignment.

When no status is found for a member, the method returns a default StatusResponse with the member set as its own leader. This might not reflect the actual cluster state and could lead to incorrect leader detection logic.

Consider returning an error or setting the leader to 0 (no leader) instead:

 func (f *fakeEtcdClient) MemberStatus(ctx context.Context, member *etcdserverpb.Member) (*clientv3.StatusResponse, error) {
 	// Find the status for this member
 	for _, status := range f.opts.status {
 		if status.Header != nil && status.Header.MemberId == member.ID {
 			return status, nil
 		}
 	}
-	// Return a default status if none found
-	return &clientv3.StatusResponse{
-		Header: &etcdserverpb.ResponseHeader{MemberId: member.ID},
-		Leader: member.ID, // Default to self as leader
-	}, nil
+	// Return an error if status not found
+	return nil, fmt.Errorf("no status found for member %d", member.ID)
 }

272-278: WithLeader doesn't initialize status headers.

The WithLeader option modifies existing statuses but doesn't ensure they have proper headers. This could cause nil pointer issues when the statuses are later accessed.

Consider initializing headers when setting the leader:

 func WithLeader(leader uint64) FakeClientOption {
 	return func(fo *FakeClientOptions) {
 		for _, status := range fo.status {
+			if status.Header == nil {
+				status.Header = &etcdserverpb.ResponseHeader{}
+			}
 			status.Leader = leader
 		}
 	}
 }
pkg/operator/ceohelpers/leader.go (2)

31-52: Consider adding retry logic for transient failures.

The MoveLeader operation might fail due to transient network issues or temporary unavailability. Consider implementing retry logic with exponential backoff for production resilience.

Also, the log message at line 50 should use Info level instead of Warning since this is a successful operation:

-	klog.Warningf("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
+	klog.Infof("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

34-39: Consider a more intelligent follower selection strategy.

Currently, the function selects the first available follower. In production, you might want to consider factors like member health, load, or geographic distribution.

For better load distribution and resilience, consider:

  1. Checking member health status before selection
  2. Implementing round-robin or least-recently-used selection
  3. Avoiding members that are learners or unhealthy

This would help ensure leadership moves to the most suitable member rather than just the first available one.

pkg/operator/bootstraptest/bootstrap_lead_test_controller.go (1)

35-66: Consider adding metrics for test controller operations

This test controller modifies cluster state by moving leadership. For debugging and observability, consider adding metrics to track:

  • Number of leadership move attempts
  • Success/failure rates
  • Time taken for leadership moves
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (2)

105-107: Typo in function name

The function name has a typo: setSuccessfulBoostrapRemovalStatus should be setSuccessfulBootstrapRemovalStatus.

This is already fixed in the code (line 105 shows the correct spelling), but ensure consistency throughout.


272-273: Consider validating ClientURLs is not empty

The code accesses bootstrapMember.GetClientURLs()[0] without checking if the slice is non-empty. While this should always be populated in practice, adding a defensive check would prevent potential panics.

Apply this diff to add a defensive check:

-klog.Warningf("Bootstrap member [%x] (%s) detected as leader, trying to move elsewhere...", bootstrapMember.ID, bootstrapMember.GetClientURLs()[0])
+clientURL := "unknown"
+if len(bootstrapMember.GetClientURLs()) > 0 {
+  clientURL = bootstrapMember.GetClientURLs()[0]
+}
+klog.Warningf("Bootstrap member [%x] (%s) detected as leader, trying to move elsewhere...", bootstrapMember.ID, clientURL)
📜 Review details

Configuration used: CodeRabbit UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 61b7fb5 and 42628ec.

📒 Files selected for processing (9)
  • pkg/etcdcli/etcdcli.go (2 hunks)
  • pkg/etcdcli/helpers.go (3 hunks)
  • pkg/etcdcli/interfaces.go (3 hunks)
  • pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (7 hunks)
  • pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go (7 hunks)
  • pkg/operator/bootstraptest/bootstrap_lead_test_controller.go (1 hunks)
  • pkg/operator/ceohelpers/leader.go (1 hunks)
  • pkg/operator/ceohelpers/leader_test.go (1 hunks)
  • pkg/operator/starter.go (3 hunks)
🔇 Additional comments (9)
pkg/etcdcli/interfaces.go (3)

17-33: LGTM! Clean interface design.

The addition of LeaderMover to the EtcdClient interface is well-structured and follows the existing pattern of composing interfaces.


68-70: LGTM! Well-designed leader movement interface.

The LeaderMover interface is clean and focused with a clear single responsibility.


93-96: LGTM! Improved MemberStatus signature.

The updated signature returning (*clientv3.StatusResponse, error) provides more detailed information and better aligns with etcd's client API patterns.

pkg/etcdcli/etcdcli.go (2)

267-279: LGTM! Clean implementation of MoveLeader.

The implementation follows the established pattern with proper timeout handling and resource cleanup.


460-474: LGTM! Improved MemberStatus implementation.

The refactored implementation is cleaner and more direct, eliminating the previous status mapping logic in favor of returning the actual etcd status response.

pkg/operator/ceohelpers/leader.go (1)

12-29: LGTM! Well-structured leader detection logic.

The function properly validates consistency across members and handles errors appropriately.

pkg/operator/ceohelpers/leader_test.go (1)

1-315: Comprehensive test coverage!

Excellent test implementation with thorough coverage of edge cases including:

  • Various leader positions in the member list
  • Single-member clusters
  • Empty and nil member lists
  • Error handling scenarios
  • Inconsistent leader reporting

The table-driven approach with clear test names makes the tests maintainable and easy to understand.

pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go (1)

375-389: LGTM: Leader movement test case properly validates the new behavior

The new test case "safe, has bootstrap, moving leader, not removing the bootstrap member" correctly validates that:

  1. The bootstrap member is not removed when it's the leader
  2. The appropriate conditions are set (conditionEnoughEtcdMembers and conditionBootstrapLeaderMoved)
  3. No events are emitted (0 expected events)
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (1)

257-274: Solid implementation of leader check and movement

The ensureBootstrapIsNotLeader method properly:

  1. Lists members to get current state
  2. Finds the current leader
  3. Only attempts to move leadership if bootstrap is the leader
  4. Delegates to the helper function for the actual move

The error handling is comprehensive and the logging provides good observability.

Comment on lines +52 to +62
if hasBootstrap {
klog.Warningf("TEST_ONLY moving leader to the bootstrap member")
moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
if err != nil {
return err
}
if moved {
klog.Warningf("TEST_ONLY successfully moved to the bootstrap member")
} else {
klog.Warningf("TEST_ONLY failed to move to the bootstrap member")
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Incorrect leader movement logic detected

The message at line 53 says "moving leader to the bootstrap member", but the code at line 54 calls MoveLeaderToAnotherMember which moves leadership away from the bootstrap member. The log messages are contradictory to the actual behavior.

Apply this diff to fix the log messages:

 if hasBootstrap {
-  klog.Warningf("TEST_ONLY moving leader to the bootstrap member")
+  klog.Warningf("TEST_ONLY moving leader away from the bootstrap member")
   moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
   if err != nil {
     return err
   }
   if moved {
-    klog.Warningf("TEST_ONLY successfully moved to the bootstrap member")
+    klog.Warningf("TEST_ONLY successfully moved away from the bootstrap member")
   } else {
-    klog.Warningf("TEST_ONLY failed to move to the bootstrap member")
+    klog.Warningf("TEST_ONLY failed to move away from the bootstrap member")
   }
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if hasBootstrap {
klog.Warningf("TEST_ONLY moving leader to the bootstrap member")
moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
if err != nil {
return err
}
if moved {
klog.Warningf("TEST_ONLY successfully moved to the bootstrap member")
} else {
klog.Warningf("TEST_ONLY failed to move to the bootstrap member")
}
if hasBootstrap {
klog.Warningf("TEST_ONLY moving leader away from the bootstrap member")
moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
if err != nil {
return err
}
if moved {
klog.Warningf("TEST_ONLY successfully moved away from the bootstrap member")
} else {
klog.Warningf("TEST_ONLY failed to move away from the bootstrap member")
}
}
🤖 Prompt for AI Agents
In pkg/operator/bootstraptest/bootstrap_lead_test_controller.go around lines 52
to 62, the log messages incorrectly state that the leader is being moved "to the
bootstrap member" while the code calls MoveLeaderToAnotherMember (which moves
the leader away from the bootstrap member); update the three klog.Warningf
message strings to accurately reflect the action (e.g., "TEST_ONLY moving leader
away from the bootstrap member", "TEST_ONLY successfully moved leader away from
the bootstrap member", "TEST_ONLY failed to move leader away from the bootstrap
member") so the logs match the actual behavior.

Comment on lines +471 to +473
// TODO(thomas): TEST ONLY, REMOVE AGAIN
bootstrapTestController := bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Verification agent

🧩 Analysis chain

Test-only controller in production code needs removal plan.

The bootstrap test controller is marked as "TEST ONLY" but is being started in production code. This could have unintended side effects in production environments.

Please ensure this controller:

  1. Has appropriate feature gates or environment checks to prevent it from running in production
  2. Has a clear removal timeline and tracking issue

Consider gating this behind a test environment check:

-	// TODO(thomas): TEST ONLY, REMOVE AGAIN
-	bootstrapTestController := bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)
+	// TODO(thomas): TEST ONLY, REMOVE AGAIN - Track in issue #XXX
+	var bootstrapTestController factory.Controller
+	if os.Getenv("ENABLE_TEST_CONTROLLERS") == "true" {
+		bootstrapTestController = bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)
+	}

Let me verify what this test controller does:


🏁 Script executed:

#!/bin/bash
# Check the implementation of the bootstrap test controller
fd -t f "bootstrap.*test.*controller\.go" --exec cat {}

Length of output: 1708


Critical: remove or gate TEST-ONLY bootstrap controller that moves etcd leader

The controller is instantiated unconditionally in pkg/operator/starter.go (around lines 471–473). Its sync() detects a member named "etcd-bootstrap" and calls ceohelpers.MoveLeaderToAnotherMember — this will attempt to change cluster leadership and must not run in production.

  • Do not run this controller in production: remove it or gate it behind an explicit, default‑off mechanism (build tag, operator feature‑gate, or an env var such as ENABLE_TEST_CONTROLLERS="true").
  • Add a tracked issue and TODO with a concrete removal timeline (reference the issue in the code).
  • If kept for local tests only, ensure it is only registered by test harness/startup logic and add CI checks that it is disabled in release images.

Locations: pkg/operator/starter.go (call site), bootstraptest package (sync implementation that calls MoveLeaderToAnotherMember).

Comment on lines +647 to +648
// TODO(thomas): remove again
go bootstrapTestController.Run(ctx, 1)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue

Test controller started unconditionally.

The test controller is being started without any conditional checks, meaning it will run in all environments including production.

Guard the controller execution:

 	go bootstrapTeardownController.Run(ctx, 1)
-	// TODO(thomas): remove again
-	go bootstrapTestController.Run(ctx, 1)
+	// TODO(thomas): remove again - Track in issue #XXX
+	if bootstrapTestController != nil {
+		go bootstrapTestController.Run(ctx, 1)
+	}
 	go unsupportedConfigOverridesController.Run(ctx, 1)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
// TODO(thomas): remove again
go bootstrapTestController.Run(ctx, 1)
go bootstrapTeardownController.Run(ctx, 1)
// TODO(thomas): remove again - Track in issue #XXX
if bootstrapTestController != nil {
go bootstrapTestController.Run(ctx, 1)
}
go unsupportedConfigOverridesController.Run(ctx, 1)
🤖 Prompt for AI Agents
In pkg/operator/starter.go around lines 647-648 the test controller is started
unconditionally (go bootstrapTestController.Run(ctx, 1)), which will run in all
environments; wrap this call in a guard so it only runs in non-production/test
environments (e.g., check a config flag, environment variable like
ENABLE_TEST_CONTROLLER or an existing runtime mode/IsDev flag), remove or honor
the TODO, and log when the test controller is enabled/disabled; ensure the
default for production is disabled so the controller is not started unless
explicitly enabled.

Copy link
Contributor

openshift-ci bot commented Sep 15, 2025

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/e2e-aws-ovn-serial 42628ec link true /test e2e-aws-ovn-serial
ci/prow/e2e-gcp-disruptive 42628ec link false /test e2e-gcp-disruptive
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown 42628ec link false /test e2e-metal-ovn-ha-cert-rotation-shutdown
ci/prow/e2e-aws-cpms 42628ec link true /test e2e-aws-cpms
ci/prow/e2e-aws-disruptive 42628ec link false /test e2e-aws-disruptive
ci/prow/e2e-gcp-ovn-etcd-scaling 42628ec link false /test e2e-gcp-ovn-etcd-scaling
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown 42628ec link false /test e2e-metal-ovn-sno-cert-rotation-shutdown
ci/prow/e2e-aws-ovn-single-node 42628ec link true /test e2e-aws-ovn-single-node
ci/prow/e2e-gcp-disruptive-ovn 42628ec link false /test e2e-gcp-disruptive-ovn
ci/prow/e2e-azure-ovn-etcd-scaling 42628ec link false /test e2e-azure-ovn-etcd-scaling
ci/prow/e2e-metal-assisted 42628ec link true /test e2e-metal-assisted
ci/prow/e2e-vsphere-ovn-etcd-scaling 42628ec link false /test e2e-vsphere-ovn-etcd-scaling
ci/prow/e2e-aws-etcd-certrotation 42628ec link false /test e2e-aws-etcd-certrotation
ci/prow/e2e-operator-fips 42628ec link false /test e2e-operator-fips
ci/prow/e2e-aws-etcd-recovery 42628ec link false /test e2e-aws-etcd-recovery
ci/prow/e2e-metal-ovn-two-node-fencing 42628ec link false /test e2e-metal-ovn-two-node-fencing
ci/prow/unit 42628ec link true /test unit
ci/prow/e2e-aws-ovn-etcd-scaling 42628ec link true /test e2e-aws-ovn-etcd-scaling
ci/prow/e2e-aws-disruptive-ovn 42628ec link false /test e2e-aws-disruptive-ovn

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. jira/valid-reference Indicates that this PR references a valid Jira ticket of any type.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

9 participants