OCPBUGS-42810: actively move bootstrap member lead #1369

tjungblu · 2024-11-19T13:21:47Z

This PR will actively try to move the leadership away from the bootstrap member to another healthy member.

/hold

openshift-ci-robot · 2024-11-19T13:21:53Z

@tjungblu: This pull request references Jira Issue OCPBUGS-42810, which is invalid:

expected the bug to target the "4.18.0" version, but no target version was set

Comment /jira refresh to re-evaluate validity if changes to the Jira bug are made, or edit the title of this pull request to link to a different bug.

The bug has been updated to refer to the pull request using the external bug tracker.

In response to this:

This PR will actively try to move the leadership away from the bootstrap member to another healthy member.

/hold

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository.

tjungblu · 2024-11-20T13:17:33Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

openshift-ci · 2024-11-20T13:17:37Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ce14d7c0-a741-11ef-9f14-d84ba44f8a80-0

tjungblu · 2024-11-25T11:59:24Z

/retest

tjungblu · 2024-11-27T08:42:15Z

/retest

tjungblu · 2024-11-28T13:07:30Z

/cherry-pick release-4.18

openshift-cherrypick-robot · 2024-11-28T13:07:33Z

@tjungblu: once the present PR merges, I will cherry-pick it on top of release-4.18 in a new PR and assign it to you.

tjungblu · 2024-11-28T15:46:59Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

openshift-ci · 2024-11-28T15:47:05Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/01c7f930-ada0-11ef-8674-28fc71df2147-0

tjungblu · 2024-11-28T17:11:40Z

/retest-required

tjungblu · 2024-11-28T17:12:08Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

openshift-ci · 2024-11-28T17:12:12Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e6fa4930-adab-11ef-8910-6373b3a73b44-0

tjungblu · 2024-12-02T12:09:01Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

openshift-ci · 2024-12-02T12:09:21Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/37cd8410-b0a6-11ef-8f86-550f1a2ac5ba-0

tjungblu · 2024-12-03T14:10:44Z

/retest

/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10

openshift-ci · 2024-12-03T14:10:48Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/63892fc0-b180-11ef-99d0-e245fa792cd0-0

lance5890 · 2024-12-06T01:15:00Z

pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go

-	err = c.removeBootstrap(timeoutCtx, safeToRemoveBootstrap, hasBootstrap, bootstrapID)
+	if hasBootstrap {
+		if err := c.ensureBootstrapIsNotLeader(ctx, bootstrapMember); err != nil {
+			klog.Errorf("error while ensuring bootstrap is not leader: %v", err)


should we return err here?

Not really. I wouldn't want the controller to go degraded when that operation fails once. Mind you, this happens very rarely and affects our CI pass rates in only a few runs out of hundreds.
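The trade-off in this exchange can be shown in isolation. The following is a minimal sketch, not the operator's code: `syncOnce`, `failing` and `succeeding` are hypothetical names invented for illustration. A best-effort step only logs its error, so one failed attempt does not fail the whole sync, while an error from the required step is still returned.

```go
package main

import (
	"errors"
	"fmt"
	"log"
)

// failing and succeeding are hypothetical stand-ins for real operations.
func failing() error { return errors.New("transient failure") }

func succeeding() error { return nil }

// syncOnce runs one sync iteration. A failure of the best-effort step is
// only logged, while a failure of the required step is returned to the caller.
func syncOnce(bestEffort, required func() error) error {
	if err := bestEffort(); err != nil {
		// Log and carry on: one failed attempt should not fail the whole sync.
		log.Printf("error in best-effort step: %v", err)
	}
	if err := required(); err != nil {
		return fmt.Errorf("required step failed: %w", err)
	}
	return nil
}
```

With these helpers, `syncOnce(failing, succeeding)` returns nil and `syncOnce(succeeding, failing)` returns a wrapped error. The cost of this design is that a persistent failure of the best-effort step is visible only in the logs.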

vrutkovs · 2025-01-08T09:53:35Z

/retest

vrutkovs · 2025-01-08T12:52:33Z

/test e2e-metal-ovn-ha-cert-rotation-shutdown

vrutkovs · 2025-01-14T13:30:35Z

/test e2e-metal-ovn-ha-cert-rotation-shutdown e2e-metal-ovn-sno-cert-rotation-shutdown

tjungblu · 2025-09-02T09:52:56Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

openshift-ci · 2025-09-02T09:52:59Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9aae7ad0-87e2-11f0-8e36-9df58d5d7d07-0

tjungblu · 2025-09-03T05:57:03Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

openshift-ci · 2025-09-03T05:57:07Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d13fd050-888a-11f0-8776-036a5a213b71-0

tjungblu · 2025-09-03T12:19:51Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

openshift-ci · 2025-09-03T12:20:25Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4b6983f0-88c0-11f0-9736-1b2d461ebfab-0

tjungblu · 2025-09-04T06:00:07Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10

openshift-ci · 2025-09-04T06:00:14Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/69083890-8954-11f0-9223-cf9f15e3c9ba-0

tjungblu · 2025-09-04T11:01:20Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 20

openshift-ci · 2025-09-04T11:01:25Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7e345990-897e-11f0-8451-b91fd76131ee-0

tjungblu · 2025-09-05T09:45:48Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade 20

openshift-ci · 2025-09-05T09:45:51Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1a7dde20-8a3d-11f0-9f46-8138e528642f-0

tjungblu · 2025-09-09T12:07:38Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade 20

openshift-ci · 2025-09-09T12:08:22Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/94d60770-8d75-11f0-953c-decb559445d5-0

tjungblu · 2025-09-10T11:45:11Z

/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 20

openshift-ci · 2025-09-10T11:45:19Z

@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command

periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade

See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9c92dad0-8e3b-11f0-99df-a62f065d382a-0

hasbro17

Some nits and questions for my own sake but seems good overall!

hasbro17 · 2025-09-11T19:27:43Z

pkg/operator/ceohelpers/leader.go

+		return false, err
+	}
+
+	klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])


Super pedantic nit since you're already fixing spelling mistakes in this PR

Suggested change

klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

Also, while we're here: does MoveLeader() == nil mean the leadership was actually transferred, or have we only made a request to the leader, so that it may still be in the process of stepping down or we may be in the midst of an election?

Asking since the warning here says Moving instead of Moved.

Okay so it seems like it waits for the actual transfer to finish and not just a shallow request.

cluster-etcd-operator/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go

Lines 1513 to 1520 in 61b7fb5

	s.r.TransferLeadership(ctx, lead, transferee)
	for s.Lead() != transferee {
		select {
		case <-ctx.Done(): // time out
			return ErrTimeoutLeaderTransfer
		case <-time.After(interval):
		}
	}

In that case wdyt about the past tense:

Suggested change

klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

klog.Warningf("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

Yep, sounds good! Thanks for proofreading this. I don't really understand why my spell checker in GoLand picks up on the bOOstrap but not on the success 👍
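The wait loop quoted from etcd's `server.go` in this thread can be tried outside etcd. What follows is a simplified sketch, not etcd's implementation: `waitForLeader`, `simulateTransfer` and `errTimeoutLeaderTransfer` are hypothetical names, and the "cluster" is just an atomic counter whose value changes after a delay. It shows why a nil result implies that the new leader has already been observed.

```go
package main

import (
	"context"
	"errors"
	"sync/atomic"
	"time"
)

// errTimeoutLeaderTransfer is a local stand-in for etcd's ErrTimeoutLeaderTransfer.
var errTimeoutLeaderTransfer = errors.New("leader transfer timed out")

// waitForLeader polls currentLeader every interval until it reports
// transferee, or returns an error once ctx is done.
func waitForLeader(ctx context.Context, currentLeader func() uint64, transferee uint64, interval time.Duration) error {
	for currentLeader() != transferee {
		select {
		case <-ctx.Done(): // deadline exceeded or cancelled
			return errTimeoutLeaderTransfer
		case <-time.After(interval):
		}
	}
	return nil
}

// simulateTransfer models a cluster whose leader changes from member 1 to
// member 2 after switchAfterMs, and waits at most timeoutMs for that change.
func simulateTransfer(switchAfterMs, timeoutMs int) error {
	var leader atomic.Uint64
	leader.Store(1)
	timer := time.AfterFunc(time.Duration(switchAfterMs)*time.Millisecond, func() { leader.Store(2) })
	defer timer.Stop()

	ctx, cancel := context.WithTimeout(context.Background(), time.Duration(timeoutMs)*time.Millisecond)
	defer cancel()
	return waitForLeader(ctx, leader.Load, 2, 5*time.Millisecond)
}
```

`simulateTransfer(20, 2000)` returns nil only after member 2 has been seen as leader, whereas `simulateTransfer(2000, 50)` gives up with the timeout error. That is the behaviour the past-tense "Moved" log message relies on.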

hasbro17 · 2025-09-11T19:37:30Z

pkg/operator/ceohelpers/leader.go

+	}
+
+	if otherMember == nil {
+		return false, fmt.Errorf("no follower member found for the members: %v", memberList)


Just some more context on the operation in case this ever does happen.

Suggested change

return false, fmt.Errorf("no follower member found for the members: %v", memberList)

return false, fmt.Errorf("no follower member found for leadership transfer: %v", memberList)

thank you, applied!

hasbro17 · 2025-09-11T20:16:08Z

pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go

+	// if we have just moved it, we will skip this sync iteration to backoff the controller - the next resync will happen after a minute anyway
+	if moved {


This seems okay so not disagreeing here, but for my knowledge what is the potential negative impact of removing the member right after we transfer?
At first I thought we're waiting for the transfer to happen but as I found out below MoveLeader() seems to be synchronous.

If we are indeed waiting for something to stabilize then just thinking if it's possible that may not happen by the next resync.

Or just so we get a chance to update the status first.

If we are indeed waiting for something to stabilize then just thinking if it's possible that may not happen by the next resync.

great question: what are we waiting on here. My intuition was to avoid disrupting the API too much, the clients will get the LeaderChangedErr (Error from server: etcdserver: leader changed) on the move. Then potentially removing the member could also disrupt it further, e.g. during a static pod rollout.

I know this is highly hypothetical...
Unfortunately, I do not have any data yet on whether this would be really necessary. It is just too damn hard to reproduce this case.

Let me quickly put together a controller that will always make the bootstrap node the leader during the bootstrapping phase.

here's a run with this controller, you can grep the logs with TEST_ONLY

https://prow.ci.openshift.org/view/gs/test-platform-results/pr-logs/pull/openshift_cluster-etcd-operator/1369/pull-ci-openshift-cluster-etcd-operator-main-e2e-azure/1967558360303669248

https://grafana-loki.ci.openshift.org/explore?orgId=1&left=%7B%22datasource%22:%22PCEB727DF2F34084E%22,%22queries%22:%5B%7B%22expr%22:%22%7Binvoker%3D%5C%22openshift-internal-ci%2Fpull-ci-openshift-cluster-etcd-operator-main-e2e-azure%2F1967558360303669248%5C%22%7D%20%7C~%20%5C%22TEST_ONLY%7Cremoved%20bootstrap%5C%22%22,%22refId%22:%22A%22,%22editorMode%22:%22code%22,%22queryType%22:%22range%22%7D%5D,%22range%22:%7B%22from%22:%221757931212276%22,%22to%22:%221757967212278%22%7D%7D

it also seems there was no actual move being done before removing, even though we have moved the leader to the bootstrap node a minute earlier.

it still kinda shows what I was worried about:

I0915 13:00:26.367586 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"6e338218-3237-4f73-9c33-f83a5f40b8a0", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Bootstrap member removed' successfully removed bootstrap member [32326af0e011bf0a]

E0915 13:00:26.377135 1 event.go:359] "Server rejected event (will not retry!)" err="rpc error: code = Unknown desc = raft: stopped" event="&Event{ObjectMeta:{etcd-operator.1865763b3da4743d openshift-etcd-operator 0 0001-01-01 00:00:00 +0000 UTC map[] map[] [] []

it seems that this is the case of scaling from 4->3, but during a static pod rollout, so that breaks quorum and causes raft to stop.

Anyway, I need a bit more time to test this, so I'll pick this up again when I have some time left with RIT :)
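For readers following the quorum remark, the underlying arithmetic is easy to check. This is a generic sketch of Raft majority sizes, assuming the bootstrap member and the three control-plane members are all voting members. It does not by itself establish what happened in the run above. The function names are made up for this example.

```go
package main

// quorum returns how many voting members must be available for a cluster of
// n voting members to make progress: floor(n/2) + 1.
func quorum(n int) int {
	return n/2 + 1
}

// faultTolerance returns how many voting members may be unavailable at once
// without losing quorum.
func faultTolerance(n int) int {
	return n - quorum(n)
}
```

With four members, `quorum(4)` is 3 and `faultTolerance(4)` is 1. With three members, `quorum(3)` is 2 and `faultTolerance(3)` is also 1. In other words, one member restarting during a static pod rollout already uses up the whole failure budget in either configuration.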

openshift-ci · 2025-09-11T20:18:30Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: dusk125, hasbro17, tjungblu

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [dusk125,hasbro17,tjungblu]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

hasbro17 · 2025-09-11T20:24:24Z

/retest-required

This PR will actively try to move the leadership away from the bootstrap member to another member. Signed-off-by: Thomas Jungblut <[email protected]>

coderabbitai · 2025-09-15T11:57:21Z

Walkthrough

Adds leader movement capability and exposes etcd member status responses. Updates etcd client interfaces and fake implementations, introduces leadership helper utilities and tests, refactors bootstrap teardown to avoid removing a leader, adds a test-only bootstrap leadership controller, and wires it into operator startup.

Changes

| Cohort / File(s) | Change summary |
| --- | --- |
| Etcd client API and impl `pkg/etcdcli/etcdcli.go`, `pkg/etcdcli/helpers.go`, `pkg/etcdcli/interfaces.go` | Introduced LeaderMover interface and MoveLeader method; updated EtcdClient to embed LeaderMover. Changed MemberStatus to return `*clientv3.StatusResponse, error` in interfaces, real client, and fake client. Added fake options to set/get leader and updated fake status handling. |
| Bootstrap teardown controller and tests `pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go`, `pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go` | Refactored bootstrap handling to use `*etcdserverpb.Member` instead of ID. Added leadership check/move before removal via ceohelpers. Introduced new condition/event for leader movement. Updated tests to new signatures, status responses, leader simulation, and assertions. |
| Leadership helpers and tests `pkg/operator/ceohelpers/leader.go`, `pkg/operator/ceohelpers/leader_test.go` | Added FindLeader to aggregate leader from member statuses and validate consistency. Added MoveLeaderToAnotherMember to move leadership to a follower. Created unit tests with mocks covering success and error scenarios. |
| Bootstrap test controller and startup wiring `pkg/operator/bootstraptest/bootstrap_lead_test_controller.go`, `pkg/operator/starter.go` | Added a test-only controller to attempt moving leadership to the bootstrap member; logs/events outcomes. Wired controller into operator startup with a 2-minute resync. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Pre-merge checks and finishing touches

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 8.33% which is insufficient. The required threshold is 80.00%. | You can run `@coderabbitai generate docstrings` to improve docstring coverage. |

✅ Passed checks (2 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title Check | ✅ Passed | The title "OCPBUGS-42810: actively move bootstrap member lead" concisely names the main change (actively moving leadership away from the bootstrap member) and references the tracking bug, which matches the implemented MoveLeader and related bootstrap logic. It is specific, focused, and suitable for a teammate scanning PR history. The phrasing is clear enough to convey the primary intent of the changeset. |
| Description Check | ✅ Passed | The description states that the PR will actively attempt to move leadership away from the bootstrap member, which directly matches the code changes (MoveLeader, leader detection, and bootstrap teardown updates). It is concise and on-topic rather than off-base or misleading. The "/hold" marker correctly indicates the author does not want this merged immediately. |

tjungblu · 2025-09-15T11:57:49Z

/jira refresh

openshift-ci-robot · 2025-09-15T11:57:56Z

@tjungblu: This pull request references Jira Issue OCPBUGS-42810, which is valid. The bug has been moved to the POST state.

3 validation(s) were run on this bug

bug is open, matching expected state (open)
bug target version (4.21.0) matches configured target version for branch (4.21.0)
bug is in the state New, which is one of the valid states (NEW, ASSIGNED, POST)

Requesting review from QA contact:
/cc @geliu2016

coderabbitai

Actionable comments posted: 3

♻️ Duplicate comments (1)

pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (1)

154-174: Leadership move and backoff strategy looks good

The implementation correctly:

Ensures bootstrap is not the leader before removal

Backs off for a resync cycle after moving leadership

Records appropriate events and conditions

This aligns with the discussion about avoiding API disruption during leadership changes.
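The back-off described in this comment amounts to a small decision table. The sketch below is a hypothetical reduction of one sync iteration. `nextStep` and the `step` values are invented for illustration, and the order in which the checks run is an assumption, since the controller's full sync function is not shown in this thread.

```go
package main

// step names what a single sync iteration would do.
type step string

const (
	stepNothing    step = "nothing to do"
	stepWait       step = "wait until removal is safe"
	stepMoveLeader step = "move leader away, skip removal this round"
	stepRemove     step = "remove bootstrap member"
)

// nextStep decides the action for one iteration. Moving the leader and
// removing the member never happen in the same iteration: after a move the
// removal is left to the next resync.
func nextStep(hasBootstrap, safeToRemove, bootstrapIsLeader bool) step {
	switch {
	case !hasBootstrap:
		return stepNothing
	case !safeToRemove:
		return stepWait
	case bootstrapIsLeader:
		return stepMoveLeader
	default:
		return stepRemove
	}
}
```

For example, `nextStep(true, true, true)` yields `stepMoveLeader`, and only a later call with `bootstrapIsLeader == false` yields `stepRemove`. This mirrors the test case named "safe, has bootstrap, moving leader, not removing the bootstrap member".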

🧹 Nitpick comments (8)

pkg/etcdcli/helpers.go (3)
108-114: Consider adding member existence validation in MoveLeader.

The method updates the leader for all statuses without verifying if toMember actually exists in the cluster. This could lead to an inconsistent state in tests.

Add validation to ensure the target member exists:
 func (f *fakeEtcdClient) MoveLeader(ctx context.Context, toMember uint64) error {
+	// Validate that the target member exists
+	memberExists := false
+	for _, m := range f.members {
+		if m.ID == toMember {
+			memberExists = true
+			break
+		}
+	}
+	if !memberExists {
+		return fmt.Errorf("target member %d does not exist", toMember)
+	}
+
 	for _, status := range f.opts.status {
 		status.Leader = toMember
 	}
 
 	return nil
 }
173-185: Potential inconsistency in default leader assignment.

When no status is found for a member, the method returns a default StatusResponse with the member set as its own leader. This might not reflect the actual cluster state and could lead to incorrect leader detection logic.

Consider returning an error or setting the leader to 0 (no leader) instead:
 func (f *fakeEtcdClient) MemberStatus(ctx context.Context, member *etcdserverpb.Member) (*clientv3.StatusResponse, error) {
 	// Find the status for this member
 	for _, status := range f.opts.status {
 		if status.Header != nil && status.Header.MemberId == member.ID {
 			return status, nil
 		}
 	}
-	// Return a default status if none found
-	return &clientv3.StatusResponse{
-		Header: &etcdserverpb.ResponseHeader{MemberId: member.ID},
-		Leader: member.ID, // Default to self as leader
-	}, nil
+	// Return an error if status not found
+	return nil, fmt.Errorf("no status found for member %d", member.ID)
 }
272-278: WithLeader doesn't initialize status headers.

The WithLeader option modifies existing statuses but doesn't ensure they have proper headers. This could cause nil pointer issues when the statuses are later accessed.

Consider initializing headers when setting the leader:
 func WithLeader(leader uint64) FakeClientOption {
 	return func(fo *FakeClientOptions) {
 		for _, status := range fo.status {
+			if status.Header == nil {
+				status.Header = &etcdserverpb.ResponseHeader{}
+			}
 			status.Leader = leader
 		}
 	}
 }
pkg/operator/ceohelpers/leader.go (2)
31-52: Consider adding retry logic for transient failures.

The MoveLeader operation might fail due to transient network issues or temporary unavailability. Consider implementing retry logic with exponential backoff for production resilience.

Also, the log message at line 50 should use Info level instead of Warning since this is a successful operation:
-	klog.Warningf("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
+	klog.Infof("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
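The retry suggestion above could look roughly like the following. It is a generic sketch using only the standard library. `retryWithBackoff`, `failFirst` and `demo` are hypothetical helpers, not part of the operator or of any etcd client. Whether such a retry is needed at all, given that the controller resyncs periodically anyway, is a separate question.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// retryWithBackoff calls op up to attempts times and doubles the delay after
// every failure. It stops early when op succeeds or ctx is done.
func retryWithBackoff(ctx context.Context, attempts int, initial time.Duration, op func() error) error {
	if attempts < 1 {
		return errors.New("attempts must be at least 1")
	}
	delay := initial
	var lastErr error
	for i := 0; i < attempts; i++ {
		if lastErr = op(); lastErr == nil {
			return nil
		}
		if i == attempts-1 {
			break // no point in sleeping after the final attempt
		}
		select {
		case <-ctx.Done():
			return ctx.Err()
		case <-time.After(delay):
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", attempts, lastErr)
}

// failFirst returns an operation that fails its first n calls and then succeeds.
func failFirst(n int) func() error {
	calls := 0
	return func() error {
		calls++
		if calls <= n {
			return errors.New("transient failure")
		}
		return nil
	}
}

// demo retries an operation that fails `failures` times, starting at a 1ms delay.
func demo(failures, attempts int) error {
	return retryWithBackoff(context.Background(), attempts, time.Millisecond, failFirst(failures))
}
```

`demo(2, 5)` succeeds on the third call, while `demo(5, 3)` exhausts its attempts and returns the wrapped last error.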
34-39: Consider a more intelligent follower selection strategy.

Currently, the function selects the first available follower. In production, you might want to consider factors like member health, load, or geographic distribution.

For better load distribution and resilience, consider:

Checking member health status before selection

Implementing round-robin or least-recently-used selection

Avoiding members that are learners or unhealthy

This would help ensure leadership moves to the most suitable member rather than just the first available one.
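As an illustration of the selection criteria listed above, here is a minimal sketch. The `member` type, `pickTransferee` and the `healthy` callback are hypothetical simplifications, not the operator's types. The real code works on `*etcdserverpb.Member` values and would need its own health source.

```go
package main

// member is a hypothetical, simplified view of an etcd member.
type member struct {
	ID        uint64
	Name      string
	IsLearner bool
}

// pickTransferee returns the first member that is not the current leader, is
// not a learner and passes the healthy check. ok is false if none qualifies.
func pickTransferee(leaderID uint64, members []member, healthy func(member) bool) (m member, ok bool) {
	for _, candidate := range members {
		if candidate.ID == leaderID || candidate.IsLearner || !healthy(candidate) {
			continue
		}
		return candidate, true
	}
	return member{}, false
}
```

Keeping the health check as a callback leaves the policy (endpoint health, revision lag, and so on) to the caller. Round-robin or least-recently-used ordering could be layered on by sorting `members` before the call.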
pkg/operator/bootstraptest/bootstrap_lead_test_controller.go (1)

35-66: Consider adding metrics for test controller operations

This test controller modifies cluster state by moving leadership. For debugging and observability, consider adding metrics to track:

Number of leadership move attempts

Success/failure rates

Time taken for leadership moves
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (2)
105-107: Typo in function name

The function name has a typo: setSuccessfulBoostrapRemovalStatus should be setSuccessfulBootstrapRemovalStatus.

This is already fixed in the code (line 105 shows the correct spelling), but ensure consistency throughout.

272-273: Consider validating ClientURLs is not empty

The code accesses bootstrapMember.GetClientURLs()[0] without checking if the slice is non-empty. While this should always be populated in practice, adding a defensive check would prevent potential panics.

Apply this diff to add a defensive check:
-klog.Warningf("Bootstrap member [%x] (%s) detected as leader, trying to move elsewhere...", bootstrapMember.ID, bootstrapMember.GetClientURLs()[0])
+clientURL := "unknown"
+if len(bootstrapMember.GetClientURLs()) > 0 {
+  clientURL = bootstrapMember.GetClientURLs()[0]
+}
+klog.Warningf("Bootstrap member [%x] (%s) detected as leader, trying to move elsewhere...", bootstrapMember.ID, clientURL)

📜 Review details

📥 Commits

Reviewing files that changed from the base of the PR and between 61b7fb5 and 42628ec.

📒 Files selected for processing (9)

pkg/etcdcli/etcdcli.go (2 hunks)
pkg/etcdcli/helpers.go (3 hunks)
pkg/etcdcli/interfaces.go (3 hunks)
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (7 hunks)
pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go (7 hunks)
pkg/operator/bootstraptest/bootstrap_lead_test_controller.go (1 hunks)
pkg/operator/ceohelpers/leader.go (1 hunks)
pkg/operator/ceohelpers/leader_test.go (1 hunks)
pkg/operator/starter.go (3 hunks)

🔇 Additional comments (9)

pkg/etcdcli/interfaces.go (3)

17-33: LGTM! Clean interface design.

The addition of LeaderMover to the EtcdClient interface is well-structured and follows the existing pattern of composing interfaces.

68-70: LGTM! Well-designed leader movement interface.

The LeaderMover interface is clean and focused with a clear single responsibility.

93-96: LGTM! Improved MemberStatus signature.

The updated signature returning (*clientv3.StatusResponse, error) provides more detailed information and better aligns with etcd's client API patterns.

pkg/etcdcli/etcdcli.go (2)

267-279: LGTM! Clean implementation of MoveLeader.

The implementation follows the established pattern with proper timeout handling and resource cleanup.

460-474: LGTM! Improved MemberStatus implementation.

The refactored implementation is cleaner and more direct, eliminating the previous status mapping logic in favor of returning the actual etcd status response.

pkg/operator/ceohelpers/leader.go (1)

12-29: LGTM! Well-structured leader detection logic.

The function properly validates consistency across members and handles errors appropriately.
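The consistency validation mentioned here can be sketched as follows. The actual `FindLeader` in `pkg/operator/ceohelpers/leader.go` is not reproduced in this thread, so this is only an assumed shape: `status` and `findLeader` are hypothetical, and the real helper works with `*clientv3.StatusResponse` values obtained per member.

```go
package main

import (
	"errors"
	"fmt"
)

// status is a hypothetical, reduced form of a status response: which member
// answered and which member it currently considers the leader (0 = none).
type status struct {
	MemberID uint64
	Leader   uint64
}

// findLeader returns the leader ID that all statuses agree on. It fails if
// there are no statuses, if a member reports no leader, or if members disagree.
func findLeader(statuses []status) (uint64, error) {
	if len(statuses) == 0 {
		return 0, errors.New("no member statuses available")
	}
	leader := statuses[0].Leader
	for _, s := range statuses {
		if s.Leader == 0 {
			return 0, fmt.Errorf("member %x reports no leader", s.MemberID)
		}
		if s.Leader != leader {
			return 0, fmt.Errorf("members disagree on the leader: %x vs %x", leader, s.Leader)
		}
	}
	return leader, nil
}
```

Refusing to answer on disagreement is the conservative choice: during an election different members can briefly report different leaders, and acting on either answer could move leadership from the wrong member.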

pkg/operator/ceohelpers/leader_test.go (1)

1-315: Comprehensive test coverage!

Excellent test implementation with thorough coverage of edge cases including:

Various leader positions in the member list

Single-member clusters

Empty and nil member lists

Error handling scenarios

Inconsistent leader reporting

The table-driven approach with clear test names makes the tests maintainable and easy to understand.

pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go (1)

375-389: LGTM: Leader movement test case properly validates the new behavior

The new test case "safe, has bootstrap, moving leader, not removing the bootstrap member" correctly validates that:

The bootstrap member is not removed when it's the leader

The appropriate conditions are set (conditionEnoughEtcdMembers and conditionBootstrapLeaderMoved)

No events are emitted (0 expected events)

pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (1)

257-274: Solid implementation of leader check and movement

The ensureBootstrapIsNotLeader method properly:

Lists members to get current state

Finds the current leader

Only attempts to move leadership if bootstrap is the leader

Delegates to the helper function for the actual move

The error handling is comprehensive and the logging provides good observability.

coderabbitai · 2025-09-15T12:07:04Z

pkg/operator/bootstraptest/bootstrap_lead_test_controller.go

+	if hasBootstrap {
+		klog.Warningf("TEST_ONLY moving leader to the bootstrap member")
+		moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
+		if err != nil {
+			return err
+		}
+		if moved {
+			klog.Warningf("TEST_ONLY successfully moved to the bootstrap member")
+		} else {
+			klog.Warningf("TEST_ONLY failed to move to the bootstrap member")
+		}


⚠️ Potential issue

Incorrect leader movement logic detected

The message at line 53 says "moving leader to the bootstrap member", but the code at line 54 calls MoveLeaderToAnotherMember which moves leadership away from the bootstrap member. The log messages are contradictory to the actual behavior.

Apply this diff to fix the log messages:

 	if hasBootstrap {
-		klog.Warningf("TEST_ONLY moving leader to the bootstrap member")
+		klog.Warningf("TEST_ONLY moving leader away from the bootstrap member")
 		moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
 		if err != nil {
 			return err
 		}
 		if moved {
-			klog.Warningf("TEST_ONLY successfully moved to the bootstrap member")
+			klog.Warningf("TEST_ONLY successfully moved away from the bootstrap member")
 		} else {
-			klog.Warningf("TEST_ONLY failed to move to the bootstrap member")
+			klog.Warningf("TEST_ONLY failed to move away from the bootstrap member")
 		}
 	}

Suggested change

	if hasBootstrap {
		klog.Warningf("TEST_ONLY moving leader to the bootstrap member")
		moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
		if err != nil {
			return err
		}
		if moved {
			klog.Warningf("TEST_ONLY successfully moved to the bootstrap member")
		} else {
			klog.Warningf("TEST_ONLY failed to move to the bootstrap member")
		}

	if hasBootstrap {
		klog.Warningf("TEST_ONLY moving leader away from the bootstrap member")
		moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
		if err != nil {
			return err
		}
		if moved {
			klog.Warningf("TEST_ONLY successfully moved away from the bootstrap member")
		} else {
			klog.Warningf("TEST_ONLY failed to move away from the bootstrap member")
		}
	}

coderabbitai · 2025-09-15T12:07:04Z

pkg/operator/starter.go

+	// TODO(thomas): TEST ONLY, REMOVE AGAIN
+	bootstrapTestController := bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)
+


💡 Verification agent

🧩 Analysis chain

Test-only controller in production code needs removal plan.

The bootstrap test controller is marked as "TEST ONLY" but is being started in production code. This could have unintended side effects in production environments.

Please ensure this controller:

Has appropriate feature gates or environment checks to prevent it from running in production

Has a clear removal timeline and tracking issue

Consider gating this behind a test environment check:

-	// TODO(thomas): TEST ONLY, REMOVE AGAIN
-	bootstrapTestController := bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)
+	// TODO(thomas): TEST ONLY, REMOVE AGAIN - Track in issue #XXX
+	var bootstrapTestController factory.Controller
+	if os.Getenv("ENABLE_TEST_CONTROLLERS") == "true" {
+		bootstrapTestController = bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)
+	}

Let me verify what this test controller does:

🏁 Script executed:

#!/bin/bash
# Check the implementation of the bootstrap test controller
fd -t f "bootstrap.*test.*controller\.go" --exec cat {}

Length of output: 1708

Critical: remove or gate TEST-ONLY bootstrap controller that moves etcd leader

The controller is instantiated unconditionally in pkg/operator/starter.go (around lines 471–473). Its sync() detects a member named "etcd-bootstrap" and calls ceohelpers.MoveLeaderToAnotherMember — this will attempt to change cluster leadership and must not run in production.

Do not run this controller in production: remove it or gate it behind an explicit, default‑off mechanism (build tag, operator feature‑gate, or an env var such as ENABLE_TEST_CONTROLLERS="true").

Add a tracked issue and TODO with a concrete removal timeline (reference the issue in the code).

If kept for local tests only, ensure it is only registered by test harness/startup logic and add CI checks that it is disabled in release images.

Locations: pkg/operator/starter.go (call site), bootstraptest package (sync implementation that calls MoveLeaderToAnotherMember).
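
The default-off gate recommended above could be expressed as a small helper that decides, from an environment lookup, whether test-only controllers may be constructed at all. This is a minimal sketch, not the operator's actual code: the variable name ENABLE_TEST_CONTROLLERS comes from the review suggestion, while `testControllersEnabled` and the injected lookup function are hypothetical names introduced for illustration.

```go
package main

import (
	"fmt"
	"os"
)

// enableTestControllersEnv is the variable name proposed in the review.
// It is an assumption here, not an existing operator setting.
const enableTestControllersEnv = "ENABLE_TEST_CONTROLLERS"

// testControllersEnabled reports whether test-only controllers may run.
// The lookup function is injected (os.LookupEnv in a real process) so the
// gate can be exercised without mutating the process environment.
func testControllersEnabled(lookup func(string) (string, bool)) bool {
	value, ok := lookup(enableTestControllersEnv)
	// Default off: only the exact string "true" enables the controllers.
	return ok && value == "true"
}

func main() {
	if testControllersEnabled(os.LookupEnv) {
		fmt.Println("test-only controllers enabled")
	} else {
		fmt.Println("test-only controllers disabled (default)")
	}
}
```

Treating anything other than the exact string "true" as disabled keeps an unset or mistyped variable from starting a controller that transfers etcd leadership.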

coderabbitai · 2025-09-15T12:07:04Z

pkg/operator/starter.go

+	// TODO(thomas): remove again
+	go bootstrapTestController.Run(ctx, 1)


⚠️ Potential issue

Test controller started unconditionally.

The test controller is being started without any conditional checks, meaning it will run in all environments including production.

Guard the controller execution:

 	go bootstrapTeardownController.Run(ctx, 1)
-	// TODO(thomas): remove again
-	go bootstrapTestController.Run(ctx, 1)
+	// TODO(thomas): remove again - Track in issue #XXX
+	if bootstrapTestController != nil {
+		go bootstrapTestController.Run(ctx, 1)
+	}
 	go unsupportedConfigOverridesController.Run(ctx, 1)

📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change

// TODO(thomas): remove again
go bootstrapTestController.Run(ctx, 1)

go bootstrapTeardownController.Run(ctx, 1)
// TODO(thomas): remove again - Track in issue #XXX
if bootstrapTestController != nil {
	go bootstrapTestController.Run(ctx, 1)
}
go unsupportedConfigOverridesController.Run(ctx, 1)

🤖 Prompt for AI Agents

In pkg/operator/starter.go around lines 647-648 the test controller is started unconditionally (go bootstrapTestController.Run(ctx, 1)), which will run in all environments; wrap this call in a guard so it only runs in non-production/test environments (e.g., check a config flag, environment variable like ENABLE_TEST_CONTROLLER or an existing runtime mode/IsDev flag), remove or honor the TODO, and log when the test controller is enabled/disabled; ensure the default for production is disabled so the controller is not started unless explicitly enabled.

openshift-ci · 2025-09-15T16:58:24Z

@tjungblu: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name	Commit	Details	Required	Rerun command
ci/prow/e2e-aws-ovn-serial	`42628ec`	link	true	`/test e2e-aws-ovn-serial`
ci/prow/e2e-gcp-disruptive	`42628ec`	link	false	`/test e2e-gcp-disruptive`
ci/prow/e2e-metal-ovn-ha-cert-rotation-shutdown	`42628ec`	link	false	`/test e2e-metal-ovn-ha-cert-rotation-shutdown`
ci/prow/e2e-aws-cpms	`42628ec`	link	true	`/test e2e-aws-cpms`
ci/prow/e2e-aws-disruptive	`42628ec`	link	false	`/test e2e-aws-disruptive`
ci/prow/e2e-gcp-ovn-etcd-scaling	`42628ec`	link	false	`/test e2e-gcp-ovn-etcd-scaling`
ci/prow/e2e-metal-ovn-sno-cert-rotation-shutdown	`42628ec`	link	false	`/test e2e-metal-ovn-sno-cert-rotation-shutdown`
ci/prow/e2e-aws-ovn-single-node	`42628ec`	link	true	`/test e2e-aws-ovn-single-node`
ci/prow/e2e-gcp-disruptive-ovn	`42628ec`	link	false	`/test e2e-gcp-disruptive-ovn`
ci/prow/e2e-azure-ovn-etcd-scaling	`42628ec`	link	false	`/test e2e-azure-ovn-etcd-scaling`
ci/prow/e2e-metal-assisted	`42628ec`	link	true	`/test e2e-metal-assisted`
ci/prow/e2e-vsphere-ovn-etcd-scaling	`42628ec`	link	false	`/test e2e-vsphere-ovn-etcd-scaling`
ci/prow/e2e-aws-etcd-certrotation	`42628ec`	link	false	`/test e2e-aws-etcd-certrotation`
ci/prow/e2e-operator-fips	`42628ec`	link	false	`/test e2e-operator-fips`
ci/prow/e2e-aws-etcd-recovery	`42628ec`	link	false	`/test e2e-aws-etcd-recovery`
ci/prow/e2e-metal-ovn-two-node-fencing	`42628ec`	link	false	`/test e2e-metal-ovn-two-node-fencing`
ci/prow/unit	`42628ec`	link	true	`/test unit`
ci/prow/e2e-aws-ovn-etcd-scaling	`42628ec`	link	true	`/test e2e-aws-ovn-etcd-scaling`
ci/prow/e2e-aws-disruptive-ovn	`42628ec`	link	false	`/test e2e-aws-disruptive-ovn`

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Nov 19, 2024

openshift-ci-robot added jira/valid-reference Indicates that this PR references a valid Jira ticket of any type. jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. labels Nov 19, 2024

openshift-ci bot requested review from Elbehery and dusk125 November 19, 2024 13:22

openshift-ci bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Nov 19, 2024

tjungblu force-pushed the OCPBUGS-42810_moveleader branch 2 times, most recently from 5720e20 to a747960 Compare November 20, 2024 13:08

tjungblu force-pushed the OCPBUGS-42810_moveleader branch from a747960 to 0c031aa Compare November 20, 2024 13:25

tjungblu force-pushed the OCPBUGS-42810_moveleader branch from 0c031aa to 369b1cd Compare November 28, 2024 13:06

lance5890 reviewed Dec 6, 2024


hasbro17 approved these changes Sep 11, 2025


tjungblu added 2 commits September 15, 2025 13:54

OCPBUGS-42810: actively move bootstrap member lead

a459851

This PR will actively try to move the leadership away from the bootstrap member to another member.

Signed-off-by: Thomas Jungblut <[email protected]>

TEST_ONLY force bootstrap member being the leader

42628ec

tjungblu force-pushed the OCPBUGS-42810_moveleader branch from 994ce3f to 42628ec Compare September 15, 2025 11:57

openshift-ci-robot added the jira/valid-bug Indicates that a referenced Jira bug is valid for the branch this PR is targeting. label Sep 15, 2025

openshift-ci-robot removed the jira/invalid-bug Indicates that a referenced Jira bug is invalid for the branch this PR is targeting. label Sep 15, 2025

openshift-ci bot requested a review from geliu2016 September 15, 2025 11:58

coderabbitai bot reviewed Sep 15, 2025


	klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
	klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

	s.r.TransferLeadership(ctx, lead, transferee)
	for s.Lead() != transferee {
		select {
		case <-ctx.Done(): // time out
			return ErrTimeoutLeaderTransfer
		case <-time.After(interval):
		}
	}

	klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
	klog.Warningf("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])

	return false, fmt.Errorf("no follower member found for the members: %v", memberList)
	return false, fmt.Errorf("no follower member found for leadership transfer: %v", memberList)

		// if we have just moved it, we will skip this sync iteration to backoff the controller - the next resync will happen after a minute anyway
		if moved {

		// TODO(thomas): TEST ONLY, REMOVE AGAIN
		bootstrapTestController := bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)

		// TODO(thomas): remove again
		go bootstrapTestController.Run(ctx, 1)

-	// TODO(thomas): remove again
-	go bootstrapTestController.Run(ctx, 1)
+	go bootstrapTeardownController.Run(ctx, 1)
+	// TODO(thomas): remove again - Track in issue #XXX
+	if bootstrapTestController != nil {
+		go bootstrapTestController.Run(ctx, 1)
+	}
+	go unsupportedConfigOverridesController.Run(ctx, 1)

Conversation

tjungblu commented Nov 19, 2024
openshift-ci-robot commented Nov 19, 2024
tjungblu commented Nov 20, 2024
openshift-ci bot commented Nov 20, 2024
tjungblu commented Nov 25, 2024
tjungblu commented Nov 27, 2024
tjungblu commented Nov 28, 2024
openshift-cherrypick-robot commented Nov 28, 2024
tjungblu commented Nov 28, 2024
openshift-ci bot commented Nov 28, 2024
tjungblu commented Nov 28, 2024
tjungblu commented Nov 28, 2024
openshift-ci bot commented Nov 28, 2024
tjungblu commented Dec 2, 2024
openshift-ci bot commented Dec 2, 2024
tjungblu commented Dec 3, 2024
openshift-ci bot commented Dec 3, 2024
vrutkovs commented Jan 8, 2025
vrutkovs commented Jan 8, 2025
vrutkovs commented Jan 14, 2025
tjungblu commented Sep 2, 2025
openshift-ci bot commented Sep 2, 2025
tjungblu commented Sep 3, 2025
openshift-ci bot commented Sep 3, 2025
tjungblu commented Sep 3, 2025
openshift-ci bot commented Sep 3, 2025
tjungblu commented Sep 4, 2025
openshift-ci bot commented Sep 4, 2025
tjungblu commented Sep 4, 2025
openshift-ci bot commented Sep 4, 2025
tjungblu commented Sep 5, 2025
openshift-ci bot commented Sep 5, 2025
tjungblu commented Sep 9, 2025
openshift-ci bot commented Sep 9, 2025
tjungblu commented Sep 10, 2025
openshift-ci bot commented Sep 10, 2025
coderabbitai bot commented Sep 15, 2025