-
Notifications
You must be signed in to change notification settings - Fork 147
OCPBUGS-42810: actively move bootstrap member lead #1369
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
@tjungblu: This pull request references Jira Issue OCPBUGS-42810, which is invalid:
Comment The bug has been updated to refer to the pull request using the external bug tracker. In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
5720e20
to
a747960
Compare
/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/ce14d7c0-a741-11ef-9f14-d84ba44f8a80-0 |
a747960
to
0c031aa
Compare
/retest |
1 similar comment
/retest |
0c031aa
to
369b1cd
Compare
/cherry-pick release-4.18 |
@tjungblu: once the present PR merges, I will cherry-pick it on top of In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/01c7f930-ada0-11ef-8674-28fc71df2147-0 |
/retest-required |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/e6fa4930-adab-11ef-8910-6373b3a73b44-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/37cd8410-b0a6-11ef-8f86-550f1a2ac5ba-0 |
/retest /payload-aggregate periodic-ci-openshift-release-master-ci-4.18-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/63892fc0-b180-11ef-99d0-e245fa792cd0-0 |
err = c.removeBootstrap(timeoutCtx, safeToRemoveBootstrap, hasBootstrap, bootstrapID) | ||
if hasBootstrap { | ||
if err := c.ensureBootstrapIsNotLeader(ctx, bootstrapMember); err != nil { | ||
klog.Errorf("error while ensuring bootstrap is not leader: %v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we return err here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
not really, I wouldn't want the controller to degrade when that operation fails once. Mind you, this happens very very rarely and affects our CI pass rates in very few cases of hundreds.
/retest |
/test e2e-metal-ovn-ha-cert-rotation-shutdown |
/test e2e-metal-ovn-ha-cert-rotation-shutdown e2e-metal-ovn-sno-cert-rotation-shutdown |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9aae7ad0-87e2-11f0-8e36-9df58d5d7d07-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/d13fd050-888a-11f0-8776-036a5a213b71-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/4b6983f0-88c0-11f0-9736-1b2d461ebfab-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 10 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/69083890-8954-11f0-9223-cf9f15e3c9ba-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 20 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/7e345990-897e-11f0-8451-b91fd76131ee-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-aws-ovn-upgrade 20 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/1a7dde20-8a3d-11f0-9f46-8138e528642f-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.21-e2e-azure-ovn-upgrade 20 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/94d60770-8d75-11f0-953c-decb559445d5-0 |
/payload-aggregate periodic-ci-openshift-release-master-ci-4.20-e2e-azure-ovn-upgrade 20 |
@tjungblu: trigger 1 job(s) for the /payload-(with-prs|job|aggregate|job-with-prs|aggregate-with-prs) command
See details on https://pr-payload-tests.ci.openshift.org/runs/ci/9c92dad0-8e3b-11f0-99df-a62f065d382a-0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some nits and questions for my own sake but seems good overall!
pkg/operator/ceohelpers/leader.go
Outdated
return false, err | ||
} | ||
|
||
klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Super pedantic nit since you're already fixing spelling mistakes in this PR
klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0]) | |
klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also while we're here, does MoveLeader() == nil
mean the leadership was actually transferred, or we've only made a request to the leader, and it may still be in the process of stepping down or we're in the midst of an election.
Asking since the warning here says Moving
instead of Moved
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay so it seems like it waits for the actual transfer to finish and not just a shallow request.
cluster-etcd-operator/vendor/go.etcd.io/etcd/server/v3/etcdserver/server.go
Lines 1513 to 1520 in 61b7fb5
s.r.TransferLeadership(ctx, lead, transferee) | |
for s.Lead() != transferee { | |
select { | |
case <-ctx.Done(): // time out | |
return ErrTimeoutLeaderTransfer | |
case <-time.After(interval): | |
} | |
} |
In that case wdyt about the past tense:
klog.Warningf("Moving lead from member [%x] (%s) to [%x] (%s) succesfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0]) | |
klog.Warningf("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0]) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yep, sounds good! thanks for proof reading this. I don't really understand why my spell checker in Goland picks up on the bOOstrap but not on the success 👍
pkg/operator/ceohelpers/leader.go
Outdated
} | ||
|
||
if otherMember == nil { | ||
return false, fmt.Errorf("no follower member found for the members: %v", memberList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just some more context on the operation in case this ever does happen.
return false, fmt.Errorf("no follower member found for the members: %v", memberList) | |
return false, fmt.Errorf("no follower member found for leadership transfer: %v", memberList) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you, applied!
// if we have just moved it, we will skip this sync iteration to backoff the controller - the next resync will happen after a minute anyway | ||
if moved { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems okay so not disagreeing here, but for my knowledge what is the potential negative impact of removing the member right after we transfer?
At first I thought we're waiting for the transfer to happen but as I found out below MoveLeader()
seems to be synchronous.
If we are indeed waiting for something to stabilize then just thinking if it's possible that may not happen by the next resync.
Or just so we get a chance to update the status first.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If we are indeed waiting for something to stabilize then just thinking if it's possible that may not happen by the next resync.
great question: what are we waiting on here. My intuition was to avoid disrupting the API too much, the clients will get the LeaderChangedErr (Error from server: etcdserver: leader changed
) on the move. Then potentially removing the member could also disrupt it further, e.g. during a static pod rollout.
I know this is highly hypothetical...
Unfortunately, I do not have any data yet on whether this would be really necessary. It is just too damn hard to reproduce this case.
Let me quickly put together a controller that will always make the bootstrap node the leader during the bootstrapping phase.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here's a run with this controller, you can grep the logs with TEST_ONLY
it also seems there was no actual move being done before removing, even though we have moved the leader to the bootstrap node a minute earlier.
it still kinda shows what I was worried about:
I0915 13:00:26.367586 1 event.go:377] Event(v1.ObjectReference{Kind:"Deployment", Namespace:"openshift-etcd-operator", Name:"etcd-operator", UID:"6e338218-3237-4f73-9c33-f83a5f40b8a0", APIVersion:"apps/v1", ResourceVersion:"", FieldPath:""}): type: 'Normal' reason: 'Bootstrap member removed' successfully removed bootstrap member [32326af0e011bf0a]
E0915 13:00:26.377135 1 event.go:359] "Server rejected event (will not retry!)" err="rpc error: code = Unknown desc = raft: stopped" event="&Event{ObjectMeta:{etcd-operator.1865763b3da4743d openshift-etcd-operator 0 0001-01-01 00:00:00 +0000 UTC map[] map[] [] []
it seems that this is the case of scaling from 4->3, but during a static pod rollout, so that breaks quorum and causes raft to stop.
Anyway, I need a bit more time to test this, so I'll pick this up again when I have some time left with RIT :)
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dusk125, hasbro17, tjungblu The full list of commands accepted by this bot can be found here. The pull request process is described here
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
/retest-required |
This PR will actively try to move the leadership away from the bootstrap member to another member. Signed-off-by: Thomas Jungblut <[email protected]>
994ce3f
to
42628ec
Compare
WalkthroughAdds leader movement capability and exposes etcd member status responses. Updates etcd client interfaces and fake implementations, introduces leadership helper utilities and tests, refactors bootstrap teardown to avoid removing a leader, adds a test-only bootstrap leadership controller, and wires it into operator startup. Changes
Estimated code review effort🎯 4 (Complex) | ⏱️ ~60 minutes Pre-merge checks and finishing touches❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✨ Finishing touches
🧪 Generate unit tests
Tip 👮 Agentic pre-merge checks are now available in preview!Pro plan users can now enable pre-merge checks in their settings to enforce checklists before merging PRs.
Please see the documentation for more information. Example: reviews:
pre_merge_checks:
custom_checks:
- name: "Undocumented Breaking Changes"
mode: "warning"
instructions: |
Pass/fail criteria: All breaking changes to public APIs, CLI flags, environment variables, configuration keys, database schemas, or HTTP/GraphQL endpoints must be documented in the "Breaking Change" section of the PR description and in CHANGELOG.md. Exclude purely internal or private changes (e.g., code not exported from package entry points or explicitly marked as internal). Please share your feedback with us on this Discord post. Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
/jira refresh |
@tjungblu: This pull request references Jira Issue OCPBUGS-42810, which is valid. The bug has been moved to the POST state. 3 validation(s) were run on this bug
Requesting review from QA contact: In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the openshift-eng/jira-lifecycle-plugin repository. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Actionable comments posted: 3
♻️ Duplicate comments (1)
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (1)
154-174
: Leadership move and backoff strategy looks goodThe implementation correctly:
- Ensures bootstrap is not the leader before removal
- Backs off for a resync cycle after moving leadership
- Records appropriate events and conditions
This aligns with the discussion about avoiding API disruption during leadership changes.
🧹 Nitpick comments (8)
pkg/etcdcli/helpers.go (3)
108-114
: Consider adding member existence validation in MoveLeader.The method updates the leader for all statuses without verifying if
toMember
actually exists in the cluster. This could lead to an inconsistent state in tests.Add validation to ensure the target member exists:
func (f *fakeEtcdClient) MoveLeader(ctx context.Context, toMember uint64) error { + // Validate that the target member exists + memberExists := false + for _, m := range f.members { + if m.ID == toMember { + memberExists = true + break + } + } + if !memberExists { + return fmt.Errorf("target member %d does not exist", toMember) + } + for _, status := range f.opts.status { status.Leader = toMember } return nil }
173-185
: Potential inconsistency in default leader assignment.When no status is found for a member, the method returns a default
StatusResponse
with the member set as its own leader. This might not reflect the actual cluster state and could lead to incorrect leader detection logic.Consider returning an error or setting the leader to 0 (no leader) instead:
func (f *fakeEtcdClient) MemberStatus(ctx context.Context, member *etcdserverpb.Member) (*clientv3.StatusResponse, error) { // Find the status for this member for _, status := range f.opts.status { if status.Header != nil && status.Header.MemberId == member.ID { return status, nil } } - // Return a default status if none found - return &clientv3.StatusResponse{ - Header: &etcdserverpb.ResponseHeader{MemberId: member.ID}, - Leader: member.ID, // Default to self as leader - }, nil + // Return an error if status not found + return nil, fmt.Errorf("no status found for member %d", member.ID) }
272-278
: WithLeader doesn't initialize status headers.The
WithLeader
option modifies existing statuses but doesn't ensure they have proper headers. This could cause nil pointer issues when the statuses are later accessed.Consider initializing headers when setting the leader:
func WithLeader(leader uint64) FakeClientOption { return func(fo *FakeClientOptions) { for _, status := range fo.status { + if status.Header == nil { + status.Header = &etcdserverpb.ResponseHeader{} + } status.Leader = leader } } }pkg/operator/ceohelpers/leader.go (2)
31-52
: Consider adding retry logic for transient failures.The
MoveLeader
operation might fail due to transient network issues or temporary unavailability. Consider implementing retry logic with exponential backoff for production resilience.Also, the log message at line 50 should use
Info
level instead ofWarning
since this is a successful operation:- klog.Warningf("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0]) + klog.Infof("Moved lead from member [%x] (%s) to [%x] (%s) successfully!", leader.ID, leader.GetClientURLs()[0], otherMember.ID, otherMember.GetClientURLs()[0])
34-39
: Consider a more intelligent follower selection strategy.Currently, the function selects the first available follower. In production, you might want to consider factors like member health, load, or geographic distribution.
For better load distribution and resilience, consider:
- Checking member health status before selection
- Implementing round-robin or least-recently-used selection
- Avoiding members that are learners or unhealthy
This would help ensure leadership moves to the most suitable member rather than just the first available one.
pkg/operator/bootstraptest/bootstrap_lead_test_controller.go (1)
35-66
: Consider adding metrics for test controller operationsThis test controller modifies cluster state by moving leadership. For debugging and observability, consider adding metrics to track:
- Number of leadership move attempts
- Success/failure rates
- Time taken for leadership moves
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (2)
105-107
: Typo in function nameThe function name has a typo:
setSuccessfulBoostrapRemovalStatus
should besetSuccessfulBootstrapRemovalStatus
.This is already fixed in the code (line 105 shows the correct spelling), but ensure consistency throughout.
272-273
: Consider validating ClientURLs is not emptyThe code accesses
bootstrapMember.GetClientURLs()[0]
without checking if the slice is non-empty. While this should always be populated in practice, adding a defensive check would prevent potential panics.Apply this diff to add a defensive check:
-klog.Warningf("Bootstrap member [%x] (%s) detected as leader, trying to move elsewhere...", bootstrapMember.ID, bootstrapMember.GetClientURLs()[0]) +clientURL := "unknown" +if len(bootstrapMember.GetClientURLs()) > 0 { + clientURL = bootstrapMember.GetClientURLs()[0] +} +klog.Warningf("Bootstrap member [%x] (%s) detected as leader, trying to move elsewhere...", bootstrapMember.ID, clientURL)
📜 Review details
Configuration used: CodeRabbit UI
Review profile: CHILL
Plan: Pro
Cache: Disabled due to data retention organization setting
Knowledge base: Disabled due to Reviews -> Disable Knowledge Base
setting
📒 Files selected for processing (9)
pkg/etcdcli/etcdcli.go
(2 hunks)pkg/etcdcli/helpers.go
(3 hunks)pkg/etcdcli/interfaces.go
(3 hunks)pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go
(7 hunks)pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go
(7 hunks)pkg/operator/bootstraptest/bootstrap_lead_test_controller.go
(1 hunks)pkg/operator/ceohelpers/leader.go
(1 hunks)pkg/operator/ceohelpers/leader_test.go
(1 hunks)pkg/operator/starter.go
(3 hunks)
🔇 Additional comments (9)
pkg/etcdcli/interfaces.go (3)
17-33
: LGTM! Clean interface design.The addition of
LeaderMover
to theEtcdClient
interface is well-structured and follows the existing pattern of composing interfaces.
68-70
: LGTM! Well-designed leader movement interface.The
LeaderMover
interface is clean and focused with a clear single responsibility.
93-96
: LGTM! Improved MemberStatus signature.The updated signature returning
(*clientv3.StatusResponse, error)
provides more detailed information and better aligns with etcd's client API patterns.pkg/etcdcli/etcdcli.go (2)
267-279
: LGTM! Clean implementation of MoveLeader.The implementation follows the established pattern with proper timeout handling and resource cleanup.
460-474
: LGTM! Improved MemberStatus implementation.The refactored implementation is cleaner and more direct, eliminating the previous status mapping logic in favor of returning the actual etcd status response.
pkg/operator/ceohelpers/leader.go (1)
12-29
: LGTM! Well-structured leader detection logic.The function properly validates consistency across members and handles errors appropriately.
pkg/operator/ceohelpers/leader_test.go (1)
1-315
: Comprehensive test coverage!Excellent test implementation with thorough coverage of edge cases including:
- Various leader positions in the member list
- Single-member clusters
- Empty and nil member lists
- Error handling scenarios
- Inconsistent leader reporting
The table-driven approach with clear test names makes the tests maintainable and easy to understand.
pkg/operator/bootstrapteardown/bootstrap_teardown_controller_test.go (1)
375-389
: LGTM: Leader movement test case properly validates the new behaviorThe new test case "safe, has bootstrap, moving leader, not removing the bootstrap member" correctly validates that:
- The bootstrap member is not removed when it's the leader
- The appropriate conditions are set (
conditionEnoughEtcdMembers
andconditionBootstrapLeaderMoved
)- No events are emitted (0 expected events)
pkg/operator/bootstrapteardown/bootstrap_teardown_controller.go (1)
257-274
: Solid implementation of leader check and movementThe
ensureBootstrapIsNotLeader
method properly:
- Lists members to get current state
- Finds the current leader
- Only attempts to move leadership if bootstrap is the leader
- Delegates to the helper function for the actual move
The error handling is comprehensive and the logging provides good observability.
if hasBootstrap { | ||
klog.Warningf("TEST_ONLY moving leader to the bootstrap member") | ||
moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members) | ||
if err != nil { | ||
return err | ||
} | ||
if moved { | ||
klog.Warningf("TEST_ONLY successfully moved to the bootstrap member") | ||
} else { | ||
klog.Warningf("TEST_ONLY failed to move to the bootstrap member") | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Incorrect leader movement logic detected
The message at line 53 says "moving leader to the bootstrap member", but the code at line 54 calls MoveLeaderToAnotherMember
which moves leadership away from the bootstrap member. The log messages are contradictory to the actual behavior.
Apply this diff to fix the log messages:
if hasBootstrap {
- klog.Warningf("TEST_ONLY moving leader to the bootstrap member")
+ klog.Warningf("TEST_ONLY moving leader away from the bootstrap member")
moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members)
if err != nil {
return err
}
if moved {
- klog.Warningf("TEST_ONLY successfully moved to the bootstrap member")
+ klog.Warningf("TEST_ONLY successfully moved away from the bootstrap member")
} else {
- klog.Warningf("TEST_ONLY failed to move to the bootstrap member")
+ klog.Warningf("TEST_ONLY failed to move away from the bootstrap member")
}
}
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
if hasBootstrap { | |
klog.Warningf("TEST_ONLY moving leader to the bootstrap member") | |
moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members) | |
if err != nil { | |
return err | |
} | |
if moved { | |
klog.Warningf("TEST_ONLY successfully moved to the bootstrap member") | |
} else { | |
klog.Warningf("TEST_ONLY failed to move to the bootstrap member") | |
} | |
if hasBootstrap { | |
klog.Warningf("TEST_ONLY moving leader away from the bootstrap member") | |
moved, err := ceohelpers.MoveLeaderToAnotherMember(ctx, c.etcdClient, bootstrapMember, members) | |
if err != nil { | |
return err | |
} | |
if moved { | |
klog.Warningf("TEST_ONLY successfully moved away from the bootstrap member") | |
} else { | |
klog.Warningf("TEST_ONLY failed to move away from the bootstrap member") | |
} | |
} |
🤖 Prompt for AI Agents
In pkg/operator/bootstraptest/bootstrap_lead_test_controller.go around lines 52
to 62, the log messages incorrectly state that the leader is being moved "to the
bootstrap member" while the code calls MoveLeaderToAnotherMember (which moves
the leader away from the bootstrap member); update the three klog.Warningf
message strings to accurately reflect the action (e.g., "TEST_ONLY moving leader
away from the bootstrap member", "TEST_ONLY successfully moved leader away from
the bootstrap member", "TEST_ONLY failed to move leader away from the bootstrap
member") so the logs match the actual behavior.
// TODO(thomas): TEST ONLY, REMOVE AGAIN | ||
bootstrapTestController := bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
💡 Verification agent
🧩 Analysis chain
Test-only controller in production code needs removal plan.
The bootstrap test controller is marked as "TEST ONLY" but is being started in production code. This could have unintended side effects in production environments.
Please ensure this controller:
- Has appropriate feature gates or environment checks to prevent it from running in production
- Has a clear removal timeline and tracking issue
Consider gating this behind a test environment check:
- // TODO(thomas): TEST ONLY, REMOVE AGAIN
- bootstrapTestController := bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)
+ // TODO(thomas): TEST ONLY, REMOVE AGAIN - Track in issue #XXX
+ var bootstrapTestController factory.Controller
+ if os.Getenv("ENABLE_TEST_CONTROLLERS") == "true" {
+ bootstrapTestController = bootstraptest.NewBootstrapTestController(etcdClient, controllerContext.EventRecorder)
+ }
Let me verify what this test controller does:
🏁 Script executed:
#!/bin/bash
# Check the implementation of the bootstrap test controller
fd -t f "bootstrap.*test.*controller\.go" --exec cat {}
Length of output: 1708
Critical: remove or gate TEST-ONLY bootstrap controller that moves etcd leader
The controller is instantiated unconditionally in pkg/operator/starter.go (around lines 471–473). Its sync() detects a member named "etcd-bootstrap" and calls ceohelpers.MoveLeaderToAnotherMember — this will attempt to change cluster leadership and must not run in production.
- Do not run this controller in production: remove it or gate it behind an explicit, default‑off mechanism (build tag, operator feature‑gate, or an env var such as ENABLE_TEST_CONTROLLERS="true").
- Add a tracked issue and TODO with a concrete removal timeline (reference the issue in the code).
- If kept for local tests only, ensure it is only registered by test harness/startup logic and add CI checks that it is disabled in release images.
Locations: pkg/operator/starter.go (call site), bootstraptest package (sync implementation that calls MoveLeaderToAnotherMember).
// TODO(thomas): remove again | ||
go bootstrapTestController.Run(ctx, 1) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Test controller started unconditionally.
The test controller is being started without any conditional checks, meaning it will run in all environments including production.
Guard the controller execution:
go bootstrapTeardownController.Run(ctx, 1)
- // TODO(thomas): remove again
- go bootstrapTestController.Run(ctx, 1)
+ // TODO(thomas): remove again - Track in issue #XXX
+ if bootstrapTestController != nil {
+ go bootstrapTestController.Run(ctx, 1)
+ }
go unsupportedConfigOverridesController.Run(ctx, 1)
📝 Committable suggestion
‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.
// TODO(thomas): remove again | |
go bootstrapTestController.Run(ctx, 1) | |
go bootstrapTeardownController.Run(ctx, 1) | |
// TODO(thomas): remove again - Track in issue #XXX | |
if bootstrapTestController != nil { | |
go bootstrapTestController.Run(ctx, 1) | |
} | |
go unsupportedConfigOverridesController.Run(ctx, 1) |
🤖 Prompt for AI Agents
In pkg/operator/starter.go around lines 647-648 the test controller is started
unconditionally (go bootstrapTestController.Run(ctx, 1)), which will run in all
environments; wrap this call in a guard so it only runs in non-production/test
environments (e.g., check a config flag, environment variable like
ENABLE_TEST_CONTROLLER or an existing runtime mode/IsDev flag), remove or honor
the TODO, and log when the test controller is enabled/disabled; ensure the
default for production is disabled so the controller is not started unless
explicitly enabled.
@tjungblu: The following tests failed, say
Full PR test history. Your PR dashboard. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here. |
This PR will actively try to move the leadership away from the bootstrap member to another healthy member.
/hold