K8SPSMDB-1451: Wait until primary elected after rs initialization#2149
K8SPSMDB-1451: Wait until primary elected after rs initialization#2149
Conversation
|
|
||
| out := strings.Trim(stdout.String(), "\n") | ||
| if out != "true" { | ||
| return errors.New("is not the writable primary") |
There was a problem hiding this comment.
| return errors.New("is not the writable primary") | |
| return errors.New(pod.Name+" is not the writable primary") |
There was a problem hiding this comment.
Pull request overview
This PR improves the replica set initialization process by replacing a blind 5-second wait after rs.initiate with an intelligent backoff retry mechanism that polls for primary election. The change makes the initialization process more reliable by actively checking when the primary is elected rather than assuming 5 seconds is always sufficient.
- Replaces fixed 5-second sleep with a backoff retry mechanism that actively checks for primary election
- Uses MongoDB's
hellocommand to verify the node is a writable primary - Adds exponential backoff with 5 retry steps and configurable timing parameters
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| Factor: 5.0, | ||
| Jitter: 0.1, | ||
| } | ||
| err = retry.OnError(backoff, func(err error) bool { return true }, func() error { |
There was a problem hiding this comment.
Are we going to also retry permanent errors, like auth, for example?
There was a problem hiding this comment.
i think yes. at this point there shouldn't be any auth errors any way. and in the worst case scenario we'll retry for 6 seconds.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
|
||
| hello := []string{"sh", "-c", | ||
| mongoCmd + " --quiet --eval 'db.hello().isWritablePrimary'"} | ||
| err := r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false) |
There was a problem hiding this comment.
Variable shadowing detected. The 'err' variable is declared with ':=' inside the retry function, which shadows the 'err' variable from the outer scope (line 736). This should use '=' instead of ':=' since 'err' is already declared in the outer scope. Variable shadowing can lead to confusion and potential bugs where the wrong error variable is used.
| err := r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false) | |
| err = r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false) |
| backoff := wait.Backoff{ | ||
| Steps: 5, | ||
| Duration: 50 * time.Millisecond, | ||
| Factor: 5.0, | ||
| Jitter: 0.1, | ||
| } | ||
| err = retry.OnError(backoff, func(err error) bool { return true }, func() error { | ||
| var stderr, stdout bytes.Buffer | ||
|
|
||
| hello := []string{"sh", "-c", | ||
| mongoCmd + " --quiet --eval 'db.hello().isWritablePrimary'"} | ||
| err := r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false) | ||
| if err != nil { | ||
| return errors.Wrapf(err, "run hello stdout: %s, stderr: %s", stdout.String(), stderr.String()) | ||
| } | ||
|
|
||
| out := strings.TrimSpace(stdout.String()) | ||
| if out != "true" { | ||
| return errors.Errorf("%s is not the writable primary", pod.Name) | ||
| } | ||
|
|
||
| log.Info(pod.Name+" is the writable primary", "replset", replsetName) | ||
|
|
||
| return nil | ||
| }) |
There was a problem hiding this comment.
The new backoff retry logic for waiting until a primary is elected lacks test coverage. Consider adding unit tests to verify the retry behavior, including successful primary election, timeout scenarios, and error handling. Other functions in mgo_test.go demonstrate the testing pattern used in this package.
commit: 1bb3527 |
CHANGE DESCRIPTION
Problem:
Operator blindly waits for 5 seconds after running
rs.initiate.Solution:
Check if primary is elected with backoff retry.
CHECKLIST
Jira
Needs Doc) and QA (Needs QA)?Tests
compare/*-oc.yml)?Config/Logging/Testability