K8SPSMDB-1451: Wait until primary elected after rs initialization #2149

Merged
hors merged 8 commits into main from K8SPSMDB-1451 on Jan 5, 2026

@egegunes (Contributor) commented Dec 16, 2025

CHANGE DESCRIPTION

Problem:
The operator waits a fixed 5 seconds after running rs.initiate, whether or not a primary has been elected by then.

Solution:
Check whether a primary has been elected, retrying with backoff.

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported MongoDB versions?
  • Does the change support the oldest and newest supported Kubernetes versions?

@egegunes egegunes added this to the v1.22.0 milestone Dec 16, 2025
@egegunes egegunes requested a review from hors as a code owner December 16, 2025 12:24
Copilot AI review requested due to automatic review settings December 16, 2025 12:24
@pull-request-size pull-request-size bot added the size/M 30-99 lines label Dec 16, 2025

out := strings.Trim(stdout.String(), "\n")
if out != "true" {
	return errors.New("is not the writable primary")

Suggested change
- return errors.New("is not the writable primary")
+ return errors.New(pod.Name+" is not the writable primary")

Copilot AI left a comment

Pull request overview

This PR improves the replica set initialization process by replacing a blind 5-second wait after rs.initiate with an intelligent backoff retry mechanism that polls for primary election. The change makes the initialization process more reliable by actively checking when the primary is elected rather than assuming 5 seconds is always sufficient.

  • Replaces fixed 5-second sleep with a backoff retry mechanism that actively checks for primary election
  • Uses MongoDB's hello command to verify the node is a writable primary
  • Adds exponential backoff with 5 retry steps and configurable timing parameters

mayankshah1607 previously approved these changes Dec 18, 2025
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 18, 2025 12:30
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 3 comments.


Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings December 18, 2025 12:45
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.


mayankshah1607 previously approved these changes Dec 19, 2025
Factor: 5.0,
Jitter: 0.1,
}
err = retry.OnError(backoff, func(err error) bool { return true }, func() error {

Are we going to also retry permanent errors, like auth, for example?

egegunes (Contributor Author) replied:

I think yes. At this point there shouldn't be any auth errors anyway, and in the worst-case scenario we'll retry for 6 seconds.

Copilot AI review requested due to automatic review settings January 5, 2026 05:45
Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.


hello := []string{"sh", "-c",
mongoCmd + " --quiet --eval 'db.hello().isWritablePrimary'"}
err := r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false)

Copilot AI Jan 5, 2026

Variable shadowing detected. The 'err' variable is declared with ':=' inside the retry function, which shadows the 'err' variable from the outer scope (line 736). This should use '=' instead of ':=' since 'err' is already declared in the outer scope. Variable shadowing can lead to confusion and potential bugs where the wrong error variable is used.

Suggested change
- err := r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false)
+ err = r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false)
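
The difference between := and = inside a closure can be seen in a small standalone program. This is an illustrative sketch with hypothetical helpers (run, failing, shadowed, assigned); it is not taken from the operator. In the snippet under review the closure returns its error and the result of retry.OnError is assigned to the outer err, so the effect there appears to be mainly one of readability.

```go
package main

import (
	"errors"
	"fmt"
)

// run invokes fn and discards its result, standing in for a caller that
// ignores what the closure returns.
func run(fn func() error) { _ = fn() }

func failing() error { return errors.New("boom") }

// shadowed declares a new err inside the closure, so the outer err stays nil.
func shadowed() error {
	var err error
	run(func() error {
		err := failing() // := creates a new variable scoped to the closure
		return err
	})
	return err
}

// assigned writes to the outer err from inside the closure.
func assigned() error {
	var err error
	run(func() error {
		err = failing() // = assigns to the variable captured from the outer scope
		return err
	})
	return err
}

func main() {
	fmt.Println("shadowed:", shadowed())
	fmt.Println("assigned:", assigned())
}
```

The shadowed variant loses the error only because run throws the closure's return value away; when the caller propagates that return value, as a retry helper does, both forms report the same failure.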

Comment on lines +730 to +754
backoff := wait.Backoff{
	Steps:    5,
	Duration: 50 * time.Millisecond,
	Factor:   5.0,
	Jitter:   0.1,
}
err = retry.OnError(backoff, func(err error) bool { return true }, func() error {
	var stderr, stdout bytes.Buffer

	hello := []string{"sh", "-c",
		mongoCmd + " --quiet --eval 'db.hello().isWritablePrimary'"}
	err := r.clientcmd.Exec(ctx, &pod, "mongod", hello, nil, &stdout, &stderr, false)
	if err != nil {
		return errors.Wrapf(err, "run hello stdout: %s, stderr: %s", stdout.String(), stderr.String())
	}

	out := strings.TrimSpace(stdout.String())
	if out != "true" {
		return errors.Errorf("%s is not the writable primary", pod.Name)
	}

	log.Info(pod.Name+" is the writable primary", "replset", replsetName)

	return nil
})

Copilot AI Jan 5, 2026

The new backoff retry logic for waiting until a primary is elected lacks test coverage. Consider adding unit tests to verify the retry behavior, including successful primary election, timeout scenarios, and error handling. Other functions in mgo_test.go demonstrate the testing pattern used in this package.

@JNKPercona (Collaborator) commented

Test Name Result Time
arbiter passed 00:11:22
balancer passed 00:18:04
cross-site-sharded passed 00:18:17
custom-replset-name passed 00:10:16
custom-tls passed 00:13:46
custom-users-roles passed 00:10:16
custom-users-roles-sharded passed 00:11:16
data-at-rest-encryption passed 00:12:47
data-sharded passed 00:22:15
demand-backup passed 00:15:20
demand-backup-eks-credentials-irsa passed 00:00:06
demand-backup-fs passed 00:22:19
demand-backup-if-unhealthy passed 00:08:26
demand-backup-incremental passed 00:42:24
demand-backup-incremental-sharded passed 00:57:33
demand-backup-physical-parallel passed 00:07:56
demand-backup-physical-aws passed 00:11:49
demand-backup-physical-azure passed 00:11:50
demand-backup-physical-gcp-s3 passed 00:10:43
demand-backup-physical-gcp-native passed 00:11:22
demand-backup-physical-minio passed 00:20:04
demand-backup-physical-minio-native passed 00:19:37
demand-backup-physical-sharded-parallel passed 00:10:28
demand-backup-physical-sharded-aws passed 00:18:13
demand-backup-physical-sharded-azure passed 00:17:10
demand-backup-physical-sharded-gcp-native passed 00:17:23
demand-backup-physical-sharded-minio passed 00:17:22
demand-backup-physical-sharded-minio-native passed 00:16:39
demand-backup-sharded passed 00:25:00
expose-sharded passed 00:33:10
finalizer passed 00:09:35
ignore-labels-annotations passed 00:07:30
init-deploy passed 00:12:40
ldap passed 00:08:40
ldap-tls passed 00:12:35
limits passed 00:06:02
liveness passed 00:08:05
mongod-major-upgrade passed 00:11:41
mongod-major-upgrade-sharded passed 00:20:30
monitoring-2-0 passed 00:24:36
monitoring-pmm3 passed 00:26:08
multi-cluster-service passed 00:14:46
multi-storage passed 00:18:33
non-voting-and-hidden passed 00:15:15
one-pod passed 00:07:41
operator-self-healing-chaos passed 00:12:16
pitr passed 00:31:19
pitr-physical passed 01:00:36
pitr-sharded passed 00:20:44
pitr-to-new-cluster passed 00:24:29
pitr-physical-backup-source passed 00:53:10
preinit-updates passed 00:05:14
pvc-resize passed 00:13:12
recover-no-primary passed 00:25:57
replset-overrides passed 00:16:10
replset-remapping passed 00:08:27
replset-remapping-sharded passed 00:17:01
rs-shard-migration passed 00:13:33
scaling passed 00:10:40
scheduled-backup passed 00:16:30
security-context passed 00:07:05
self-healing-chaos passed 00:14:41
service-per-pod passed 00:18:31
serviceless-external-nodes passed 00:07:20
smart-update passed 00:07:47
split-horizon passed 00:07:42
stable-resource-version passed 00:04:29
storage passed 00:07:19
tls-issue-cert-manager passed 00:29:33
unsafe-psa passed 00:07:02
upgrade passed 00:09:36
upgrade-consistency passed 00:06:09
upgrade-consistency-sharded-tls passed 00:52:16
upgrade-sharded passed 00:19:28
upgrade-partial-backup passed 00:14:39
users passed 00:17:08
users-vault passed 00:13:18
version-service passed 00:24:30
Summary Value
Tests Run 78/78
Job Duration 04:03:09
Total Test Time 22:06:02

commit: 1bb3527
image: perconalab/percona-server-mongodb-operator:PR-2149-1bb35277

@hors hors merged commit 65f5c6f into main Jan 5, 2026
13 of 15 checks passed
@hors hors deleted the K8SPSMDB-1451 branch January 5, 2026 10:37
7 participants