K8SPSMDB-1296: improve readiness probe #1917

Merged: hors merged 42 commits into main from K8SPSMDB-1296 on Jan 7, 2026

Contributor

@pooknull pooknull commented May 12, 2025

https://perconadev.atlassian.net/browse/K8SPSMDB-1296

DESCRIPTION

This PR improves the readiness probe by verifying the stateStr field in the replSetGetStatus output. If the command cannot be executed, the readiness probe does not fail, because otherwise it would be impossible to deploy a mongod StatefulSet. The readiness probe fails if the value of stateStr is not PRIMARY, SECONDARY, or ARBITER.
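
As an illustration of the check described above, here is a minimal sketch, not the code from this PR. The `member` type and the `checkStateStr` function are hypothetical names; the only facts assumed are that `replSetGetStatus` returns a list of members, that each member carries a `stateStr`, and that MongoDB reports those states in upper case.

```go
package main

import (
	"errors"
	"fmt"
)

// member is a hypothetical, trimmed-down view of one entry in the
// "members" array of a replSetGetStatus response.
type member struct {
	Name     string
	StateStr string
	Self     bool // true for the entry describing the node being probed
}

// checkStateStr returns nil when the probed node reports PRIMARY,
// SECONDARY, or ARBITER, and an error for any other state.
func checkStateStr(members []member) error {
	for _, m := range members {
		if !m.Self {
			continue
		}
		switch m.StateStr {
		case "PRIMARY", "SECONDARY", "ARBITER":
			return nil
		default:
			return fmt.Errorf("member %s is not ready: state is %s", m.Name, m.StateStr)
		}
	}
	return errors.New("own member not found in replset status")
}

func main() {
	members := []member{
		{Name: "rs0-0:27017", StateStr: "PRIMARY"},
		{Name: "rs0-1:27017", StateStr: "STARTUP2", Self: true},
	}
	// prints "member rs0-1:27017 is not ready: state is STARTUP2"
	fmt.Println(checkStateStr(members))
}
```

In this sketch a node that is still syncing (STARTUP2) is reported as not ready, so a Service would not route traffic to it yet.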

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support oldest and newest supported MongoDB version?
  • Does the change support oldest and newest supported Kubernetes version?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label May 12, 2025
@github-actions github-actions bot added the tests label May 12, 2025
@pooknull pooknull marked this pull request as ready for review May 26, 2025 12:00
"github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo"
)

func getStatus(ctx context.Context, client mongo.Client) (ReplSetStatus, error) {
Contributor

Why don't we keep all the mongo client-related functions together as part of the Client interface type? I understand that doing so does not follow the interface segregation principle, but that interface already contains almost everything in terms of functionality.

Also, the response type seems to belong to the generic mongo model and could be moved to the mongo model file.

type ReplSetStatus struct {
...
}

This removes the need to have a utils file completely.

Comment on lines +60 to +62
if err != nil {
log.Error(err, "Failed to get replset status")
return nil
Contributor

I wonder whether we should ignore all errors, or only the case where this node is not a member of a replset?
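
To make the second option concrete, here is a minimal sketch of ignoring only one specific error. It is not the code from this PR: `errNotInReplset`, `errDial`, and `probe` are hypothetical names, and how the real error would be recognized (for example by a server error code) is left open.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors used only for this illustration.
var (
	errNotInReplset = errors.New("node is not a member of a replset")
	errDial         = errors.New("connection refused")
)

// probe calls getStatus and tolerates only errNotInReplset.
// Every other error is returned with added context.
func probe(getStatus func() error) error {
	err := getStatus()
	switch {
	case err == nil:
		return nil // a real probe would go on to inspect the state here
	case errors.Is(err, errNotInReplset):
		return nil // expected while the replset is still being initialized
	default:
		return fmt.Errorf("get replset status: %w", err)
	}
}

func main() {
	fmt.Println(probe(func() error { return errNotInReplset })) // prints "<nil>"
	fmt.Println(probe(func() error { return errDial }))         // prints "get replset status: connection refused"
}
```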

@pooknull pooknull requested a review from valmiranogueira as a code owner May 27, 2025 05:21
@hors hors added this to the v1.21.0 milestone May 27, 2025
- func CheckState(rs ReplSetStatus, startupDelaySeconds int64, oplogSize int64) error {
+ func CheckState(rs mongo.Status, startupDelaySeconds int64, oplogSize int64) error {
if rs.GetSelf() == nil {
return errors.New("invalid replset status")
Contributor

Is this error message right?

- func CheckState(rs ReplSetStatus, startupDelaySeconds int64, oplogSize int64) error {
+ func CheckState(rs mongo.Status, startupDelaySeconds int64, oplogSize int64) error {
if rs.GetSelf() == nil {
Contributor

Given that rs.GetSelf is used again on L126, it would be better to assign its result to a variable here, perform the nil check on that variable, and reuse it in the rest of the function. GetSelf loops through the members, so calling it more than once per invocation is unnecessary.
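
A minimal sketch of that suggestion follows. The `status` and `member` types and the `getSelf` and `describeSelf` functions are hypothetical stand-ins written for this example, not the types from the operator.

```go
package main

import (
	"errors"
	"fmt"
)

// member and status are hypothetical stand-ins for the replset status model.
type member struct {
	Name     string
	StateStr string
	Self     bool
}

type status struct {
	Members []member
}

// getSelf scans the member list for the entry describing this node.
func (s status) getSelf() *member {
	for i := range s.Members {
		if s.Members[i].Self {
			return &s.Members[i]
		}
	}
	return nil
}

// describeSelf calls getSelf once, checks the result for nil,
// and reuses the same pointer for the rest of the function.
func describeSelf(s status) (string, error) {
	self := s.getSelf()
	if self == nil {
		return "", errors.New("own member not found in replset status")
	}
	return fmt.Sprintf("%s is %s", self.Name, self.StateStr), nil
}

func main() {
	s := status{Members: []member{
		{Name: "rs0-0:27017", StateStr: "PRIMARY"},
		{Name: "rs0-1:27017", StateStr: "SECONDARY", Self: true},
	}}
	desc, err := describeSelf(s)
	fmt.Println(desc, err) // prints "rs0-1:27017 is SECONDARY <nil>"
}
```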

var d net.Dialer

addr := cnf.Hosts[0]
Contributor

Should we ensure that hosts are not empty/nil?
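
One possible shape for such a guard, as a minimal sketch: `config` and `firstHost` are hypothetical names chosen for this example, and only the `Hosts` field mirrors the snippet above.

```go
package main

import (
	"errors"
	"fmt"
)

// config is a hypothetical stand-in for the connection config;
// only the Hosts field is taken from the snippet under review.
type config struct {
	Hosts []string
}

// firstHost returns the first configured host, or an error when the
// slice is nil or empty (len of a nil slice is 0, so one check covers both).
func firstHost(cnf config) (string, error) {
	if len(cnf.Hosts) == 0 {
		return "", errors.New("no hosts configured")
	}
	return cnf.Hosts[0], nil
}

func main() {
	addr, err := firstHost(config{Hosts: []string{"localhost:27017"}})
	fmt.Println(addr, err) // prints "localhost:27017 <nil>"

	_, err = firstHost(config{})
	fmt.Println(err) // prints "no hosts configured"
}
```

Without a guard like this, indexing `Hosts[0]` on an empty slice panics at runtime.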

cnf.Timeout = time.Second
client, err := db.Dial(ctx, cnf)
if err != nil {
return nil, nil
Contributor

Why are we swallowing this error?

return &rs, nil
}()
if err != nil || s == nil {
return err
Contributor

Let's wrap this error with some context; MongodReadinessCheck already returns multiple errors.
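
For reference, a minimal sketch of wrapping with context using only the standard library. The names `errNotReady`, `checkReplset`, `readinessCheck`, and `isNotReady` are hypothetical, and the operator may use a different errors package.

```go
package main

import (
	"errors"
	"fmt"
)

// errNotReady is a hypothetical sentinel error for this illustration.
var errNotReady = errors.New("member is not ready")

// checkReplset is a hypothetical stand-in for the replset state check.
func checkReplset() error {
	return errNotReady
}

// readinessCheck wraps the underlying error with %w so the message says
// which step failed while the original error stays inspectable.
func readinessCheck() error {
	if err := checkReplset(); err != nil {
		return fmt.Errorf("check replset state: %w", err)
	}
	return nil
}

// isNotReady reports whether err wraps errNotReady.
func isNotReady(err error) bool {
	return errors.Is(err, errNotReady)
}

func main() {
	err := readinessCheck()
	fmt.Println(err)             // prints "check replset state: member is not ready"
	fmt.Println(isNotReady(err)) // prints "true"
}
```

With several failure paths in one function, the prefix makes it clear from the probe output alone which step produced the error.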

Copilot AI review requested due to automatic review settings December 18, 2025 09:28
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 221 out of 221 changed files in this pull request and generated 2 comments.


@pooknull pooknull requested a review from gkech December 18, 2025 10:32
Contributor

@egegunes egegunes left a comment

please check #1917 (comment)

gkech previously approved these changes Dec 23, 2025
@pooknull pooknull requested a review from egegunes December 23, 2025 09:02
egegunes previously approved these changes Dec 24, 2025
Collaborator

@hors hors left a comment

@pooknull please fix the unsafe-psa test

Copilot AI review requested due to automatic review settings January 6, 2026 08:56
@pooknull pooknull dismissed stale reviews from egegunes and gkech via ea52e0c January 6, 2026 08:56
Contributor

Copilot AI left a comment

Pull request overview

Copilot reviewed 223 out of 223 changed files in this pull request and generated 1 comment.


@hors hors self-requested a review January 6, 2026 10:24
@JNKPercona
Collaborator

Test Name Result Time
arbiter passed 00:00:00
balancer passed 00:00:00
cross-site-sharded passed 00:00:00
custom-replset-name passed 00:00:00
custom-tls passed 00:00:00
custom-users-roles passed 00:00:00
custom-users-roles-sharded passed 00:00:00
data-at-rest-encryption passed 00:00:00
data-sharded passed 00:00:00
demand-backup passed 00:00:00
demand-backup-eks-credentials-irsa passed 00:00:00
demand-backup-fs passed 00:00:00
demand-backup-if-unhealthy passed 00:00:00
demand-backup-incremental passed 00:00:00
demand-backup-incremental-sharded passed 00:56:46
demand-backup-physical-parallel passed 00:00:00
demand-backup-physical-aws passed 00:00:00
demand-backup-physical-azure passed 00:00:00
demand-backup-physical-gcp-s3 passed 00:00:00
demand-backup-physical-gcp-native passed 00:00:00
demand-backup-physical-minio passed 00:00:00
demand-backup-physical-minio-native passed 00:00:00
demand-backup-physical-sharded-parallel passed 00:00:00
demand-backup-physical-sharded-aws passed 00:00:00
demand-backup-physical-sharded-azure passed 00:00:00
demand-backup-physical-sharded-gcp-native passed 00:00:00
demand-backup-physical-sharded-minio passed 00:00:00
demand-backup-physical-sharded-minio-native passed 00:00:00
demand-backup-sharded passed 00:00:00
expose-sharded passed 00:00:00
finalizer passed 00:00:00
ignore-labels-annotations passed 00:00:00
init-deploy passed 00:00:00
ldap passed 00:00:00
ldap-tls passed 00:00:00
limits passed 00:00:00
liveness passed 00:00:00
mongod-major-upgrade passed 00:00:00
mongod-major-upgrade-sharded passed 00:00:00
monitoring-2-0 passed 00:00:00
monitoring-pmm3 passed 00:00:00
multi-cluster-service passed 00:00:00
multi-storage passed 00:00:00
non-voting-and-hidden passed 00:00:00
one-pod passed 00:00:00
operator-self-healing-chaos passed 00:00:00
pitr passed 00:00:00
pitr-physical passed 00:00:00
pitr-sharded passed 00:00:00
pitr-to-new-cluster passed 00:00:00
pitr-physical-backup-source passed 00:00:00
preinit-updates passed 00:00:00
pvc-resize passed 00:00:00
recover-no-primary passed 00:00:00
replset-overrides passed 00:00:00
replset-remapping passed 00:00:00
replset-remapping-sharded passed 00:00:00
rs-shard-migration passed 00:00:00
scaling passed 00:00:00
scheduled-backup passed 00:00:00
security-context passed 00:00:00
self-healing-chaos passed 00:00:00
service-per-pod passed 00:00:00
serviceless-external-nodes passed 00:00:00
smart-update passed 00:00:00
split-horizon passed 00:00:00
stable-resource-version passed 00:00:00
storage passed 00:00:00
tls-issue-cert-manager passed 00:00:00
unsafe-psa passed 00:00:00
upgrade passed 00:00:00
upgrade-consistency passed 00:00:00
upgrade-consistency-sharded-tls passed 00:00:00
upgrade-sharded passed 00:00:00
upgrade-partial-backup passed 00:00:00
users passed 00:00:00
users-vault passed 00:00:00
version-service passed 00:00:00
Summary Value
Tests Run 78/78
Job Duration 01:54:25
Total Test Time 00:56:46

commit: ea52e0c
image: perconalab/percona-server-mongodb-operator:PR-1917-ea52e0c6

@hors hors merged commit ca05f46 into main Jan 7, 2026
12 of 13 checks passed
@hors hors deleted the K8SPSMDB-1296 branch January 7, 2026 18:25

7 participants