
K8SPSMDB-1211: handle FULL CLUSTER CRASH error during the restore #1926

Open

pooknull wants to merge 33 commits into main

Conversation


@pooknull pooknull commented May 16, 2025

K8SPSMDB-1211

https://perconadev.atlassian.net/browse/K8SPSMDB-1211

DESCRIPTION

Problem:
During a physical restore, the operator detects a FULL CLUSTER CRASH and attempts to resolve it. As a result, the operator log contains the FULL CLUSTER CRASH error message, which should not be logged in this case: the condition occurs every time during a physical restore and is expected.

Solution:
The solution is to perform the same action that the (*ReconcilePerconaServerMongoDB) handleReplicaSetNoPrimary method does, but after the physical restore. Once PBM has finished the restore, the operator should recreate the statefulsets, add the percona.com/restore-in-progress annotation to them, and handle the FULL CLUSTER CRASH state. Afterwards, the percona.com/restore-in-progress annotation should be removed from the statefulsets.
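For illustration, a minimal sketch of toggling the percona.com/restore-in-progress annotation on a statefulset with controller-runtime; the helper name and the annotation value are assumptions, not the PR's actual code:

package restore

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Annotation name taken from the PR description; everything else here is illustrative.
const annotationRestoreInProgress = "percona.com/restore-in-progress"

// setRestoreInProgress adds or removes the restore-in-progress annotation on a
// statefulset. In the flow described above, the operator would set it on the
// recreated statefulsets while it handles the FULL CLUSTER CRASH state and
// remove it once the recovery is finished.
func setRestoreInProgress(ctx context.Context, c client.Client, nn types.NamespacedName, inProgress bool) error {
	sts := &appsv1.StatefulSet{}
	if err := c.Get(ctx, nn, sts); err != nil {
		return err
	}
	if sts.Annotations == nil {
		sts.Annotations = map[string]string{}
	}
	if inProgress {
		sts.Annotations[annotationRestoreInProgress] = "true"
	} else {
		delete(sts.Annotations, annotationRestoreInProgress)
	}
	return c.Update(ctx, sts)
}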

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported MongoDB versions?
  • Does the change support the oldest and newest supported Kubernetes versions?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label May 16, 2025
@pull-request-size pull-request-size bot added size/XL 500-999 lines and removed size/XXL 1000+ lines labels May 19, 2025
@@ -0,0 +1,70 @@
package common
Contributor

I think packages named common, utils, etc., tend to be vague, as they imply shared logic without a clearly defined domain or separation of concerns.

In this file, the main struct is CommonReconciler, but it's not clear what exactly is being reconciled. The struct also mixes responsibilities: it constructs and returns heterogeneous components like backup.PBM, mongo.Client, a scheme, and a k8s client.

To improve clarity and maintainability, I'd suggest:

  • Keeping the scheme and the Kubernetes client in ReconcilePerconaServerMongoDB, and having the related functions take receivers of type ReconcilePerconaServerMongoDB.

  • Splitting out PBM-related logic into a dedicated PBM factory/service.

  • Doing the same for the MongoClientProvider (a rough sketch of this split follows below).
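For illustration only, a minimal sketch of the suggested split; the interface names, method signatures, and import paths below are assumptions rather than the PR's code:

package psmdb

import (
	"context"

	// Import paths are assumed from the operator's repository layout.
	psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
	"github.com/percona/percona-server-mongodb-operator/pkg/psmdb/backup"
	"github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo"
)

// PBMFactory owns construction of PBM handles, so reconcilers no longer build
// backup.PBM themselves.
type PBMFactory interface {
	NewPBM(ctx context.Context, cr *psmdbv1.PerconaServerMongoDB) (backup.PBM, error)
}

// MongoClientProvider owns construction of MongoDB clients for a given replset.
type MongoClientProvider interface {
	Mongo(ctx context.Context, cr *psmdbv1.PerconaServerMongoDB, rsName string) (mongo.Client, error)
}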

@github-actions github-actions bot added the tests label May 23, 2025
@hors hors added this to the v1.21.0 milestone May 27, 2025
@pull-request-size pull-request-size bot added size/XXL 1000+ lines and removed size/XL 500-999 lines labels May 28, 2025
@pull-request-size pull-request-size bot added size/XL 500-999 lines and removed size/XXL 1000+ lines labels May 28, 2025
@pooknull pooknull marked this pull request as ready for review May 29, 2025 07:26
@pooknull pooknull requested a review from gkech May 29, 2025 07:27
gkech previously approved these changes May 30, 2025
finished, err := r.finishPhysicalRestore(ctx, cluster)
if err != nil {
log.Error(err, "Failed to recover the cluster after the restore")
// status.State = psmdbv1.RestoreStateReady
Contributor

please remove commented code


Comment on lines +392 to +399
cfg, err := cli.ReadConfig(ctx)
if err != nil {
return errors.Wrap(err, "read replset config")
}

if err := cli.WriteConfig(ctx, cfg, true); err != nil {
return errors.Wrap(err, "reconfigure replset")
}
Contributor

I don't get what we are doing here: we are reconfiguring the replset, but we are not changing anything in the config?

Contributor Author

We are doing the same thing as in the (*ReconcilePerconaServerMongoDB) handleReplicaSetNoPrimary method:

cfg, err := cli.ReadConfig(ctx)
if err != nil {
return errors.Wrap(err, "read replset config")
}
if err := cli.WriteConfig(ctx, cfg, true); err != nil {
return errors.Wrap(err, "reconfigure replset")
}
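
For context (an assumption, not stated in the thread): the boolean passed to WriteConfig appears to be a force flag, so re-applying the unchanged configuration amounts to a forced replica-set reconfiguration, roughly the shell equivalent of rs.reconfig(rs.conf(), { force: true }). A forced reconfig can be applied without a primary, which is what would let the replica set recover and elect a primary again after a full cluster crash, even though no config fields change.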

nmarukovich previously approved these changes Jun 2, 2025
@gkech gkech requested review from gkech and nmarukovich June 3, 2025 11:58
@pooknull pooknull dismissed stale reviews from nmarukovich and gkech via 3f6285a June 9, 2025 11:46
@JNKPercona
Collaborator

Test name Status
arbiter failure
balancer failure
cross-site-sharded failure
custom-replset-name failure
custom-tls failure
custom-users-roles failure
custom-users-roles-sharded failure
data-at-rest-encryption failure
data-sharded failure
demand-backup failure
demand-backup-eks-credentials-irsa skipped
demand-backup-fs skipped
demand-backup-incremental skipped
demand-backup-incremental-sharded skipped
demand-backup-physical-parallel skipped
demand-backup-physical-aws skipped
demand-backup-physical-azure skipped
demand-backup-physical-gcp skipped
demand-backup-physical-minio skipped
demand-backup-physical-sharded-parallel skipped
demand-backup-physical-sharded-aws skipped
demand-backup-physical-sharded-azure skipped
demand-backup-physical-sharded-gcp skipped
demand-backup-physical-sharded-minio skipped
demand-backup-sharded skipped
expose-sharded skipped
finalizer skipped
ignore-labels-annotations skipped
init-deploy skipped
ldap skipped
ldap-tls skipped
limits skipped
liveness skipped
mongod-major-upgrade skipped
mongod-major-upgrade-sharded skipped
monitoring-2-0 skipped
monitoring-pmm3 skipped
multi-cluster-service skipped
multi-storage skipped
non-voting-and-hidden skipped
one-pod skipped
operator-self-healing-chaos skipped
pitr skipped
pitr-physical skipped
pitr-sharded skipped
pitr-physical-backup-source skipped
preinit-updates skipped
pvc-resize skipped
recover-no-primary skipped
replset-overrides skipped
rs-shard-migration skipped
scaling skipped
scheduled-backup skipped
security-context skipped
self-healing-chaos skipped
service-per-pod skipped
serviceless-external-nodes skipped
smart-update skipped
split-horizon skipped
stable-resource-version skipped
storage skipped
tls-issue-cert-manager skipped
upgrade skipped
upgrade-consistency skipped
upgrade-consistency-sharded-tls skipped
upgrade-sharded skipped
users skipped
version-service skipped
We ran 10 out of 68.

commit: c15daba
image: perconalab/percona-server-mongodb-operator:PR-1926-c15dabad
