
K8SPSMDB-1211: handle FULL CLUSTER CRASH error during the restore #1926

Open

pooknull wants to merge 33 commits into main

Conversation


@pooknull pooknull commented May 16, 2025

K8SPSMDB-1211

https://perconadev.atlassian.net/browse/K8SPSMDB-1211

DESCRIPTION

Problem:
During a physical restore, the operator detects a FULL CLUSTER CRASH and attempts to resolve it. As a result, the operator log contains the FULL CLUSTER CRASH error message, which should not be logged in this case: the condition occurs every time during a physical restore and is expected.

Solution:
The solution is to perform the same action that the (*ReconcilePerconaServerMongoDB) handleReplicaSetNoPrimary method does, but after the physical restore. Once PBM has finished the restore, the operator should recreate the statefulsets, add the percona.com/restore-in-progress annotation to them, and handle the FULL CLUSTER CRASH state. Afterwards, the percona.com/restore-in-progress annotation should be removed from the statefulsets.
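For illustration, a minimal sketch of toggling the percona.com/restore-in-progress annotation on a statefulset with controller-runtime; the helper name and the annotation value are assumptions, not the PR's actual code:

package restore

import (
	"context"

	appsv1 "k8s.io/api/apps/v1"
	"k8s.io/apimachinery/pkg/types"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// Annotation name taken from the PR description; everything else here is illustrative.
const annotationRestoreInProgress = "percona.com/restore-in-progress"

// setRestoreInProgress adds or removes the restore-in-progress annotation on a
// statefulset. In the flow described above, the operator would set it on the
// recreated statefulsets while it handles the FULL CLUSTER CRASH state and
// remove it once the recovery is finished.
func setRestoreInProgress(ctx context.Context, c client.Client, nn types.NamespacedName, inProgress bool) error {
	sts := &appsv1.StatefulSet{}
	if err := c.Get(ctx, nn, sts); err != nil {
		return err
	}
	if sts.Annotations == nil {
		sts.Annotations = map[string]string{}
	}
	if inProgress {
		sts.Annotations[annotationRestoreInProgress] = "true"
	} else {
		delete(sts.Annotations, annotationRestoreInProgress)
	}
	return c.Update(ctx, sts)
}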

CHECKLIST

Jira

  • Is the Jira ticket created and referenced properly?
  • Does the Jira ticket have the proper statuses for documentation (Needs Doc) and QA (Needs QA)?
  • Does the Jira ticket link to the proper milestone (Fix Version field)?

Tests

  • Is an E2E test/test case added for the new feature/change?
  • Are unit tests added where appropriate?
  • Are OpenShift compare files changed for E2E tests (compare/*-oc.yml)?

Config/Logging/Testability

  • Are all needed new/changed options added to default YAML files?
  • Are all needed new/changed options added to the Helm Chart?
  • Did we add proper logging messages for operator actions?
  • Did we ensure compatibility with the previous version or cluster upgrade process?
  • Does the change support the oldest and newest supported MongoDB versions?
  • Does the change support the oldest and newest supported Kubernetes versions?

@pull-request-size pull-request-size bot added the size/XXL 1000+ lines label May 16, 2025
@pull-request-size pull-request-size bot added size/XL 500-999 lines and removed size/XXL 1000+ lines labels May 19, 2025
@@ -0,0 +1,70 @@
package common
Contributor

I think packages named common, utils, etc., tend to be vague, as they imply shared logic without a clearly defined domain or separation of concerns.

In this file, the main struct is CommonReconciler, but it's not clear what exactly is being reconciled. The struct also mixes responsibilities: it constructs and returns heterogeneous components like backup.PBM, mongo.Client, a scheme, and a k8s client.

To improve clarity and maintainability, I'd suggest:

  • Keeping the scheme and the Kubernetes client in ReconcilePerconaServerMongoDB, and having the related functions take receivers of type ReconcilePerconaServerMongoDB.

  • Splitting out PBM-related logic into a dedicated PBM factory/service.

  • Doing the same for the MongoClientProvider (a rough sketch of this split follows below).
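For illustration only, a minimal sketch of the suggested split; the interface names, method signatures, and import paths below are assumptions rather than the PR's code:

package psmdb

import (
	"context"

	// Import paths are assumed from the operator's repository layout.
	psmdbv1 "github.com/percona/percona-server-mongodb-operator/pkg/apis/psmdb/v1"
	"github.com/percona/percona-server-mongodb-operator/pkg/psmdb/backup"
	"github.com/percona/percona-server-mongodb-operator/pkg/psmdb/mongo"
)

// PBMFactory owns construction of PBM handles, so reconcilers no longer build
// backup.PBM themselves.
type PBMFactory interface {
	NewPBM(ctx context.Context, cr *psmdbv1.PerconaServerMongoDB) (backup.PBM, error)
}

// MongoClientProvider owns construction of MongoDB clients for a given replset.
type MongoClientProvider interface {
	Mongo(ctx context.Context, cr *psmdbv1.PerconaServerMongoDB, rsName string) (mongo.Client, error)
}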

@github-actions github-actions bot added the tests label May 23, 2025
@hors hors added this to the v1.21.0 milestone May 27, 2025
@pull-request-size pull-request-size bot added size/XXL 1000+ lines and removed size/XL 500-999 lines labels May 28, 2025
@pull-request-size pull-request-size bot added size/XL 500-999 lines and removed size/XXL 1000+ lines labels May 28, 2025
@pooknull pooknull marked this pull request as ready for review May 29, 2025 07:26
@pooknull pooknull requested a review from gkech May 29, 2025 07:27
gkech previously approved these changes May 30, 2025
finished, err := r.finishPhysicalRestore(ctx, cluster)
if err != nil {
log.Error(err, "Failed to recover the cluster after the restore")
// status.State = psmdbv1.RestoreStateReady
Contributor

please remove commented code


Comment on lines +392 to +399
cfg, err := cli.ReadConfig(ctx)
if err != nil {
return errors.Wrap(err, "read replset config")
}

if err := cli.WriteConfig(ctx, cfg, true); err != nil {
return errors.Wrap(err, "reconfigure replset")
}
Contributor

I don't get what we are doing here: we are reconfiguring the replset, but we are not changing anything in the config?

Contributor Author

We are doing the same thing as in the (*ReconcilePerconaServerMongoDB) handleReplicaSetNoPrimary method:

cfg, err := cli.ReadConfig(ctx)
if err != nil {
return errors.Wrap(err, "read replset config")
}
if err := cli.WriteConfig(ctx, cfg, true); err != nil {
return errors.Wrap(err, "reconfigure replset")
}
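
For context (an assumption, not stated in the thread): the boolean passed to WriteConfig appears to be a force flag, so re-applying the unchanged configuration amounts to a forced replica-set reconfiguration, roughly the shell equivalent of rs.reconfig(rs.conf(), { force: true }). A forced reconfig can be applied without a primary, which is what would let the replica set recover and elect a primary again after a full cluster crash, even though no config fields change.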

nmarukovich previously approved these changes Jun 2, 2025
@gkech gkech requested review from gkech and nmarukovich June 3, 2025 11:58
@pooknull pooknull dismissed stale reviews from nmarukovich and gkech via 3f6285a June 9, 2025 11:46
@JNKPercona
Collaborator

Test name Status
arbiter failure
balancer failure
cross-site-sharded failure
custom-replset-name failure
custom-tls failure
custom-users-roles failure
custom-users-roles-sharded failure
data-at-rest-encryption failure
data-sharded failure
demand-backup failure
demand-backup-eks-credentials-irsa skipped
demand-backup-fs skipped
demand-backup-incremental skipped
demand-backup-incremental-sharded skipped
demand-backup-physical-parallel skipped
demand-backup-physical-aws skipped
demand-backup-physical-azure skipped
demand-backup-physical-gcp skipped
demand-backup-physical-minio skipped
demand-backup-physical-sharded-parallel skipped
demand-backup-physical-sharded-aws skipped
demand-backup-physical-sharded-azure skipped
demand-backup-physical-sharded-gcp skipped
demand-backup-physical-sharded-minio skipped
demand-backup-sharded skipped
expose-sharded skipped
finalizer skipped
ignore-labels-annotations skipped
init-deploy skipped
ldap skipped
ldap-tls skipped
limits skipped
liveness skipped
mongod-major-upgrade skipped
mongod-major-upgrade-sharded skipped
monitoring-2-0 skipped
monitoring-pmm3 skipped
multi-cluster-service skipped
multi-storage skipped
non-voting-and-hidden skipped
one-pod skipped
operator-self-healing-chaos skipped
pitr skipped
pitr-physical skipped
pitr-sharded skipped
pitr-physical-backup-source skipped
preinit-updates skipped
pvc-resize skipped
recover-no-primary skipped
replset-overrides skipped
rs-shard-migration skipped
scaling skipped
scheduled-backup skipped
security-context skipped
self-healing-chaos skipped
service-per-pod skipped
serviceless-external-nodes skipped
smart-update skipped
split-horizon skipped
stable-resource-version skipped
storage skipped
tls-issue-cert-manager skipped
upgrade skipped
upgrade-consistency skipped
upgrade-consistency-sharded-tls skipped
upgrade-sharded skipped
users skipped
version-service skipped
We ran 10 out of 68.

commit: c15daba
image: perconalab/percona-server-mongodb-operator:PR-1926-c15dabad
