
CLOUDP-389867: add delay to backupConfig sharded cluster #935

Draft

filipcirtog wants to merge 1 commit into master from CLOUDP-389867/add-delay-to-backupConfig-sharded-cluster

Conversation

@filipcirtog (Collaborator) commented Mar 24, 2026

HELP

HELP-87476 - This Jira ticket addresses a race condition that occurs when enabling backups for a sharded cluster deployed with the Kubernetes operator. The issue arises because individual shards may not receive 'addShard' events, leaving them inactive indefinitely. The investigation focuses on identifying the race condition in Ops Manager and on finding a solution that ensures all shards are included in backups without delay.

Summary

When enabling backup on a sharded cluster, Ops Manager needs time to complete its internal topology discovery before it can successfully accept a backup request. Without a delay, the operator races against OM's discovery, causing backup enablement to fail and triggering reconciliation loops until a retry eventually succeeds.

This race is specific to sharded clusters: their multi-process topology (mongos + config servers + shards) takes longer for OM to fully register than a replica set does.

Proof of Work

A configurable sleep is inserted in updateOmDeploymentShardedCluster immediately before calling ensureBackupConfigurationAndUpdateStatus, but only when a backup spec is present. The delay defaults to 60 seconds and is controlled by the MDB_BACKUP_START_DELAY_SECONDS environment variable on the operator deployment, allowing users to tune or disable it per environment.
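
The diff itself isn't shown in this conversation view, so here is a minimal Go sketch of the gating described above. Everything beyond the MDB_BACKUP_START_DELAY_SECONDS variable, the 60-second default, and the two function names taken from this description is an assumption: backupStartDelay is a hypothetical helper, and the real operator functions take reconciliation arguments that are omitted here.

```go
package main

import (
	"fmt"
	"log"
	"os"
	"strconv"
	"time"
)

// defaultBackupStartDelay mirrors the 60-second default described in the PR.
const defaultBackupStartDelay = 60 * time.Second

// backupStartDelay (hypothetical helper) reads MDB_BACKUP_START_DELAY_SECONDS
// from the operator's environment. Unset yields the default, "0" disables the
// delay, and a negative or unparsable value falls back to the default.
func backupStartDelay() time.Duration {
	raw, ok := os.LookupEnv("MDB_BACKUP_START_DELAY_SECONDS")
	if !ok {
		return defaultBackupStartDelay
	}
	secs, err := strconv.Atoi(raw)
	if err != nil || secs < 0 {
		log.Printf("invalid MDB_BACKUP_START_DELAY_SECONDS %q, using default %s", raw, defaultBackupStartDelay)
		return defaultBackupStartDelay
	}
	return time.Duration(secs) * time.Second
}

// updateOmDeploymentShardedCluster sketches the call site: the sleep runs only
// when a backup spec is present, immediately before the backup-configuration
// call, giving Ops Manager time to finish topology discovery.
func updateOmDeploymentShardedCluster(backupSpecPresent bool) {
	if backupSpecPresent {
		if d := backupStartDelay(); d > 0 {
			log.Printf("waiting %s before enabling backup on the sharded cluster", d)
			time.Sleep(d)
		}
	}
	ensureBackupConfigurationAndUpdateStatus()
}

// ensureBackupConfigurationAndUpdateStatus stands in for the operator's real
// backup-enablement logic, which is not part of this sketch.
func ensureBackupConfigurationAndUpdateStatus() {
	fmt.Println("backup configuration requested")
}

func main() {
	updateOmDeploymentShardedCluster(true)
}
```

One design note: a blocking sleep inside a reconcile pass ties up a controller worker for the full delay, so requeueing the reconciliation after a delay would be an alternative; the description above, however, explicitly describes an inserted sleep.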

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?

@github-actions

⚠️ (this preview might not be accurate if the PR is not rebased on the current master branch)

MCK 1.7.1 Release Notes

Bug Fixes

  • MongoDBOpsManager: Correctly handle the edge case where -admin-key was created by the user and is malformed. Previously the error was only surfaced in a DEBUG log entry.
  • MongoDBOpsManager: Improved readiness probe error handling and appDB agent status logging.

Other Changes

  • Container images: Merged the init-database and init-appdb init container images into a single init-database image. The init-appdb image will no longer be published; this change does not affect existing deployments.
    • The following Helm chart values have been removed: initAppDb.name, initAppDb.version, and registry.initAppDb. Use initDatabase.name, initDatabase.version, and registry.initDatabase instead.
    • The following environment variables have been removed: INIT_APPDB_IMAGE_REPOSITORY and INIT_APPDB_VERSION. Use INIT_DATABASE_IMAGE_REPOSITORY and INIT_DATABASE_VERSION instead.
  • Helm Chart: Removed the operator.baseName Helm value. This value was never intended to be consumed by operator users and was never documented. It controlled the prefix for workload RBAC resource names (default mongodb-kubernetes), but changing it could break the operator and workloads because the operator is not aware of custom prefixes. The Helm chart no longer allows this customisation; the relevant resources are deployed with predefined names: ServiceAccounts named mongodb-kubernetes-appdb, mongodb-kubernetes-database-pods, and mongodb-kubernetes-ops-manager, a Role named mongodb-kubernetes-appdb, and a RoleBinding named mongodb-kubernetes-appdb.

@filipcirtog added the skip-changelog label (Use this label in a Pull Request to not require a new changelog entry file) on Mar 24, 2026