
Commit 5bbf035

Julien-Ben authored and lsierant committed
CLOUDP-288551: Block scaling both ways at runtime (#4099)
# Summary

This PR blocks reconciliation when we attempt to scale components up and down at the same time (both ways). When changing the member distribution of a component, it should only scale down or up. A component can be a shard, the config servers, or mongos.

E.g. (for a distribution of a component over 3 clusters):
- Scaling from 1 2 2 to 3 3 2 would be valid.
- Scaling from 2 2 2 to 1 2 3 would be invalid, as the component is scaling from 2 to 1 members on cluster 1, but from 2 to 3 on cluster 3.

Multiple cases are illustrated in unit tests.

We need to be defensive in these cases, because scaling both ways can create inconsistencies in the behaviour of the operator. For example, when we decide whether we should publish the automation config first, we check at the cluster level, so we can detect a scale down on one cluster while the global replica set (across all clusters) is scaling up. This change will also impact single-cluster sharded clusters (e.g. scaling up shards, but scaling down config servers).

**Note:** we chose to do this check at runtime rather than in webhooks. Doing it at the webhook level would require duplicating a lot of logic from the reconciler to compute the desired configuration based on shard overrides, member config, etc. With the scalers available, it is much easier and makes more sense to do it at runtime.

## List of changes

- New method in the sharded reconciler to block reconciliation if needed.
- Fixed a bug which might cause losing the current sizes from the deployment state when upgrading the operator from <=1.27 (10gen/ops-manager-kubernetes@0fa2d50).
- Unit test for this mechanism in various scaling scenarios.
- Modified the `MultiClusterReplicaSetScaler` interface to avoid type casting.
- Fixed E2E tests that were attempting to scale both ways.
- Added a constant for "slaney", the default name of a sharded cluster in unit tests.

## Documentation

This ticket will be closed with the documentation changes required to update our public documentation.

## Proof of Work

Unit test `TestBlockReconcileScalingBothWays`.

Example error message:

```
Cannot perform scale up and scale down operations at the same time. Scaling Up: [Component=shard idx 0, Cluster=__default, Current=0, Desired=2;], Scaling Down: [Component=configSrv, Cluster=__default, Current=3, Desired=2;]"
```

**Note:** The e2e test `e2e_operator_upgrade_sharded_cluster` is failing and we are aware of that. It is related to how the state configmap is handled during a migration. We opened this PR nonetheless, as this failure is not directly related to these changes, and we are working on adding a fix.

---------

Co-authored-by: Łukasz Sierant <[email protected]>
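For illustration, here is a minimal, self-contained sketch of the per-cluster direction check described above, applied to the two example distributions from the summary. The helper name and plain-slice inputs are hypothetical; the operator's actual check works on scaler instances (see `blockScalingBothWays` in the controller diff below).

```go
package main

import "fmt"

// scalesBothWays reports whether moving from the current to the desired
// per-cluster member distribution scales up on one cluster while scaling
// down on another. This only mirrors the idea on plain slices.
func scalesBothWays(current, desired []int) bool {
	up, down := false, false
	for i := range current {
		if desired[i] > current[i] {
			up = true
		}
		if desired[i] < current[i] {
			down = true
		}
	}
	return up && down
}

func main() {
	fmt.Println(scalesBothWays([]int{1, 2, 2}, []int{3, 3, 2})) // false: scaling up only, allowed
	fmt.Println(scalesBothWays([]int{2, 2, 2}, []int{1, 2, 3})) // true: cluster 1 scales down while cluster 3 scales up, blocked
}
```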
1 parent a88de45 commit 5bbf035

14 files changed: +604 −97 lines changed

RELEASE_NOTES.md

Lines changed: 3 additions & 0 deletions
```diff
@@ -2,6 +2,9 @@
 <!-- Next Release -->
 # MongoDB Enterprise Kubernetes Operator 1.32.0
 
+## New Features
+* **MongoDB**: To ensure the correctness of scaling operations, a new validation has been added to Sharded Cluster deployments. This validation restricts scaling different components in two directions simultaneously within a single change to the YAML file. For example, it is not allowed to add more nodes (scaling up) to shards while simultaneously removing (scaling down) config servers or mongos. This restriction also applies to multi-cluster deployments. A simple change that involves "moving" one node from one cluster to another—without altering the total number of members—will now be blocked. It is necessary to perform a scale-up operation first and then execute a separate change for scaling down.
+
 ## Bug Fixes
 * Fixes the bug when status of `MongoDBUser` was being set to `Updated` prematurely. For example, new users were not immediately usable following `MongoDBUser` creation despite the operator reporting `Updated` state.
 * Fixed a bug causing cluster health check issues when ordering of users and tokens differed in Kubeconfig.
```

controllers/om/automation_status_test.go

Lines changed: 2 additions & 1 deletion
```diff
@@ -39,7 +39,8 @@ func TestCheckAutomationStatusIsGoal(t *testing.T) {
                relevantProcesses: []string{"a", "b"},
            },
            expectedResult: true,
-           expectedMsg:    "processes that reached goal state: [a b]",
+           // We can not check for the full message as the ordering of the processes won't be deterministic (stored in a map)
+           expectedMsg: "processes that reached goal state:",
        }, {
            name: "one not in goal state",
            args: args{
```

controllers/operator/common_controller_test.go

Lines changed: 1 addition & 1 deletion
```diff
@@ -480,7 +480,7 @@ func checkReconcilePending(ctx context.Context, t *testing.T, reconciler reconci
    assert.Nil(t, e, "When pending, error should be nil")
    assert.Equal(t, failedResult, result)
 
-   // also need to make sure the object status is updated to failed
+   // also need to make sure the object status is pending
    assert.NoError(t, client.Get(ctx, mock.ObjectKeyFromApiObject(object), object))
    assert.Equal(t, status.PhasePending, object.Status.Phase)
    assert.Contains(t, object.Status.Message, expectedErrorMessage)
```

controllers/operator/construct/scalers/appdb_scaler.go

Lines changed: 6 additions & 0 deletions
```diff
@@ -33,6 +33,10 @@ func (s *appDBSingleClusterScaler) DesiredReplicas() int {
    return s.opsManager.Spec.AppDB.Members
 }
 
+func (s *appDBSingleClusterScaler) TargetReplicas() int {
+   return s.DesiredReplicas()
+}
+
 func (s *appDBSingleClusterScaler) CurrentReplicas() int {
    return s.opsManager.Status.AppDbStatus.Members
 }
@@ -48,3 +52,5 @@ func (s *appDBSingleClusterScaler) MemberClusterName() string {
 func (s *appDBSingleClusterScaler) MemberClusterNum() int {
    return 0
 }
+
+func (s *appDBSingleClusterScaler) ScalerDescription() string { return "AppDB" }
```

controllers/operator/construct/scalers/interfaces/interfaces.go

Lines changed: 3 additions & 0 deletions
```diff
@@ -5,6 +5,9 @@ import "github.com/mongodb/mongodb-kubernetes-operator/pkg/util/scale"
 type MultiClusterReplicaSetScaler interface {
    scale.ReplicaSetScaler
    ScalingFirstTime() bool
+   TargetReplicas() int
    MemberClusterName() string
    MemberClusterNum() int
+   // ScalerDescription contains the name of the component associated to that scaler (shard, config server, AppDB...)
+   ScalerDescription() string
 }
```
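For context, here is a simplified illustration of why `TargetReplicas()` and `ScalerDescription()` were promoted to the interface: callers can work against any scaler implementation without asserting the concrete type (compare the `shouldContinueScalingOneByOne` hunk in the controller diff below). The interface names, method sets, and `fakeScaler` type here are illustrative stand-ins, not the operator's actual definitions.

```go
package main

import "fmt"

// Simplified stand-ins for the real interfaces.
type replicaSetScaler interface {
	CurrentReplicas() int
	DesiredReplicas() int
}

type multiClusterReplicaSetScaler interface {
	replicaSetScaler
	TargetReplicas() int
	ScalerDescription() string
}

// fakeScaler is a hypothetical implementation used only for this illustration.
type fakeScaler struct {
	current, target int
	desc            string
}

func (f fakeScaler) CurrentReplicas() int      { return f.current }
func (f fakeScaler) DesiredReplicas() int      { return f.target }
func (f fakeScaler) TargetReplicas() int       { return f.target }
func (f fakeScaler) ScalerDescription() string { return f.desc }

func main() {
	// With TargetReplicas() and ScalerDescription() on the interface, callers
	// stay generic; no assertion to a concrete scaler type is required.
	var s multiClusterReplicaSetScaler = fakeScaler{current: 2, target: 3, desc: "shard idx 0"}
	fmt.Printf("%s: %d -> %d\n", s.ScalerDescription(), s.CurrentReplicas(), s.TargetReplicas())
}
```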

controllers/operator/construct/scalers/replicaset_scaler.go

Lines changed: 4 additions & 0 deletions
```diff
@@ -142,6 +142,10 @@ func (s *MultiClusterReplicaSetScaler) MemberClusterNum() int {
    return s.memberClusterNum
 }
 
+func (s *MultiClusterReplicaSetScaler) ScalerDescription() string {
+   return s.scalerDescription
+}
+
 func (s *MultiClusterReplicaSetScaler) String() string {
    return fmt.Sprintf("{MultiClusterReplicaSetScaler (%s): still scaling: %t (finishing this reconcile: %t), clusterName=%s, clusterIdx=%d, current/target replicas:%d/%d, "+
        "replicas this reconciliation: %d, scaling first time: %t}", s.scalerDescription, s.CurrentReplicas() != s.TargetReplicas(), scale.ReplicasThisReconciliation(s) == s.TargetReplicas(), s.memberClusterName, s.memberClusterNum,
```

controllers/operator/mongodbshardedcluster_controller.go

Lines changed: 74 additions & 9 deletions
```diff
@@ -91,6 +91,21 @@ type ShardedClusterDeploymentState struct {
    Status *mdbv1.MongoDbStatus `json:"status"`
 }
 
+// updateStatusFromResourceStatus updates the status in the deployment state with values from the resource status with additional ensurance that no data is accidentally lost.
+// In a rare situation when we're performing an upgrade of the operator from non-deployment state version (<=1.27) the migrateToNewDeploymentState
+// function correctly migrates the sizes of the cluster, but then, in case of an early return (in case of any error or waiting too long for the sts/agents)
+// the updateStatus might clear the migrated data.
+// This function ensures we're copying the status, but at the same time we're not losing those sizes from the deployment state.
+// The logic of updateStatus in the reconciler works on options. If the option is not passed, the value is not updated, but it's also not cleared if the option is not passed.
+// Early returns with updateStatus don't pass any options, so the calculated status shouldn't clear the sizes we've just calculated into the deployment state.
+func (s *ShardedClusterDeploymentState) updateStatusFromResourceStatus(statusFromResource mdbv1.MongoDbStatus) {
+   resultStatus := statusFromResource.DeepCopy()
+   if resultStatus.SizeStatusInClusters == nil && s.Status.SizeStatusInClusters != nil {
+       resultStatus.SizeStatusInClusters = s.Status.SizeStatusInClusters.DeepCopy()
+   }
+   s.Status = resultStatus
+}
+
 func NewShardedClusterDeploymentState() *ShardedClusterDeploymentState {
    return &ShardedClusterDeploymentState{
        CommonDeploymentState: CommonDeploymentState{ClusterMapping: map[string]int{}},
@@ -204,6 +219,11 @@ func (r *ShardedClusterReconcileHelper) createShardsMemberClusterLists(shardsMap
                return count
            }
        }
+       // Because we store one common distribution for all shards in ShardMongodsInClusters, we need to make sure
+       // we assign a size of 0 to newly created shards, as they haven't scaled yet.
+       if shardIdx >= deploymentState.Status.ShardCount {
+           return 0
+       }
        if count, ok := deploymentState.Status.SizeStatusInClusters.ShardMongodsInClusters[memberClusterName]; ok {
            // Otherwise get the default one ShardMongodsInClusters
            // ShardMongodsInClusters is not correct in the edge case where all shards are overridden
@@ -647,6 +667,41 @@ func NewShardedClusterReconcilerHelper(ctx context.Context, reconciler *Reconcil
    return helper, nil
 }
 
+func blockScalingBothWays(desiredReplicasScalers []interfaces.MultiClusterReplicaSetScaler) error {
+   scalingUp := false
+   scalingDown := false
+   var scalingUpLogs []string
+   var scalingDownLogs []string
+
+   // We have one scaler instance per component per cluster. That means we block scaling both ways across components,
+   // but also within a single component
+   // For example, if a component (e.g the config server) tries to scale up on member cluster 1 and scale down on
+   // member cluster 2, reconciliation will be blocked, even if the total number of replicas for this component stays
+   // the same.
+   for _, mcScaler := range desiredReplicasScalers {
+       desired := mcScaler.TargetReplicas()
+       current := mcScaler.CurrentReplicas()
+       logMessage := fmt.Sprintf("Component=%s, Cluster=%s, Current=%d, Desired=%d;", mcScaler.ScalerDescription(), mcScaler.MemberClusterName(), current, desired)
+       if desired > current {
+           scalingUp = true
+           scalingUpLogs = append(scalingUpLogs, logMessage)
+       }
+       if desired < current {
+           scalingDown = true
+           scalingDownLogs = append(scalingDownLogs, logMessage)
+       }
+   }
+
+   if scalingUp && scalingDown {
+       return xerrors.Errorf(
+           "Cannot perform scale up and scale down operations at the same time. Scaling Up: %v, Scaling Down: %v",
+           scalingUpLogs, scalingDownLogs,
+       )
+   }
+
+   return nil
+}
+
 func (r *ShardedClusterReconcileHelper) initializeStateStore(ctx context.Context, reconciler *ReconcileCommonController, sc *mdbv1.MongoDB, log *zap.SugaredLogger) error {
    r.deploymentState = NewShardedClusterDeploymentState()
 
@@ -724,17 +779,26 @@ func (r *ShardedClusterReconcileHelper) Reconcile(ctx context.Context, log *zap.
        return r.commonController.updateStatus(ctx, sc, workflow.Invalid("%s", err.Error()), log)
    }
 
-   if !architectures.IsRunningStaticArchitecture(sc.Annotations) {
-       agents.UpgradeAllIfNeeded(ctx, agents.ClientSecret{Client: r.commonController.client, SecretClient: r.commonController.SecretClient}, r.omConnectionFactory, GetWatchedNamespace(), false)
-   }
-
    log.Info("-> ShardedCluster.Reconcile")
    log.Infow("ShardedCluster.Spec", "spec", sc.Spec)
    log.Infow("ShardedCluster.Status", "status", r.deploymentState.Status)
    log.Infow("ShardedCluster.deploymentState", "sizeStatus", r.deploymentState.Status.MongodbShardedClusterSizeConfig, "sizeStatusInClusters", r.deploymentState.Status.SizeStatusInClusters)
 
    r.logAllScalers(log)
 
+   // After processing normal validations, we check for conflicting scale-up and scale-down operations within the same
+   // reconciliation cycle. If both scaling directions are detected, we block the reconciliation.
+   // This is not currently possible to do it safely with the operator. We check direction of scaling to decide for
+   // global operations like publishing AC first.
+   // Therefore, we can obtain inconsistent behaviour in case scaling goes in both directions.
+   if err := blockScalingBothWays(r.getAllScalers()); err != nil {
+       return r.updateStatus(ctx, sc, workflow.Failed(err), log)
+   }
+
+   if !architectures.IsRunningStaticArchitecture(sc.Annotations) {
+       agents.UpgradeAllIfNeeded(ctx, agents.ClientSecret{Client: r.commonController.client, SecretClient: r.commonController.SecretClient}, r.omConnectionFactory, GetWatchedNamespace(), false)
+   }
+
    projectConfig, credsConfig, err := project.ReadConfigAndCredentials(ctx, r.commonController.client, r.commonController.SecretClient, sc, log)
    if err != nil {
        return r.updateStatus(ctx, sc, workflow.Failed(err), log)
@@ -2238,7 +2302,7 @@ func (r *ShardedClusterReconcileHelper) updateStatus(ctx context.Context, resour
    } else {
        // UpdateStatus in the sharded cluster controller should be executed only once per reconcile (always with a return)
        // We are saving the status and writing back to the state configmap at this time
-       r.deploymentState.Status = resource.Status.DeepCopy()
+       r.deploymentState.updateStatusFromResourceStatus(resource.Status)
        if err := r.stateStore.WriteState(ctx, r.deploymentState, log); err != nil {
            return r.commonController.updateStatus(ctx, resource, workflow.Failed(xerrors.Errorf("Failed to write deployment state after updating status: %w", err)), log, nil)
        }
@@ -2282,11 +2346,12 @@ func (r *ShardedClusterReconcileHelper) GetShardScaler(shardIdx int, memberClust
    return scalers.NewMultiClusterReplicaSetScaler(fmt.Sprintf("shard idx %d", shardIdx), r.desiredShardsConfiguration[shardIdx].ClusterSpecList, memberCluster.Name, memberCluster.Index, r.shardsMemberClustersMap[shardIdx])
 }
 
-func (r *ShardedClusterReconcileHelper) getAllScalers() []scale.ReplicaSetScaler {
-   var result []scale.ReplicaSetScaler
+func (r *ShardedClusterReconcileHelper) getAllScalers() []interfaces.MultiClusterReplicaSetScaler {
+   var result []interfaces.MultiClusterReplicaSetScaler
    for shardIdx := 0; shardIdx < r.sc.Spec.ShardCount; shardIdx++ {
        for _, memberCluster := range r.shardsMemberClustersMap[shardIdx] {
-           result = append(result, r.GetShardScaler(shardIdx, memberCluster))
+           scaler := r.GetShardScaler(shardIdx, memberCluster)
+           result = append(result, scaler)
        }
    }
 
@@ -2673,7 +2738,7 @@ func (r *ShardedClusterReconcileHelper) isStillScaling() bool {
 // The difference vs isStillScaling is subtle. isStillScaling tells us if we're generally in the process of scaling (current sizes != spec).
 func (r *ShardedClusterReconcileHelper) shouldContinueScalingOneByOne() bool {
    for _, s := range r.getAllScalers() {
-       if scale.ReplicasThisReconciliation(s) != s.(*scalers.MultiClusterReplicaSetScaler).TargetReplicas() {
+       if scale.ReplicasThisReconciliation(s) != s.TargetReplicas() {
            return true
        }
    }
```
