
Commit d10efc3

Julien-Benlsierant authored and committed
CLOUDP-288588: MC Sharded DR - workaround deadlocked mongos (#4055)
# Summary

This patch implements a workaround for the deadlock that occurs when the config server topology is changed (the process list is modified) while mongos processes are undergoing a rolling restart. See the [description in TD](REDACTED) for more information.

Changes:

- A new way (`GetMongoDBDeploymentState`) of gathering the deployment state by combining agents' signals from two endpoints (agent pings and agent goal statuses). We currently use them separately in [waitForAgentsToRegister](https://github.com/10gen/ops-manager-kubernetes/blob/6a02d755cbb2f54955c9510e97b1aee6e9ce89e2/controllers/operator/mongodbshardedcluster_controller.go#L1651) and in [WaitForReadyState](https://github.com/10gen/ops-manager-kubernetes/blob/6a02d755cbb2f54955c9510e97b1aee6e9ce89e2/controllers/operator/mongodbshardedcluster_controller.go#L1679). Combining those signals in one structure makes it easier to reason about the state of the deployment.
- For the deadlock workaround, we use `GetMongoDBDeploymentState` to assess whether:
  - only mongos processes have not reached their goal state
  - there are stale processes (whose agents haven't reported pings for >2 min), indicating a disaster scenario (e.g. a k8s cluster being down)
  - the agents of those mongos processes are performing the RollingChangeArgs move of their plan
  - on top of that, we additionally verify that we're in the process of scaling (e.g. removing unhealthy nodes, but we don't restrict the workaround to only that case)
- If all of the above is satisfied, we allow the operator to **not wait for mongos to reach goal state** with their AC, because they cannot progress while unhealthy nodes are present in the cluster (an illustrative sketch of this decision follows the description below).
- `GetMongoDBDeploymentState` might be used in the future to refactor and unify the existing checks for registered agents and goal states.
- A new way of reusing e2e tests for both CloudQA and Ops Manager (see `multi_cluster_sharded_disaster_recovery.py`). This is implemented by checking whether the `ops_manager_version` env var is `cloud_qa`; if it is, we're in the variant that runs the test against Cloud QA, otherwise the test deploys OM and uses it to deploy the resources.

## Proof of Work

### Patch with the workaround enabled ([evg link](https://spruce.mongodb.com/task/ops_manager_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_disaster_recovery_patch_647ca535b9163e5e9eb9025516f3407bf5ad831a_67af0eac1b1cc300075c0b3b_25_02_14_09_36_45?execution=0&sortBy=STATUS&sortDir=ASC))

At some point the operator detects the situation [here](https://parsley.mongodb.com/taskFile/ops_manager_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_disaster_recovery_patch_647ca535b9163e5e9eb9025516f3407bf5ad831a_67af0eac1b1cc300075c0b3b_25_02_14_09_36_45/0/mongodb-enterprise-operator-multi-cluster-6d9cfcfc4b-zflvq-mongodb-enterprise-operator-multi-cluster.log?bookmarks=0%2C19689&selectedLineRange=L1054&shareLine=1054):

```
2025-02-14T10:12:20.522Z WARN operator/mongodbshardedcluster_controller.go:2864 Detected mongos [{Hostname:sh-disaster-recovery-mongos-0-0-svc.a-1739526583-2kq82uy69jz.svc.cluster.local LastAgentPing:2025-02-14 10:11:49 +0000 UTC GoalVersionAchieved:4 Plan:[RollingChangeArgs] ProcessName:sh-disaster-recovery-mongos-0-0}] performing RollingChangeArgs operation while there are processes in the cluster that are considered down. Skipping waiting for those mongos processes in order to allow the operator to perform scaling. Please verify the list of stale (down/unhealthy) processes and change MongoDB resource to remove them from the cluster. The operator will not perform removal of those procesess automatically. Hostnames of stale processes: [sh-disaster-recovery-mongos-2-0-svc.a-1739526583-2kq82uy69jz.svc.cluster.local sh-disaster-recovery-mongos-2-1-svc.a-1739526583-2kq82uy69jz.svc.cluster.local sh-disaster-recovery-0-2-0-svc.a-1739526583-2kq82uy69jz.svc.cluster.local sh-disaster-recovery-1-2-0-svc.a-1739526583-2kq82uy69jz.svc.cluster.local sh-disaster-recovery-1-2-1-svc.a-1739526583-2kq82uy69jz.svc.cluster.local sh-disaster-recovery-config-2-0-svc.a-1739526583-2kq82uy69jz.svc.cluster.local]
```

We can see that the list of stale processes contains all the hosts from the failed cluster (clusterIdx=2). The next line shows that the mongos is removed from the list of processes whose goal state we wait for:

```
2025-02-14T10:12:20.522Z WARN operator/mongodbshardedcluster_controller.go:2762 The following processes are skipped from waiting for the goal state: [sh-disaster-recovery-mongos-0-0] {"ShardedCluster": "a-1739526583-2kq82uy69jz/sh-disaster-recovery"}
```

The operator then scales down those failed processes, and we can observe the last deadlock detection [here](https://parsley.mongodb.com/taskFile/ops_manager_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_disaster_recovery_patch_647ca535b9163e5e9eb9025516f3407bf5ad831a_67af0eac1b1cc300075c0b3b_25_02_14_09_36_45/0/mongodb-enterprise-operator-multi-cluster-6d9cfcfc4b-zflvq-mongodb-enterprise-operator-multi-cluster.log?bookmarks=0%2C19689&highlights=Updating%2520status%253A%2520phase%253DRunning%2CRollingChangeArgs%2Cperforming%2520RollingChangeArgs%2520operation%2520while%2520there%2520are%2520processes%2520in%2520the%2520cluster%2520that%2520are%2520considered%2520down&selectedLineRange=L1882&shareLine=1882).

Afterwards, the mongos is the only process the operator is waiting for (deadlock detection didn't kick in this time because not all conditions are satisfied - there are no stale processes anymore - so the mongos is not removed from the list to wait for):

```
MongoDB agents haven't reached READY state; 1 processes waiting to reach automation config goal state (version=7): [sh-disaster-recovery-mongos-0-0@4], 9 processes reached goal state: [sh-disaster-recovery-0-0-0 sh-disaster-recovery-0-0-1 sh-disaster-recovery-config-1-0 sh-disaster-recovery-1-1-0 sh-disaster-recovery-1-0-1 sh-disaster-recovery-0-1-0 sh-disaster-recovery-1-0-0 sh-disaster-recovery-config-0-0 sh-disaster-recovery-config-0-1]
```

[Later](https://parsley.mongodb.com/taskFile/ops_manager_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_disaster_recovery_patch_647ca535b9163e5e9eb9025516f3407bf5ad831a_67af0eac1b1cc300075c0b3b_25_02_14_09_36_45/0/mongodb-enterprise-operator-multi-cluster-6d9cfcfc4b-zflvq-mongodb-enterprise-operator-multi-cluster.log?bookmarks=0%2C19689&highlights=Updating%2520status%253A%2520phase%253DRunning%2CRollingChangeArgs%2Cperforming%2520RollingChangeArgs%2520operation%2520while%2520there%2520are%2520processes%2520in%2520the%2520cluster%2520that%2520are%2520considered%2520down&selectedLineRange=L2799&shareLine=2799) we can see the cluster reach the running state, and the deployment state contains the scaled-down member counts in cluster-3:

```json
{
  "sizeStatusInClusters": {
    "shardMongodsInClusters": {
      "kind-e2e-cluster-1": 2,
      "kind-e2e-cluster-2": 1,
      "kind-e2e-cluster-3": 0
    },
    "mongosCountInClusters": {
      "kind-e2e-cluster-1": 1,
      "kind-e2e-cluster-2": 0,
      "kind-e2e-cluster-3": 0
    },
    "configServerMongodsInClusters": {
      "kind-e2e-cluster-1": 2,
      "kind-e2e-cluster-2": 1,
      "kind-e2e-cluster-3": 0
    }
  }
}
```

### Patch with the workaround manually commented out ([evg link](https://spruce.mongodb.com/task/ops_manager_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_disaster_recovery_patch_647ca535b9163e5e9eb9025516f3407bf5ad831a_67af0e778e05410007c4c158_25_02_14_09_35_52?execution=4&sortBy=STATUS&sortDir=ASC))

We can [see](https://parsley.mongodb.com/taskFile/ops_manager_kubernetes_e2e_multi_cluster_kind_e2e_multi_cluster_sharded_disaster_recovery_patch_647ca535b9163e5e9eb9025516f3407bf5ad831a_67af0e778e05410007c4c158_25_02_14_09_35_52/0/mongodb-enterprise-operator-multi-cluster-68586b465c-fz7cw-mongodb-enterprise-operator-multi-cluster.log?bookmarks=0%2C3076&selectedLineRange=L2916&shareLine=2916) the operator waiting for a long time for the deadlocked mongos, and the test times out after 20 mins:

```
automation agents haven't reached READY state during defined interval: MongoDB agents haven't reached READY state; 1 processes waiting to reach automation config goal state (version=5): [sh-disaster-recovery-mongos-0-0@4], 9 processes reached goal state: [sh-disaster-recovery-0-1-0 sh-disaster-recovery-1-1-0 sh-disaster-recovery-1-0-1 sh-disaster-recovery-1-0-0 sh-disaster-recovery-0-0-0 sh-disaster-recovery-0-0-1 sh-disaster-recovery-config-0-0 sh-disaster-recovery-config-1-0 sh-disaster-recovery-config-0-1]
```

---------

Co-authored-by: Łukasz Sierant <[email protected]>
1 parent abed091 commit d10efc3

22 files changed: +1777 -197 lines

.evergreen.yml

Lines changed: 4 additions & 3 deletions
@@ -790,8 +790,7 @@ task_groups:
       - e2e_multi_cluster_sharded_simplest_no_mesh
       - e2e_multi_cluster_sharded_tls_no_mesh
       - e2e_multi_cluster_sharded_tls
-      # To re-activate as part of https://jira.mongodb.org/browse/CLOUDP-288588
-      #- e2e_multi_cluster_sharded_disaster_recovery
+      - e2e_multi_cluster_sharded_disaster_recovery
       - e2e_sharded_cluster
       - e2e_sharded_cluster_agent_flags
       - e2e_sharded_cluster_custom_podspec
@@ -869,6 +868,7 @@ task_groups:
       - e2e_om_update_before_reconciliation
       - e2e_om_feature_controls
       - e2e_multi_cluster_appdb_state_operator_upgrade_downgrade
+      - e2e_multi_cluster_sharded_disaster_recovery
       # disabled tests:
       # - e2e_om_multiple # multi-cluster failures in EVG
       # - e2e_om_appdb_scale_up_down # test not "reused" for multi-cluster appdb
@@ -921,14 +921,15 @@ task_groups:
       - e2e_multi_cluster_appdb_state_operator_upgrade_downgrade
       - e2e_om_update_before_reconciliation
       - e2e_om_feature_controls
+      - e2e_multi_cluster_sharded_disaster_recovery
     <<: *teardown_group
 
   # Dedicated task group for deploying OM Multi-Cluster when the operator is in the central cluster
   # that is not in the mesh
   - name: e2e_multi_cluster_om_operator_not_in_mesh_task_group
     max_hosts: -1
     <<: *setup_group_multi_cluster
-    <<: *setup_and_teardown_task_cloudqa
+    <<: *setup_and_teardown_task
     tasks:
       - e2e_multi_cluster_om_clusterwide_operator_not_in_mesh_networking
     <<: *teardown_group

controllers/om/agent.go

Lines changed: 3 additions & 3 deletions
@@ -9,7 +9,7 @@ import (
 
 // Checks if the agents have registered.
 
-type automationAgentStatusResponse struct {
+type AutomationAgentStatusResponse struct {
     OMPaginated
     AutomationAgents []AgentStatus `json:"results"`
 }
@@ -22,7 +22,7 @@ type AgentStatus struct {
     TypeName string `json:"typeName"`
 }
 
-var _ Paginated = automationAgentStatusResponse{}
+var _ Paginated = AutomationAgentStatusResponse{}
 
 // IsRegistered will return true if this given agent has `hostname_prefix` as a
 // prefix. This is needed to check if the given agent has registered.
@@ -48,7 +48,7 @@ func (agent AgentStatus) IsRegistered(hostnamePrefix string, log *zap.SugaredLog
 }
 
 // Results are needed to fulfil the Paginated interface
-func (aar automationAgentStatusResponse) Results() []interface{} {
+func (aar AutomationAgentStatusResponse) Results() []interface{} {
     ans := make([]interface{}, len(aar.AutomationAgents))
     for i, aa := range aar.AutomationAgents {
         ans[i] = aa

controllers/om/deployment.go

Lines changed: 1 addition & 0 deletions
@@ -921,6 +921,7 @@ func (d Deployment) removeProcesses(processNames []string, log *zap.SugaredLogge
         for _, p2 := range processNames {
             if p.Name() == p2 {
                 found = true
+                break
             }
         }
         if !found {

controllers/om/mockedomclient.go

Lines changed: 10 additions & 2 deletions
@@ -68,6 +68,9 @@ type MockedOmConnection struct {
     hostResults      *host.Result
     agentHostnameMap map[string]struct{}
 
+    ReadAutomationStatusFunc func() (*AutomationStatus, error)
+    ReadAutomationAgentsFunc func(int) (Paginated, error)
+
     numRequestsSent int
     AgentAPIKey     string
     OrganizationsWithGroups map[*Organization][]*Project
@@ -449,6 +452,9 @@ func (oc *MockedOmConnection) GenerateAgentKey() (string, error) {
 
 func (oc *MockedOmConnection) ReadAutomationStatus() (*AutomationStatus, error) {
     oc.addToHistory(reflect.ValueOf(oc.ReadAutomationStatus))
+    if oc.ReadAutomationStatusFunc != nil {
+        return oc.ReadAutomationStatusFunc()
+    }
 
     if oc.AgentsDelayCount <= 0 {
         // Emulating "agents reached goal state": returning the proper status for all the
@@ -462,15 +468,17 @@ func (oc *MockedOmConnection) ReadAutomationStatus() (*AutomationStatus, error)
 
 func (oc *MockedOmConnection) ReadAutomationAgents(pageNum int) (Paginated, error) {
     oc.addToHistory(reflect.ValueOf(oc.ReadAutomationAgents))
+    if oc.ReadAutomationAgentsFunc != nil {
+        return oc.ReadAutomationAgentsFunc(pageNum)
+    }
 
     results := make([]AgentStatus, 0)
     for _, r := range oc.hostResults.Results {
         results = append(results,
             AgentStatus{Hostname: r.Hostname, LastConf: time.Now().Add(time.Second * -1).Format(time.RFC3339)})
     }
 
-    // todo extend this for real testing
-    return automationAgentStatusResponse{AutomationAgents: results}, nil
+    return AutomationAgentStatusResponse{AutomationAgents: results}, nil
 }
 
 func (oc *MockedOmConnection) GetHosts() (*host.Result, error) {
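The two new hook fields above make it possible to simulate the deadlock scenario in unit tests. The following is a hypothetical sketch, not a test from this commit: the helper name, package, import path, and hostnames are made up, while `MockedOmConnection.ReadAutomationStatusFunc`, `ReadAutomationAgentsFunc`, `AutomationAgentStatusResponse`, `AgentStatus`, `AutomationStatus`, and `ProcessStatus` are taken from the diffs in this commit.

```go
package om_test

import (
	"time"

	"github.com/10gen/ops-manager-kubernetes/controllers/om"
)

// simulateDeadlockedMongos is a hypothetical helper: it wires the new hook
// fields so that the mocked connection reports one mongos still executing
// RollingChangeArgs below the goal AC version, plus one process whose agent
// stopped pinging (stale, i.e. older than the 2-minute threshold).
func simulateDeadlockedMongos(mockedOm *om.MockedOmConnection) {
	mockedOm.ReadAutomationAgentsFunc = func(pageNum int) (om.Paginated, error) {
		return om.AutomationAgentStatusResponse{AutomationAgents: []om.AgentStatus{
			// Healthy agent next to the mongos: pinged seconds ago.
			{TypeName: "AUTOMATION", Hostname: "mongos-0-0-svc", LastConf: time.Now().Add(-10 * time.Second).Format(time.RFC3339)},
			// Agent on the failed cluster: last ping well past the staleness threshold.
			{TypeName: "AUTOMATION", Hostname: "config-2-0-svc", LastConf: time.Now().Add(-10 * time.Minute).Format(time.RFC3339)},
		}}, nil
	}
	mockedOm.ReadAutomationStatusFunc = func() (*om.AutomationStatus, error) {
		return &om.AutomationStatus{
			GoalVersion: 5,
			Processes: []om.ProcessStatus{
				// The mongos is one AC version behind and in the RollingChangeArgs move.
				{Hostname: "mongos-0-0-svc", Name: "mongos-0-0", LastGoalVersionAchieved: 4, Plan: []string{"RollingChangeArgs"}},
				// The stale process had already reached the goal version before going down.
				{Hostname: "config-2-0-svc", Name: "config-2-0", LastGoalVersionAchieved: 5},
			},
		}, nil
	}
}
```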

controllers/om/omclient.go

Lines changed: 1 addition & 1 deletion
@@ -512,7 +512,7 @@ func (oc *HTTPOmConnection) ReadAutomationAgents(pageNum int) (Paginated, error)
     if err != nil {
         return nil, err
     }
-    var resp automationAgentStatusResponse
+    var resp AutomationAgentStatusResponse
     if err := json.Unmarshal(ans, &resp); err != nil {
         return nil, err
     }

controllers/om/replicaset/om_replicaset.go

Lines changed: 2 additions & 2 deletions
@@ -36,7 +36,7 @@ func BuildFromStatefulSetWithReplicas(set appsv1.StatefulSet, dbSpec mdbv1.DbSpe
 // https://jira.mongodb.org/browse/HELP-3818?focusedCommentId=1548348 for more details)
 // Note, that we are skipping setting nodes as "disabled" (but the code is commented to be able to revert this if
 // needed)
-func PrepareScaleDownFromMap(omClient om.Connection, rsMembers map[string][]string, healthyProcessesToWaitForGoalState []string, log *zap.SugaredLogger) error {
+func PrepareScaleDownFromMap(omClient om.Connection, rsMembers map[string][]string, processesToWaitForGoalState []string, log *zap.SugaredLogger) error {
     processes := make([]string, 0)
     for _, v := range rsMembers {
         processes = append(processes, v...)
@@ -59,7 +59,7 @@ func PrepareScaleDownFromMap(omClient om.Connection, rsMembers map[string][]stri
         return xerrors.Errorf("unable to set votes, priority to 0 in Ops Manager, hosts: %v, err: %w", processes, err)
     }
 
-    if err := om.WaitForReadyState(omClient, healthyProcessesToWaitForGoalState, false, log); err != nil {
+    if err := om.WaitForReadyState(omClient, processesToWaitForGoalState, false, log); err != nil {
         return err
     }
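With the parameter renamed to `processesToWaitForGoalState`, the caller decides which processes to wait on, which is how the workaround can exclude deadlocked mongos. Below is a hedged sketch of such a caller: the helper name, the filtering policy, and the import paths are assumptions, while `PrepareScaleDownFromMap`, `GetMongoDBClusterState`, and `IsStale` come from this commit.

```go
package example

import (
	"go.uber.org/zap"

	"github.com/10gen/ops-manager-kubernetes/controllers/om"
	"github.com/10gen/ops-manager-kubernetes/controllers/om/replicaset"
	"github.com/10gen/ops-manager-kubernetes/controllers/operator/agents"
)

// prepareScaleDownSkippingStale is an illustrative caller: it waits only for
// processes whose agents are still pinging Ops Manager, which is the kind of
// filtered list the renamed parameter is meant to receive.
func prepareScaleDownSkippingStale(omClient om.Connection, rsMembers map[string][]string, log *zap.SugaredLogger) error {
	clusterState, err := agents.GetMongoDBClusterState(omClient)
	if err != nil {
		return err
	}

	var processesToWaitFor []string
	for _, p := range clusterState.GetProcesses() {
		// Skip stale processes: their agents have not pinged OM for more than 2 minutes.
		if !p.IsStale() && p.ProcessName != "" {
			processesToWaitFor = append(processesToWaitFor, p.ProcessName)
		}
	}

	return replicaset.PrepareScaleDownFromMap(omClient, rsMembers, processesToWaitFor, log)
}
```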

controllers/operator/agents/agents.go

Lines changed: 184 additions & 41 deletions
@@ -3,6 +3,9 @@ package agents
 import (
     "context"
     "fmt"
+    "maps"
+    "slices"
+    "time"
 
     "go.uber.org/zap"
     "golang.org/x/xerrors"
@@ -30,6 +33,8 @@ type retryParams struct {
     retrials int
 }
 
+const RollingChangeArgs = "RollingChangeArgs"
+
 // EnsureAgentKeySecretExists checks if the Secret with specified name (<groupId>-group-secret) exists, otherwise tries to
 // generate agent key using OM public API and create Secret containing this key. Generation of a key is expected to be
 // a rare operation as the group creation api generates agent key already (so the only possible situation is when the group
@@ -107,6 +112,184 @@ func getAgentRegisterError(errorMsg string) error {
         "name ('cluster.local'): %s", errorMsg))
 }
 
+const StaleProcessDuration = time.Minute * 2
+
+// ProcessState represents the state of the mongodb process.
+// Most importantly it contains the information whether the node is down (precisely whether the agent running next to mongod is actively reporting pings to OM),
+// what is the last version of the automation config achieved and the step on which the agent is currently executing the plan.
+type ProcessState struct {
+    Hostname            string
+    LastAgentPing       time.Time
+    GoalVersionAchieved int
+    Plan                []string
+    ProcessName         string
+}
+
+// NewProcessState should be used to create new instances of ProcessState as it sets some reasonable default values.
+// As ProcessState is combining the data from two sources, we don't have any guarantees that we'll have the information about the given hostname
+// available from both sources, therefore we need to always assume some defaults.
+func NewProcessState(hostname string) ProcessState {
+    return ProcessState{
+        Hostname:            hostname,
+        LastAgentPing:       time.Time{},
+        GoalVersionAchieved: -1,
+        Plan:                nil,
+    }
+}
+
+// IsStale returns true if this process is considered down, i.e. last ping of the agent is later than 2 minutes ago
+// We use an in-the-middle value when considering the process to be down:
+// - in waitForAgentsToRegister we use 1 min to consider the process "not registered"
+// - Ops Manager is using 5 mins as a default for considering process as stale
+func (p ProcessState) IsStale() bool {
+    return p.LastAgentPing.Add(StaleProcessDuration).Before(time.Now())
+}
+
+// MongoDBClusterStateInOM represents the state of the whole deployment from the Ops Manager's perspective by combining singnals about the processes from two sources:
+// - from om.Connection.ReadAutomationAgents to get last ping of the agent (/groups/<groupId>/agents/AUTOMATION)
+// - from om.Connection.ReadAutomationStatus to get the list of agent health statuses, AC version achieved, step of the agent's plan (/groups/<groupId>/automationStatus)
+type MongoDBClusterStateInOM struct {
+    GoalVersion     int
+    ProcessStateMap map[string]ProcessState
+}
+
+// GetMongoDBClusterState executes requests to OM from the given omConnection to gather the current deployment state.
+// It combines the data from the automation status and the list of automation agents.
+func GetMongoDBClusterState(omConnection om.Connection) (MongoDBClusterStateInOM, error) {
+    var agentStatuses []om.AgentStatus
+    _, err := om.TraversePages(
+        omConnection.ReadAutomationAgents,
+        func(aa interface{}) bool {
+            agentStatuses = append(agentStatuses, aa.(om.AgentStatus))
+            return false
+        },
+    )
+    if err != nil {
+        return MongoDBClusterStateInOM{}, xerrors.Errorf("error when reading automation agent pages: %v", err)
+    }
+
+    automationStatus, err := omConnection.ReadAutomationStatus()
+    if err != nil {
+        return MongoDBClusterStateInOM{}, xerrors.Errorf("error reading automation status: %v", err)
+    }
+
+    processStateMap, err := calculateProcessStateMap(automationStatus.Processes, agentStatuses)
+    if err != nil {
+        return MongoDBClusterStateInOM{}, err
+    }
+
+    return MongoDBClusterStateInOM{
+        GoalVersion:     automationStatus.GoalVersion,
+        ProcessStateMap: processStateMap,
+    }, nil
+}
+
+func (c *MongoDBClusterStateInOM) GetProcessState(hostname string) ProcessState {
+    if processState, ok := c.ProcessStateMap[hostname]; ok {
+        return processState
+    }
+
+    return NewProcessState(hostname)
+}
+
+func (c *MongoDBClusterStateInOM) GetProcesses() []ProcessState {
+    return slices.Collect(maps.Values(c.ProcessStateMap))
+}
+
+func (c *MongoDBClusterStateInOM) GetProcessesNotInGoalState() []ProcessState {
+    return slices.DeleteFunc(slices.Collect(maps.Values(c.ProcessStateMap)), func(processState ProcessState) bool {
+        return processState.GoalVersionAchieved >= c.GoalVersion
+    })
+}
+
+// calculateProcessStateMap combines information from ProcessStatuses and AgentStatuses returned by OpsManager
+// and maps them to a unified data structure.
+//
+// The resulting ProcessState combines information from both agent and process status when refer to the same hostname.
+// It is not guaranteed that we'll have the information from two sources, so in case one side is missing the defaults
+// would be present as defined in NewProcessState.
+// If multiple statuses exist for the same hostname, subsequent entries overwrite ones.
+// Fields such as GoalVersionAchieved default to -1 if never set, and Plan defaults to nil.
+// LastAgentPing defaults to the zero time if no AgentStatus entry is available.
+func calculateProcessStateMap(processStatuses []om.ProcessStatus, agentStatuses []om.AgentStatus) (map[string]ProcessState, error) {
+    processStates := map[string]ProcessState{}
+    for _, agentStatus := range agentStatuses {
+        if agentStatus.TypeName != "AUTOMATION" {
+            return nil, xerrors.Errorf("encountered unexpected agent type in agent status type in %+v", agentStatus)
+        }
+        processState, ok := processStates[agentStatus.Hostname]
+        if !ok {
+            processState = NewProcessState(agentStatus.Hostname)
+        }
+        lastPing, err := time.Parse(time.RFC3339, agentStatus.LastConf)
+        if err != nil {
+            return nil, xerrors.Errorf("wrong format for lastConf field: expected UTC format but the value is %s, agentStatus=%+v: %v", agentStatus.LastConf, agentStatus, err)
+        }
+        processState.LastAgentPing = lastPing
+
+        processStates[agentStatus.Hostname] = processState
+    }
+
+    for _, processStatus := range processStatuses {
+        processState, ok := processStates[processStatus.Hostname]
+        if !ok {
+            processState = NewProcessState(processStatus.Hostname)
+        }
+        processState.GoalVersionAchieved = processStatus.LastGoalVersionAchieved
+        processState.ProcessName = processStatus.Name
+        processState.Plan = processStatus.Plan
+        processStates[processStatus.Hostname] = processState
+    }
+
+    return processStates, nil
+}
+
+func agentCheck(omConnection om.Connection, agentHostnames []string, log *zap.SugaredLogger) (string, bool) {
+    registeredHostnamesSet := map[string]struct{}{}
+    predicateFunc := func(aa interface{}) bool {
+        automationAgent := aa.(om.Status)
+        for _, hostname := range agentHostnames {
+            if automationAgent.IsRegistered(hostname, log) {
+                registeredHostnamesSet[hostname] = struct{}{}
+                if len(registeredHostnamesSet) == len(agentHostnames) {
+                    return true
+                }
+            }
+        }
+        return false
+    }
+
+    _, err := om.TraversePages(
+        omConnection.ReadAutomationAgents,
+        predicateFunc,
+    )
+    if err != nil {
+        return fmt.Sprintf("Received error when reading automation agent pages: %v", err), false
+    }
+
+    // convert to list of keys only for pretty printing in the error message
+    var registeredHostnamesList []string
+    for hostname := range registeredHostnamesSet {
+        registeredHostnamesList = append(registeredHostnamesList, hostname)
+    }
+
+    var msg string
+    if len(registeredHostnamesList) == 0 {
+        return fmt.Sprintf("None of %d expected agents has registered with OM, expected hostnames: %+v", len(agentHostnames), agentHostnames), false
+    } else if len(registeredHostnamesList) == len(agentHostnames) {
+        return fmt.Sprintf("All of %d expected agents have registered with OM, hostnames: %+v", len(registeredHostnamesList), registeredHostnamesList), true
+    } else {
+        var missingHostnames []string
+        for _, expectedHostname := range agentHostnames {
+            if _, ok := registeredHostnamesSet[expectedHostname]; !ok {
+                missingHostnames = append(missingHostnames, expectedHostname)
+            }
+        }
+        msg = fmt.Sprintf("Only %d of %d expected agents have registered with OM, missing hostnames: %+v, registered hostnames in OM: %+v, expected hostnames: %+v", len(registeredHostnamesList), len(agentHostnames), missingHostnames, registeredHostnamesList, agentHostnames)
+        return msg, false
+    }
+}
+
 // waitUntilRegistered waits until all agents with 'agentHostnames' are registered in OM. Note, that wait
 // happens after retrial - this allows to skip waiting in case agents are already registered
 func waitUntilRegistered(omConnection om.Connection, log *zap.SugaredLogger, r retryParams, agentHostnames ...string) (bool, string) {
@@ -120,47 +303,7 @@ func waitUntilRegistered(omConnection om.Connection, log *zap.SugaredLogger, r r
     retrials := env.ReadIntOrDefault(util.PodWaitRetriesEnv, r.retrials)
 
     agentsCheckFunc := func() (string, bool) {
-        registeredHostnamesMap := map[string]struct{}{}
-        _, err := om.TraversePages(
-            omConnection.ReadAutomationAgents,
-            func(aa interface{}) bool {
-                automationAgent := aa.(om.Status)
-                for _, hostname := range agentHostnames {
-                    if automationAgent.IsRegistered(hostname, log) {
-                        registeredHostnamesMap[hostname] = struct{}{}
-                        if len(registeredHostnamesMap) == len(agentHostnames) {
-                            return true
-                        }
-                    }
-                }
-                return false
-            },
-        )
-        if err != nil {
-            log.Errorw("Received error when reading automation agent pages", "err", err)
-        }
-
-        // convert to list of keys only for pretty printing in the error message
-        var registeredHostnamesList []string
-        for hostname := range registeredHostnamesMap {
-            registeredHostnamesList = append(registeredHostnamesList, hostname)
-        }
-
-        var msg string
-        if len(registeredHostnamesList) == 0 {
-            return fmt.Sprintf("None of %d expected agents has registered with OM, expected hostnames: %+v", len(agentHostnames), agentHostnames), false
-        } else if len(registeredHostnamesList) == len(agentHostnames) {
-            return fmt.Sprintf("All of %d expected agents have registered with OM, hostnames: %+v", len(registeredHostnamesList), registeredHostnamesList), true
-        } else {
-            var missingHostnames []string
-            for _, expectedHostname := range agentHostnames {
-                if _, ok := registeredHostnamesMap[expectedHostname]; !ok {
-                    missingHostnames = append(missingHostnames, expectedHostname)
-                }
-            }
-            msg = fmt.Sprintf("Only %d of %d expected agents have registered with OM, missing hostnames: %+v, registered hostnames in OM: %+v, expected hostnames: %+v", len(registeredHostnamesList), len(agentHostnames), missingHostnames, registeredHostnamesList, agentHostnames)
-            return msg, false
-        }
+        return agentCheck(omConnection, agentHostnames, log)
     }
 
     return util.DoAndRetry(agentsCheckFunc, log, retrials, waitSeconds)
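As a small usage illustration of the combined state and its defaults (hypothetical helper and assumed import paths; the API itself is defined in the diff above): a hostname that neither endpoint reported falls back to `NewProcessState`, i.e. goal version -1 and a zero last ping, so it is reported as stale.

```go
package example

import (
	"fmt"

	"github.com/10gen/ops-manager-kubernetes/controllers/om"
	"github.com/10gen/ops-manager-kubernetes/controllers/operator/agents"
)

// inspectHostname is a hypothetical helper showing the defaulting behaviour of
// the combined state: a hostname missing from both OM endpoints falls back to
// NewProcessState, i.e. GoalVersionAchieved == -1 and a zero LastAgentPing,
// which makes IsStale() return true.
func inspectHostname(omConnection om.Connection, hostname string) error {
	state, err := agents.GetMongoDBClusterState(omConnection)
	if err != nil {
		return err
	}

	p := state.GetProcessState(hostname)
	fmt.Printf("process %q on %s: goalAchieved=%d (cluster goal %d), lastPing=%s, stale=%t, plan=%v\n",
		p.ProcessName, p.Hostname, p.GoalVersionAchieved, state.GoalVersion, p.LastAgentPing, p.IsStale(), p.Plan)
	return nil
}
```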
