Skip to content

Commit 6075072

Browse files
authored
Fix auth transition on edge-cases (#321)
# Summary **Add Not-Ready Handling for Ongoing Auth Transitions**: This patch refines our readiness logic to correctly reflect the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down when auth transitions are still in progress. This happened because we returned "ready" during a wait step (WaitAuthCanUpdate) — and [we generally return ready for all wait steps](https://github.com/mongodb/mongodb-kubernetes/blob/f0050b8942545701e8cb9e42d54d14f0cb58ee6a/mongodb-community-operator/cmd/readiness/main.go#L139), regardless of whether auth is fully transitioned. Example status: ``` { "step": "WaitAuthUpdate", "stepDoc": "Wait to update Auth", "isWaitStep": true, "started": "2025-08-07T14:59:40.213178437Z", "attempts": 512, "latestAttempt": "2025-08-07T15:09:20.966699961Z", "completed": null, "result": "wait" } ``` **Why implemented in the operator and not readinessProbe**: I didn't fix the readinessProbe but rather the operator * if the readinessProbe blocks new nodes are not coming up * we want new nodes coming up * but we also want to block new configurations being applied, which the automation_status check in the operator does **The core idea:** * Configuration applied ≠ transition fully complete. **What happened in our tests**: * we update auth via CR x509 -> scram * `node-0` completed its auth transition (now uses scram, instead of x509) * `Config server` hasn't finished its auth transition yet * We hit a race condition where clusters were marked as "Running" too early and thus continued the rolling restart of `nod e-0` * `node-0` restarted with the old X509 config (see below comment from the agent code) * The X509 process couldn’t access the SCRAM automation user * Leads to Error: "process...doesn't have the automation user" - in the mms-automation there is also a comment; that indicates thats they are handling the edge-case if an auth transition was not successful, they start the process with old auth to "finish" it. But this is exactly what causes our race condition ``` // If a process went down unexpectedly in the middle of an auth transition, // we want to restart it with the old auth args. // Otherwise, it could be upgraded to the new auth state too soon, // and not be able to communicate with other shard members. ``` tl;dr: first `node-0` moved to new auth, `config` not yet, `node-0` restarted and during the restart `config` transitioned to the new auth while `node-0` is again running old auth ## Proof of Work - auth change tests are passing multiple times in a row: [Link](http://spruce.mongodb.com/version/6894b98218a2e90007437e99/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC) - the most flaky auth tests + [Link2](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_static_mdb_kind_ubi_cloudqa_e2e_sharded_cluster_x509_to_scram_transition_patch_b29fb4ace63eec7102f8f034fd6c553b5d75c1a1_6894c0785c119f0007a58f3c_25_08_07_15_04_26/logs?execution=0) - from the patch ## Checklist - [ ] Have you linked a jira ticket and/or is the ticket in the title? - [x] Have you checked whether your jira ticket required DOCSP changes? - [x] Have you added changelog file? - use `skip-changelog` label if not needed - refer to [Changelog files and Release Notes](https://github.com/mongodb/mongodb-kubernetes/blob/master/CONTRIBUTING.md#changelog-files-and-release-notes) section in CONTRIBUTING.md for more details
1 parent 6f85608 commit 6075072

File tree

3 files changed

+177
-4
lines changed

3 files changed

+177
-4
lines changed
Lines changed: 7 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,7 @@
1+
---
2+
title: Fixing auth transition edge-cases
3+
kind: fix
4+
date: 2025-08-08
5+
---
6+
7+
* Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.

controllers/om/automation_status.go

Lines changed: 40 additions & 3 deletions
Original file line numberDiff line numberDiff line change
@@ -15,7 +15,7 @@ import (
1515
"github.com/mongodb/mongodb-kubernetes/pkg/util/stringutil"
1616
)
1717

18-
const automationAgentKubeUpgradePlan = "ChangeVersionKube"
18+
const automationAgentKubeUpgradeMove = "ChangeVersionKube"
1919

2020
// AutomationStatus represents the status of automation agents registered with Ops Manager
2121
type AutomationStatus struct {
@@ -85,12 +85,25 @@ func checkAutomationStatusIsGoal(as *AutomationStatus, relevantProcesses []strin
8585

8686
goalsNotAchievedMap := map[string]int{}
8787
goalsAchievedMap := map[string]int{}
88+
authTransitionsInProgress := map[string]string{}
89+
8890
for _, p := range as.Processes {
8991
if !stringutil.Contains(relevantProcesses, p.Name) {
9092
continue
9193
}
9294
if p.LastGoalVersionAchieved == as.GoalVersion {
9395
goalsAchievedMap[p.Name] = p.LastGoalVersionAchieved
96+
97+
// Check if authentication transitions are in the current plan.
98+
// If a process has reached goal version but still has auth-related moves in plan,
99+
// it means authentication transition is likely in progress.
100+
// The plan contains non-completed move names from the API.
101+
for _, move := range p.Plan {
102+
if isAuthenticationTransitionMove(move) {
103+
authTransitionsInProgress[p.Name] = move
104+
break
105+
}
106+
}
94107
} else {
95108
goalsNotAchievedMap[p.Name] = p.LastGoalVersionAchieved
96109
}
@@ -103,6 +116,18 @@ func checkAutomationStatusIsGoal(as *AutomationStatus, relevantProcesses []strin
103116
goalsAchievedMsgList := slices.Collect(maps.Keys(goalsAchievedMap))
104117
sort.Strings(goalsAchievedMsgList)
105118

119+
// Check if any authentication transitions are in progress
120+
if len(authTransitionsInProgress) > 0 {
121+
var authTransitionMsgList []string
122+
for processName, step := range authTransitionsInProgress {
123+
authTransitionMsgList = append(authTransitionMsgList, fmt.Sprintf("%s:%s", processName, step))
124+
}
125+
log.Infow("Authentication transitions still in progress, waiting for completion",
126+
"processes", authTransitionMsgList)
127+
return false, fmt.Sprintf("authentication transitions in progress for %d processes: %s",
128+
len(authTransitionsInProgress), authTransitionMsgList)
129+
}
130+
106131
if len(goalsNotAchievedMap) > 0 {
107132
return false, fmt.Sprintf("%d processes waiting to reach automation config goal state (version=%d): %s, %d processes reached goal state: %s",
108133
len(goalsNotAchievedMap), as.GoalVersion, goalsNotAchievedMsgList, len(goalsAchievedMsgList), goalsAchievedMsgList)
@@ -113,17 +138,29 @@ func checkAutomationStatusIsGoal(as *AutomationStatus, relevantProcesses []strin
113138
}
114139
}
115140

141+
// isAuthenticationTransitionMove returns true if the given move is related to authentication transitions
142+
func isAuthenticationTransitionMove(move string) bool {
143+
authMoves := map[string]struct{}{
144+
"UpdateAuth": {},
145+
"WaitAuthUpdate": {},
146+
}
147+
148+
_, ok := authMoves[move]
149+
150+
return ok
151+
}
152+
116153
func areAnyAgentsInKubeUpgradeMode(as *AutomationStatus, relevantProcesses []string, log *zap.SugaredLogger) bool {
117154
for _, p := range as.Processes {
118155
if !stringutil.Contains(relevantProcesses, p.Name) {
119156
continue
120157
}
121-
for _, plan := range p.Plan {
158+
for _, move := range p.Plan {
122159
// This means the following:
123160
// - the cluster is in static architecture
124161
// - the agents are in a dedicated upgrade process, waiting for their binaries to be replaced by kubernetes
125162
// - this can only happen if the statefulset is ready, therefore we are returning ready here
126-
if plan == automationAgentKubeUpgradePlan {
163+
if move == automationAgentKubeUpgradeMove {
127164
log.Debug("cluster is in changeVersionKube mode, returning the agent is ready.")
128165
return true
129166
}

controllers/om/automation_status_test.go

Lines changed: 130 additions & 1 deletion
Original file line numberDiff line numberDiff line change
@@ -75,7 +75,7 @@ func TestCheckAutomationStatusIsGoal(t *testing.T) {
7575
},
7676
{
7777
Name: "b",
78-
Plan: []string{"FCV", automationAgentKubeUpgradePlan},
78+
Plan: []string{"FCV", automationAgentKubeUpgradeMove},
7979
LastGoalVersionAchieved: 1,
8080
},
8181
},
@@ -119,3 +119,132 @@ func TestCheckAutomationStatusIsGoal(t *testing.T) {
119119
})
120120
}
121121
}
122+
123+
func TestCheckAutomationStatusIsGoal_AuthenticationTransitions(t *testing.T) {
124+
logger := zap.NewNop().Sugar()
125+
126+
tests := []struct {
127+
name string
128+
automationStatus *AutomationStatus
129+
relevantProcesses []string
130+
expectedReady bool
131+
expectedMessage string
132+
}{
133+
{
134+
name: "should wait for UpdateAuth move to complete",
135+
automationStatus: &AutomationStatus{
136+
GoalVersion: 5,
137+
Processes: []ProcessStatus{
138+
{
139+
Name: "rs0_0",
140+
LastGoalVersionAchieved: 5,
141+
Plan: []string{"UpdateAuth"},
142+
},
143+
},
144+
},
145+
relevantProcesses: []string{"rs0_0"},
146+
expectedReady: false,
147+
expectedMessage: "authentication transitions in progress for 1 processes",
148+
},
149+
{
150+
name: "should be ready when authentication transitions are complete",
151+
automationStatus: &AutomationStatus{
152+
GoalVersion: 5,
153+
Processes: []ProcessStatus{
154+
{
155+
Name: "rs0_0",
156+
LastGoalVersionAchieved: 5,
157+
Plan: []string{}, // Empty plan means all moves completed
158+
},
159+
},
160+
},
161+
relevantProcesses: []string{"rs0_0"},
162+
expectedReady: true,
163+
expectedMessage: "processes that reached goal state: [rs0_0]",
164+
},
165+
{
166+
name: "should wait for multiple processes with auth transitions",
167+
automationStatus: &AutomationStatus{
168+
GoalVersion: 7,
169+
Processes: []ProcessStatus{
170+
{
171+
Name: "rs0_0",
172+
LastGoalVersionAchieved: 7,
173+
Plan: []string{}, // This process completed
174+
},
175+
{
176+
Name: "rs0_1",
177+
LastGoalVersionAchieved: 7,
178+
Plan: []string{"WaitAuthUpdate"}, // Auth-related move in progress
179+
},
180+
},
181+
},
182+
relevantProcesses: []string{"rs0_0", "rs0_1"},
183+
expectedReady: false,
184+
expectedMessage: "authentication transitions in progress for 1 processes",
185+
},
186+
{
187+
name: "should ignore non-authentication moves in progress",
188+
automationStatus: &AutomationStatus{
189+
GoalVersion: 4,
190+
Processes: []ProcessStatus{
191+
{
192+
Name: "rs0_0",
193+
LastGoalVersionAchieved: 4,
194+
Plan: []string{"SomeOtherMove"}, // Non-auth move
195+
},
196+
},
197+
},
198+
relevantProcesses: []string{"rs0_0"},
199+
expectedReady: true,
200+
expectedMessage: "processes that reached goal state: [rs0_0]",
201+
},
202+
}
203+
204+
for _, tt := range tests {
205+
t.Run(tt.name, func(t *testing.T) {
206+
ready, message := checkAutomationStatusIsGoal(
207+
tt.automationStatus,
208+
tt.relevantProcesses,
209+
logger,
210+
)
211+
212+
assert.Equal(t, tt.expectedReady, ready, "Ready state should match expected")
213+
assert.Contains(t, message, tt.expectedMessage, "Message should contain expected text")
214+
215+
if tt.expectedReady {
216+
t.Logf("✅ Process correctly marked as ready: %s", message)
217+
} else {
218+
t.Logf("⏳ Process correctly waiting for auth transition: %s", message)
219+
}
220+
})
221+
}
222+
}
223+
224+
func TestIsAuthenticationTransitionMove(t *testing.T) {
225+
authMoves := []string{
226+
"UpdateAuth",
227+
"WaitAuthUpdate",
228+
}
229+
230+
nonAuthMoves := []string{
231+
"SomeOtherMove",
232+
"CreateIndex",
233+
"DropCollection",
234+
"BackupDatabase",
235+
}
236+
237+
for _, move := range authMoves {
238+
t.Run("auth_move_"+move, func(t *testing.T) {
239+
assert.True(t, isAuthenticationTransitionMove(move),
240+
"Move %s should be recognized as authentication transition", move)
241+
})
242+
}
243+
244+
for _, move := range nonAuthMoves {
245+
t.Run("non_auth_move_"+move, func(t *testing.T) {
246+
assert.False(t, isAuthenticationTransitionMove(move),
247+
"Move %s should not be recognized as authentication transition", move)
248+
})
249+
}
250+
}

0 commit comments

Comments
 (0)