Fix auth transition on edge-cases #321

nammn · 2025-08-07T15:04:22Z

Summary

Add Not-Ready Handling for Ongoing Auth Transitions:
This patch refines our readiness logic to correctly reflect the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down when auth transitions are still in progress.
This happened because we returned "ready" during a wait step (WaitAuthCanUpdate), and we generally return ready for all wait steps, regardless of whether auth has fully transitioned. Example status:

{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}
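
A minimal sketch of how such a status entry could be classified, not the code from this patch. The struct `planStepStatus` and the helper `isPendingAuthWait` are hypothetical names, and matching on the substring "auth" in the step name is an assumption made only for illustration:

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// planStepStatus mirrors only the fields from the example status above
// that this sketch needs. The type name is hypothetical.
type planStepStatus struct {
	Step       string  `json:"step"`
	IsWaitStep bool    `json:"isWaitStep"`
	Completed  *string `json:"completed"` // JSON null decodes to nil
	Result     string  `json:"result"`
}

// isPendingAuthWait reports whether the step is an uncompleted wait step
// whose name mentions auth. The substring match is an assumed heuristic.
func isPendingAuthWait(s planStepStatus) bool {
	return s.IsWaitStep &&
		s.Completed == nil &&
		strings.Contains(strings.ToLower(s.Step), "auth")
}

func main() {
	raw := `{
	  "step": "WaitAuthUpdate",
	  "stepDoc": "Wait to update Auth",
	  "isWaitStep": true,
	  "attempts": 512,
	  "completed": null,
	  "result": "wait"
	}`

	var s planStepStatus
	if err := json.Unmarshal([]byte(raw), &s); err != nil {
		panic(err)
	}
	fmt.Println("pending auth wait:", isPendingAuthWait(s))
}
```

Under these assumptions, a step like the one above would count as "not ready" even though it is a wait step.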

Why implemented in the operator and not readinessProbe:
I fixed this in the operator rather than in the readinessProbe:

- If the readinessProbe blocks, new nodes do not come up.
- We want new nodes to come up.
- But we also want to block new configurations from being applied, which the automation_status check in the operator does.

The core idea:

Configuration applied ≠ transition fully complete.
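
As a rough sketch of that idea (not the actual operator code): readiness requires both that the goal version is reached and that no auth-related move is still in the plan. The `process` type, the `authMoves` set, and `clusterReady` are hypothetical names; the only move listed is the one from the example status above.

```go
package main

import "fmt"

// process is a simplified, hypothetical stand-in for one entry of the
// automation status.
type process struct {
	Name                    string
	LastGoalVersionAchieved int
	Plan                    []string // moves still listed for this process
}

// authMoves is an assumed set of auth-related moves. Only the move seen in
// the example status is included; the real list may differ.
var authMoves = map[string]bool{
	"WaitAuthUpdate": true,
}

// clusterReady returns true only if every process reached the goal version
// AND none of them still has an auth-related move in its plan.
func clusterReady(goalVersion int, procs []process) bool {
	for _, p := range procs {
		if p.LastGoalVersionAchieved != goalVersion {
			return false
		}
		for _, move := range p.Plan {
			if authMoves[move] {
				return false
			}
		}
	}
	return true
}

func main() {
	procs := []process{
		{Name: "node-0", LastGoalVersionAchieved: 5},
		{Name: "config-0", LastGoalVersionAchieved: 5, Plan: []string{"WaitAuthUpdate"}},
	}
	// Goal version matches everywhere, but config-0 is still mid-transition.
	fmt.Println("ready:", clusterReady(5, procs))
}
```

The version comparison alone would report this cluster as ready; the extra plan check is what keeps it "not ready" until the config server finishes.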

What happened in our tests:

We update auth via the CR: x509 -> scram.
node-0 completed its auth transition (it now uses SCRAM instead of X509).
The config server hasn't finished its auth transition yet.
We hit a race condition where clusters were marked as "Running" too early, so the rolling restart of node-0 continued.
node-0 restarted with the old X509 config (see the comment from the agent code below).
The X509 process couldn't access the SCRAM automation user.
This leads to the error: "process...doesn't have the automation user"

In mms-automation there is also a comment indicating that they handle the edge case where an auth transition was not successful: they start the process with the old auth to "finish" it. But this is exactly what causes our race condition:

	// If a process went down unexpectedly in the middle of an auth transition,
	// we want to restart it with the old auth args.
	// Otherwise, it could be upgraded to the new auth state too soon,
	// and not be able to communicate with other shard members.

tl;dr: node-0 moved to the new auth first while the config server had not yet. node-0 then restarted, and during the restart the config server transitioned to the new auth while node-0 was again running the old auth.

Proof of Work

auth change tests are passing multiple times in a row: Link - the most flaky auth tests + Link2 - from the patch

Checklist

Have you linked a jira ticket and/or is the ticket in the title?
Have you checked whether your jira ticket required DOCSP changes?
Have you added changelog file?
- use skip-changelog label if not needed
- refer to Changelog files and Release Notes section in CONTRIBUTING.md for more details

github-actions · 2025-08-07T15:05:17Z

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

Bug Fixes

This change fixes the current complex and difficult-to-maintain architecture for stateful set containers, which relies on an "agent matrix" to map operator and agent versions and which led to a large number of images.
We solve this by shifting to a 3-container setup. This new design eliminates the need for the operator-version/agent-version matrix by adding one additional container containing all required binaries. This architecture maps to what we already do with the mongodb-database container.
The agent returns ready if the cluster is ready to accept requests. The operator uses this information to continue operational actions like restarts.
This can be problematic during auth transitions: there can be a period where one auth mechanism is already invalidated while the other is not yet activated, and we try to use the unsupported one.

Other Changes

Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm it is possible to disable permissions for PersistentVolumeClaim resources by setting operator.enablePVCResize value to false (true by default). When enabled, previously these permissions were part of the primary operator role. With this change, permissions have a separate role.
subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, making the operator roles always have subresource permissions. This setting was introduced as a temporary solution for this OpenShift issue. The issue has since been resolved and the setting is no longer needed.

Copilot

Pull Request Overview

This PR fixes a race condition in authentication transitions by refining the readiness logic to correctly handle ongoing auth transitions. Previously, the system would mark clusters as "Running" too early when LastGoalVersionAchieved == GoalVersion, even when authentication transitions were still in progress, leading to process restarts with incorrect auth configurations.

Key changes:

Added authentication transition detection in the automation status checker
Introduced logic to wait for auth-related moves to complete before marking clusters as ready
Added comprehensive test coverage for authentication transition scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File	Description
controllers/om/automation_status.go	Added authentication transition detection and `isAuthenticationTransitionMove` helper function
controllers/om/automation_status_test.go	Added comprehensive test cases for authentication transition scenarios
changelog/20250808_fix_fixing_auth_transition_edge_cases.md	Added changelog entry documenting the fix

mircea-cosbuc

LGTM with a nit

mircea-cosbuc · 2025-08-11T15:30:19Z

controllers/om/automation_status.go

@@ -113,17 +138,29 @@ func checkAutomationStatusIsGoal(as *AutomationStatus, relevantProcesses []strin
 	}
 }

+// isAuthenticationTransitionMove returns true if the given move is related to authentication transitions
+func isAuthenticationTransitionMove(move string) bool {


I really like the approach of moving those phase specific checks outside of the readiness probe where we can only afford to treat them with a very wide brush.

mircea-cosbuc · 2025-08-11T15:32:01Z

controllers/om/automation_status.go

 func areAnyAgentsInKubeUpgradeMode(as *AutomationStatus, relevantProcesses []string, log *zap.SugaredLogger) bool {
 	for _, p := range as.Processes {
 		if !stringutil.Contains(relevantProcesses, p.Name) {
 			continue
 		}
-		for _, plan := range p.Plan {
+		for _, planStep := range p.Plan {


nit: In the code you introduced, you range over p.Plan with move, while this loop is refactored to use planStep instead. Let's stay consistent (I think we can use move, since automation uses the same terminology).

nammn added 3 commits August 7, 2025 17:01

try out auth detection

d0c2fb5

refactor auth

a1ae41d

refactor auth

9d42913

linter

54dd7d2

nammn changed the title ~~Fix scram 222~~ Fix auth transition on edge-cases Aug 8, 2025

fix auth transition

2894667

nammn marked this pull request as ready for review August 8, 2025 08:59

nammn requested a review from a team as a code owner August 8, 2025 08:59

nammn requested review from m1kola, anandsyncs and Copilot August 8, 2025 08:59

Copilot AI reviewed Aug 8, 2025

nammn requested review from mircea-cosbuc, viveksinghggits, Julien-Ben and lucian-tosa August 8, 2025 09:12

nammn added 2 commits August 11, 2025 10:16

try update auth

b4fc956

fix unit test

f75b01c

mircea-cosbuc requested changes Aug 11, 2025
