Skip to content

Fix auth transition on edge-cases #321

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Open
wants to merge 7 commits into
base: master
Choose a base branch
from
Open

Fix auth transition on edge-cases #321

wants to merge 7 commits into from

Conversation

nammn
Copy link
Collaborator

@nammn nammn commented Aug 7, 2025

Summary

Add Not-Ready Handling for Ongoing Auth Transitions:
This patch refines our readiness logic to correctly reflect the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down when auth transitions are still in progress.
This happened because we returned "ready" during a wait step (WaitAuthCanUpdate) — and we generally return ready for all wait steps, regardless of whether auth is fully transitioned. Example status:

{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}

Why implemented in the operator and not readinessProbe:
I didn't fix the readinessProbe but rather the operator

  • if the readinessProbe blocks new nodes are not coming up
  • we want new nodes coming up
  • but we also want to block new configurations being applied, which the automation_status check in the
    operator does

The core idea:

  • Configuration applied ≠ transition fully complete.

What happened in our tests:

  • we update auth via CR x509 -> scram
  • node-0 completed its auth transition (now uses scram, instead of x509)
  • Config server hasn't finished its auth transition yet
  • We hit a race condition where clusters were marked as "Running" too early and thus continued the rolling restart of nod e-0
  • node-0 restarted with the old X509 config (see below comment from the agent code)
  • The X509 process couldn’t access the SCRAM automation user
  • Leads to Error: "process...doesn't have the automation user"
  • in the mms-automation there is also a comment; that indicates thats they are handling the edge-case if an auth transition was not successful, they start the process with old auth to "finish" it. But this is exactly what causes our race condition
	// If a process went down unexpectedly in the middle of an auth transition,
	// we want to restart it with the old auth args.
	// Otherwise, it could be upgraded to the new auth state too soon,
	// and not be able to communicate with other shard members.

tl;dr: first node-0 moved to new auth, config not yet, node-0 restarted and during the restart config transitioned to the new auth while node-0 is again running old auth

Proof of Work

  • auth change tests are passing multiple times in a row: Link - the most flaky auth tests + Link2 - from the patch

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added changelog file?

Copy link

github-actions bot commented Aug 7, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

Bug Fixes

  • This change fixes the current complex and difficult-to-maintain architecture for stateful set containers, which relies on an "agent matrix" to map operator and agent versions which led to a sheer amount of images.
  • We solve this by shifting to a 3-container setup. This new design eliminates the need for the operator-version/agent-version matrix by adding one additional container containing all required binaries. This architecture maps to what we already do with the mongodb-database container.
  • The agent returns ready if the cluster is ready to accept requests. The operator uses this information to continue operational actions like restarts.
  • This can be problematic during auth transitions. We can have a period where we invalidate one auth while the other is not activated yet and we try to use the not supported one

Other Changes

  • Optional permissions for PersistentVolumeClaim moved to a separate role. When managing the operator with Helm it is possible to disable permissions for PersistentVolumeClaim resources by setting operator.enablePVCResize value to false (true by default). When enabled, previously these permissions were part of the primary operator role. With this change, permissions have a separate role.
  • subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false as the value. We are removing this configuration option, making the operator roles always have subresource permissions. This setting was introduced as a temporary solution for this OpenShift issue. The issue has since been resolved and the setting is no longer needed.

@nammn nammn changed the title Fix scram 222 Fix auth transition on edge-cases Aug 8, 2025
@nammn nammn marked this pull request as ready for review August 8, 2025 08:59
@nammn nammn requested a review from a team as a code owner August 8, 2025 08:59
@nammn nammn requested review from m1kola, anandsyncs and Copilot August 8, 2025 08:59
Copy link

@Copilot Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull Request Overview

This PR fixes a race condition in authentication transitions by refining the readiness logic to correctly handle ongoing auth transitions. Previously, the system would mark clusters as "Running" too early when LastGoalVersionAchieved == GoalVersion, even when authentication transitions were still in progress, leading to process restarts with incorrect auth configurations.

Key changes:

  • Added authentication transition detection in the automation status checker
  • Introduced logic to wait for auth-related moves to complete before marking clusters as ready
  • Added comprehensive test coverage for authentication transition scenarios

Reviewed Changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.

File Description
controllers/om/automation_status.go Added authentication transition detection and isAuthenticationTransitionMove helper function
controllers/om/automation_status_test.go Added comprehensive test cases for authentication transition scenarios
changelog/20250808_fix_fixing_auth_transition_edge_cases.md Added changelog entry documenting the fix

Copy link
Member

@mircea-cosbuc mircea-cosbuc left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM with a nit

@@ -113,17 +138,29 @@ func checkAutomationStatusIsGoal(as *AutomationStatus, relevantProcesses []strin
}
}

// isAuthenticationTransitionMove returns true if the given move is related to authentication transitions
func isAuthenticationTransitionMove(move string) bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I really like the approach of moving those phase specific checks outside of the readiness probe where we can only afford to treat them with a very wide brush.

func areAnyAgentsInKubeUpgradeMode(as *AutomationStatus, relevantProcesses []string, log *zap.SugaredLogger) bool {
for _, p := range as.Processes {
if !stringutil.Contains(relevantProcesses, p.Name) {
continue
}
for _, plan := range p.Plan {
for _, planStep := range p.Plan {
Copy link
Member

@mircea-cosbuc mircea-cosbuc Aug 11, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: In the code you introduced, you're ranging over p.Plan with move and this is refactored to use planStep instead. Let's stay consistent (I think we can use move since automation uses the same terminology).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants