Fix auth transition on edge-cases #321
MCK 1.3.0 Release Notes
Bug Fixes
Other Changes
Pull Request Overview
This PR fixes a race condition in authentication transitions by refining the readiness logic to correctly handle ongoing auth transitions. Previously, the system would mark clusters as "Running" too early when `LastGoalVersionAchieved == GoalVersion`, even when authentication transitions were still in progress, leading to process restarts with incorrect auth configurations.
Key changes:
- Added authentication transition detection in the automation status checker
- Introduced logic to wait for auth-related moves to complete before marking clusters as ready
- Added comprehensive test coverage for authentication transition scenarios
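The readiness rule described above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the struct field shapes, the `isReady` and `contains` helpers, and the contents of `authTransitionMoves` are assumptions made for the example. Only the names `AutomationStatus`, `Processes`, `Plan`, `GoalVersion`, `LastGoalVersionAchieved`, `isAuthenticationTransitionMove`, and the `WaitAuthCanUpdate` move appear in the PR itself.

```go
package main

// ProcessStatus and AutomationStatus are simplified stand-ins for the
// operator's types; the exact field shapes are assumptions.
type ProcessStatus struct {
	Name                    string
	LastGoalVersionAchieved int
	Plan                    []string // moves the agent still has planned
}

type AutomationStatus struct {
	GoalVersion int
	Processes   []ProcessStatus
}

// authTransitionMoves is a hypothetical lookup set. Only WaitAuthCanUpdate
// is named in the PR; the real helper may match other moves as well.
var authTransitionMoves = map[string]struct{}{
	"WaitAuthCanUpdate": {},
}

// isAuthenticationTransitionMove reports whether a move belongs to an
// authentication transition (assumed implementation).
func isAuthenticationTransitionMove(move string) bool {
	_, ok := authTransitionMoves[move]
	return ok
}

// contains reports whether list holds s.
func contains(list []string, s string) bool {
	for _, item := range list {
		if item == s {
			return true
		}
	}
	return false
}

// isReady (hypothetical name) returns true only if every relevant process
// has reached the goal version AND has no auth transition move in its plan.
func isReady(as *AutomationStatus, relevantProcesses []string) bool {
	for _, p := range as.Processes {
		if !contains(relevantProcesses, p.Name) {
			continue
		}
		if p.LastGoalVersionAchieved != as.GoalVersion {
			return false
		}
		for _, move := range p.Plan {
			if isAuthenticationTransitionMove(move) {
				return false
			}
		}
	}
	return true
}
```

With this shape, a process that reports the goal version but still lists `WaitAuthCanUpdate` in its plan keeps the cluster out of the "Running" state, which is the behaviour the PR description asks for.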
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| controllers/om/automation_status.go | Added authentication transition detection and isAuthenticationTransitionMove helper function |
| controllers/om/automation_status_test.go | Added comprehensive test cases for authentication transition scenarios |
| changelog/20250808_fix_fixing_auth_transition_edge_cases.md | Added changelog entry documenting the fix |
LGTM with a nit
@@ -113,17 +138,29 @@ func checkAutomationStatusIsGoal(as *AutomationStatus, relevantProcesses []strin
	}
}

// isAuthenticationTransitionMove returns true if the given move is related to authentication transitions
func isAuthenticationTransitionMove(move string) bool {
I really like the approach of moving those phase specific checks outside of the readiness probe where we can only afford to treat them with a very wide brush.
 func areAnyAgentsInKubeUpgradeMode(as *AutomationStatus, relevantProcesses []string, log *zap.SugaredLogger) bool {
 	for _, p := range as.Processes {
 		if !stringutil.Contains(relevantProcesses, p.Name) {
 			continue
 		}
-		for _, plan := range p.Plan {
+		for _, planStep := range p.Plan {
nit: In the code you introduced, you're ranging over `p.Plan` with `move`, and this is refactored to use `planStep` instead. Let's stay consistent (I think we can use `move` since automation uses the same terminology).
Summary
Add Not-Ready Handling for Ongoing Auth Transitions:
This patch refines our readiness logic to correctly reflect the state of authentication transitions. Previously, we treated LastGoalVersionAchieved == GoalVersion as a signal that the cluster was "Running", but this assumption breaks down when auth transitions are still in progress.
This happened because we returned "ready" during a wait step (WaitAuthCanUpdate) — and we generally return ready for all wait steps, regardless of whether auth is fully transitioned. Example status:
Why this is implemented in the operator and not the readinessProbe:
I fixed this in the operator rather than in the readinessProbe.
operator does
The core idea:
What happened in our tests:
- `node-0` completed its auth transition (now uses SCRAM instead of X509)
- Config server hasn't finished its auth transition yet
- `node-0` restarted with the old X509 config (see below comment from the agent code)
tl;dr: first `node-0` moved to the new auth, `config` not yet; `node-0` restarted, and during the restart `config` transitioned to the new auth while `node-0` is again running the old auth
Proof of Work
Checklist
`skip-changelog` label if not needed