# Summary
**Add Not-Ready Handling for Ongoing Auth Transitions**:
This patch refines our readiness logic to correctly reflect the state of
authentication transitions. Previously, we treated
`LastGoalVersionAchieved == GoalVersion` as a signal that the cluster was
"Running", but this assumption breaks down when auth transitions are
still in progress.
This happened because we returned "ready" during a wait step
(WaitAuthCanUpdate) — and [we generally return ready for all wait
steps](https://github.com/mongodb/mongodb-kubernetes/blob/f0050b8942545701e8cb9e42d54d14f0cb58ee6a/mongodb-community-operator/cmd/readiness/main.go#L139),
regardless of whether auth is fully transitioned. Example status:
```
{
  "step": "WaitAuthUpdate",
  "stepDoc": "Wait to update Auth",
  "isWaitStep": true,
  "started": "2025-08-07T14:59:40.213178437Z",
  "attempts": 512,
  "latestAttempt": "2025-08-07T15:09:20.966699961Z",
  "completed": null,
  "result": "wait"
}
```
**Why implemented in the operator and not readinessProbe**:
I fixed this in the operator rather than in the readinessProbe:
* if the readinessProbe blocks, new nodes do not come up
* we want new nodes to come up
* but we also want to block new configurations from being applied, which the
automation_status check in the
operator does
**The core idea:**
* Configuration applied ≠ transition fully complete.
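
The idea can be sketched in Go as follows. This is a simplified model under stated assumptions, not the operator's actual code: `processState`, `clusterPhase`, the `AuthTransitionPending` flag, and the "Pending" phase name are all hypothetical. Only the `LastGoalVersionAchieved == GoalVersion` comparison and the "Running" state come from the description above.

```go
package main

import "fmt"

// processState is a hypothetical per-process view of the automation status.
type processState struct {
	Name                    string
	LastGoalVersionAchieved int
	AuthTransitionPending   bool // assumed flag: an auth wait step is still open
}

// clusterPhase returns "Running" only if every process reached the goal
// version AND no process is still in the middle of an auth transition.
func clusterPhase(goalVersion int, procs []processState) string {
	for _, p := range procs {
		if p.LastGoalVersionAchieved != goalVersion || p.AuthTransitionPending {
			return "Pending"
		}
	}
	return "Running"
}

func main() {
	procs := []processState{
		{Name: "node-0", LastGoalVersionAchieved: 5},
		{Name: "config", LastGoalVersionAchieved: 5, AuthTransitionPending: true},
	}
	// The goal version matches everywhere, but one auth transition is still open.
	fmt.Println(clusterPhase(5, procs)) // prints "Pending"

	procs[1].AuthTransitionPending = false
	fmt.Println(clusterPhase(5, procs)) // prints "Running"
}
```

Checking the version alone would report "Running" in the first case, which is the premature state this patch is meant to avoid.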
**What happened in our tests**:
* we update auth via CR x509 -> scram
* `node-0` completed its auth transition (now uses scram, instead of
x509)
* `Config server` hasn't finished its auth transition yet
* We hit a race condition where clusters were marked as "Running" too
early and thus continued the rolling restart of `node-0`
* `node-0` restarted with the old X509 config (see the comment from the
agent code below)
* The X509 process couldn’t access the SCRAM automation user
* This leads to the error: "process...doesn't have the automation user"
- mms-automation also contains a comment indicating that it handles
the edge case of an unsuccessful auth transition by starting the
process with the old auth to "finish" it. But
this is exactly what causes our race condition
```
// If a process went down unexpectedly in the middle of an auth transition,
// we want to restart it with the old auth args.
// Otherwise, it could be upgraded to the new auth state too soon,
// and not be able to communicate with other shard members.
```
tl;dr: first `node-0` moved to new auth, `config` not yet, `node-0`
restarted and during the restart `config` transitioned to the new auth
while `node-0` is again running old auth
## Proof of Work
- auth change tests are passing multiple times in a row:
[Link](http://spruce.mongodb.com/version/6894b98218a2e90007437e99/tasks?sorts=STATUS%3AASC%3BBASE_STATUS%3ADESC)
- the most flaky auth tests +
[Link2](https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_static_mdb_kind_ubi_cloudqa_e2e_sharded_cluster_x509_to_scram_transition_patch_b29fb4ace63eec7102f8f034fd6c553b5d75c1a1_6894c0785c119f0007a58f3c_25_08_07_15_04_26/logs?execution=0)
- from the patch
## Checklist
- [ ] Have you linked a jira ticket and/or is the ticket in the title?
- [x] Have you checked whether your jira ticket required DOCSP changes?
- [x] Have you added changelog file?
- use `skip-changelog` label if not needed
- refer to [Changelog files and Release
Notes](https://github.com/mongodb/mongodb-kubernetes/blob/master/CONTRIBUTING.md#changelog-files-and-release-notes)
section in CONTRIBUTING.md for more details
* Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.