Conversation

@nammn nammn commented Aug 28, 2025

Summary

fixes:

  • e2e_sharded_cluster_scram_sha_256_user_connectivity
  • e2e_replica_set_scram_sha_256_user_connectivity

Both were frequently failing on master merges on OpenShift.
Proof of work shows each of them passing 3x in a row.

Reliability improvements for authentication tests:

  • Added a _wait_for_mongodbuser_reconciliation function to ensure all MongoDBUser resources reach the "Updated" phase before authentication attempts, preventing race conditions after user/password changes. This function is now called at the start of all authentication assertion methods

  • Increased the default number of authentication attempts to 50 across all relevant methods

    • Some of the tests and their logs showed that we update the secret and the reconcile is sometimes slow to pick it up. Meanwhile we already start the auth verification, whose default budget is around 100s. That is too short and racy. In one log the reconciliation of the user took around 1m while the auth verification had already started, so the total run time exceeded the 100s timeout
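The enlarged retry budget can be sketched as follows; `assert_authentication_succeeds`, `check`, and `interval` are illustrative stand-ins, not the actual helpers from the test suite:

```python
import time


def assert_authentication_succeeds(check, attempts: int = 50, interval: float = 5.0) -> None:
    """Retry an authentication check until it succeeds or the attempts run out.

    `check` is any zero-argument callable that raises on auth failure.
    With attempts=50 and interval=5.0 the total budget is roughly 250s,
    comfortably above the ~1m a slow user reconciliation can take.
    """
    last_error = None
    for _ in range(attempts):
        try:
            check()
            return
        except Exception as e:  # the real tests catch the driver's auth error here
            last_error = e
            time.sleep(interval)
    raise AssertionError(f"authentication did not succeed after {attempts} attempts") from last_error
```

The point of the change is just the arithmetic: 20 attempts at a 5s interval gives only ~100s, which a single slow reconcile can eat entirely.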

Error handling and diagnostics:

  • Fixed a logging error: we were passing a `msg` keyword argument that does not exist, which caused a panic when a test failed and masked the original error

  • Enhanced diagnostics collection in tests/conftest.py to also save the automation config JSON for each project when tests fail, aiding post-mortem analysis.

Proof of Work

  • green ci
  • green openshift tests (passed multiple times in a row)

Checklist

  • Have you linked a jira ticket and/or is the ticket in the title?
  • Have you checked whether your jira ticket required DOCSP changes?
  • Have you added a changelog file?

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.3.0 Release Notes

New Features

Multi-Architecture Support

We've added comprehensive multi-architecture support for the Kubernetes operator. This enhancement enables deployment on IBM Power (ppc64le) and IBM Z (s390x) architectures alongside the
existing x86_64 support. Core images (operator, agent, init containers, database, readiness probe) now support multiple architectures. We do not add IBM or ARM support for Ops Manager and the init-ops-manager image.

  • MongoDB Agent images have been migrated to a new container repository: quay.io/mongodb/mongodb-agent.
    • The agents in the new repository support the x86-64, ARM64, s390x, and ppc64le architectures. More can be read in the public docs.
    • An operator running MCK >= 1.3.0 with static containers cannot use the agent images from the old container repository quay.io/mongodb/mongodb-agent-ubi.
  • quay.io/mongodb/mongodb-agent-ubi should no longer be used; it remains only for backwards compatibility.

Bug Fixes

  • This change fixes the complex and difficult-to-maintain architecture for stateful set containers, which relied on an "agent matrix" to map operator and agent versions and led to a sheer number of images.
  • We solve this by shifting to a 3-container setup. The new design eliminates the need for the operator-version/agent-version matrix by adding one additional container that holds all required binaries. This architecture mirrors what we already do with the mongodb-database container.
  • Fixed an issue where the readiness probe reported the node as ready even when its authentication mechanism was not in sync with the other nodes, potentially causing premature restarts.

Other Changes

  • Optional permissions for PersistentVolumeClaim resources moved to a separate role. When managing the operator with Helm, it is possible to disable permissions for PersistentVolumeClaim resources by setting the operator.enablePVCResize value to false (true by default). Previously, when enabled, these permissions were part of the primary operator role; with this change they live in a separate role.
  • The subresourceEnabled Helm value was removed. This setting used to be true by default and made it possible to exclude subresource permissions from the operator role by specifying false. We are removing this option, so the operator roles always include subresource permissions. The setting was introduced as a temporary workaround for this OpenShift issue; the issue has since been resolved and the setting is no longer needed.
  • We have deliberately not published the container images for Ops Manager versions 7.0.16, 8.0.8, 8.0.9, and 8.0.10 due to a bug in Ops Manager that prevents MCK customers from upgrading their Ops Manager deployments to those versions.

@nammn nammn changed the title Fix openshift tests CLOUDP-316922 - Fix auth tests like some run on busy clusters as in openshift tests Aug 28, 2025
@nammn nammn added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Aug 28, 2025
@nammn nammn changed the title CLOUDP-316922 - Fix auth tests like some run on busy clusters as in openshift tests CLOUDP-316922 - Fix racy and slow auth tests like in openshift clusters Aug 29, 2025
@@ -76,6 +77,63 @@ def fetch(self, context: OIDCCallbackContext) -> OIDCCallbackResult:
return OIDCCallbackResult(access_token=u.id_token)


def _wait_for_mongodbuser_reconciliation() -> None:
@nammn (Collaborator, author) commented:

This is a kind of catch-all approach. Instead, I could take the time to find all the places where we update the secret/user and add the wait there, but I don't know where they all are and it might turn into a rabbit chase. I think that would be the correct approach, but I'd rather track it as a dedicated item instead.
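A generic version of such a catch-all wait might look like the sketch below; the real helper lists MongoDBUser resources through kubetester, so `get_phase` here is a placeholder for that lookup:

```python
import time


def wait_for_phase(get_phase, expected: str = "Updated", timeout: float = 300.0, interval: float = 5.0) -> None:
    """Poll until a resource reports the expected phase, or give up after `timeout` seconds.

    `get_phase` is a zero-argument callable returning the current phase string.
    """
    deadline = time.monotonic() + timeout
    while True:
        if get_phase() == expected:
            return
        if time.monotonic() >= deadline:
            raise TimeoutError(f"resource did not reach phase {expected!r} within {timeout}s")
        time.sleep(interval)
```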


def assert_scram_sha_authentication_fails(
self,
username: str,
password: str,
retries: int = 20,
attempts: int = 50,
@nammn (Collaborator, author) commented:

A lot of auth runs were really close to the 1m (20 × 5s) mark; that is not a good timeout.

@nammn nammn force-pushed the fix-openshift-tests branch from 525b3a1 to 3a66cf5 Compare August 29, 2025 08:53

if exitstatus != 0:
try:
automation_config_tester = tester.get_automation_config_tester()
@nammn (Collaborator, author) commented:

A lot of the time the automation config would help a lot in debugging, especially in those auth-change cases.

@nammn nammn marked this pull request as ready for review August 29, 2025 09:23
@nammn nammn requested a review from a team as a code owner August 29, 2025 09:23
@nammn nammn requested review from fealebenpae and m1kola August 29, 2025 09:23
@@ -1029,6 +1029,8 @@ def assert_reaches_phase(
# This can be an intermediate error, right before we check for this secret we create it.
# The cluster might just be slow
"failed to locate the api key secret",
# etcd might be slow
"etcdserver: request timed out",
@nammn (Collaborator, author) commented:

The reason the multi-reconcile test keeps failing is a large amount of concurrent AppDB creation of ConfigMaps (the automation config).

https://spruce.mongodb.com/task/mongodb_kubernetes_e2e_operator_race_ubi_with_telemetry_e2e_om_reconcile_race_with_telemetry_patch_dd5f1d83bb18ee6c258effa7ef18f6b1841f1cc6_68b16aa2a26fc80007842f79_25_08_29_08_53_56/files?execution=1&sorts=STATUS%3AASC

("Error creating automation config map in cluster __default: etcdserver: request timed out")
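The allowlist addition in the diff boils down to substring matching on error messages; a minimal sketch, assuming the real check lives inside `assert_reaches_phase` with its own data structures:

```python
# Messages that indicate a transient condition on a busy cluster; a phase
# check that sees one of these keeps waiting instead of failing immediately.
INTERMEDIATE_ERRORS = (
    "failed to locate the api key secret",  # the secret may simply not exist yet
    "etcdserver: request timed out",        # etcd might be slow
)


def is_intermediate_error(message: str) -> bool:
    """Return True if the error message matches a known transient failure."""
    return any(fragment in message for fragment in INTERMEDIATE_ERRORS)
```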

@nammn nammn merged commit 6b4107d into master Aug 29, 2025
7 of 8 checks passed
@nammn nammn deleted the fix-openshift-tests branch August 29, 2025 11:38