Fix flakes in integration suite #991

Zerpet · 2025-05-22T12:08:15Z

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Note to contributors: remember to re-generate client set if there are any API changes

Summary Of Changes

Fix deletion policy tests
Use go run for ginkgo CLI
Refactor controller tests
Fix vhost controller test flakes

Additional Context

The "retain" tests were had a flaky behaviour. Sometimes they were passing due to a flake. The "retain" tests were
creating and deleting their topology objects one call right after
the other, without asserting that the object was created (from etcd PoV). When the object deletion logic started
"too soon", before etcd recorded the object creation, the tests were passing because the controller has a logic
to skip deletion if the object is not found. They were also very sensible to test pollution (which was already
present, it was not introduced by the "retain" contribution). Other controllers could overwrite the retain policy
to its default delete, causing the "retain" tests to rightfuly fail.

To fix the test pollution, the refactor changed the manager setup, to use label selectors for the controller
informer cache. In addition to the existing namespace filtering. Due to this change, the manager has to start
later in the test setup, because the tests use the topology object name as label, and this value is not known
to the setup logic until its very close to the test.

Finally, the last remain of test pollution was in the vhost "limits" test. These tests left behind a variable
that is always used in the vhost object initialisation. When the "limits" tests ran before the "delete" tests,
these would cause a flake. This did not always happen because we (correctly) randomise the test execution order
with ginkgo.

The deletion policy was not being set at all for tests. Why? Because BeforeEach nodes run BEFORE the JustBeforeEach. Therefore, the BeforeEach was setting the deletion policy in the spec, and right after the JustAfterEach in the top container was assigning a "zero" value for the topology type. This effectively "overrode" the value set for deletion policy. This was probably passing because Create/Delete were called one after the other, and sometimes the deletion behaviour simply skipped because the object was not found, since it was "too soon" and not returned by the informer cache. These tests should have been rightfuly red. The were flaking to green.

[Why] To avoid mismatch version between Ginkgo version in Go mod vs the Ginkgo CLI installed locally in GOBIN. The change to the port number in testenv is to avoid using ports outside of valid range (1-65536) when Ginkgo parallel processes was over 10. Previously, the process number 10 would render port 81810, which is outside of the valid range.

Controller tests are flaking way too often. One of the test flakes was around the "retain" feature. These tests were buggy and should have been failing consistently; they were flaking to green. The problem was a create-delete right away, without awaiting for validation on creation. Sometimes, the create was registered in etcd, and it was causing a legitimate failure (due to test setup, production code was fine). When the create did not complete and, in parallel, the deletion executed, it skipped the deletion (because the object did not exist in etcd yet), therefore, it skipped the HTTP calls, passing the test assertion (incorrectly). The other refactor is to provide better test isolation. Some flakes were happening, rarely, due to test polution, where the manager for one test was reconciling the object for another test. After the refactor, each manager is constrained to its own test suite using label selectors and namespace isolation. This adds a layer of complexity to writing tests for controllers, however, the setup can be copy-pasted for new tests. Hopefully that will alleviate the complexity of writing these tests.

There was test pollution from tests that set vhost limits. The vhost limits in RMQ HTTP API are implemented as separate API endpoints. When a vhost has a limit, the controller declares the vhost first, then declares the limits. The flake happened when a "limits" tests ran before the "deletion" tests, and left behind a value for `vhostLimits` variable. This variable is taken unconditionally to initialise a vhost variable (as it should). The flake is simply fixed by setting the `vhostLimits` variable back to `nil` after each "limits" test. Additionally, the "creation" tests should not delete the objects after each test suite, because the fake client is not prepared to return appropriate responses to the delete requests, and that would leave the managers in an infinite reconcile loop, which can delay manager shutdown after each test.

Zerpet added 4 commits May 20, 2025 12:26

Zerpet added this to the v1.17.1 milestone May 22, 2025

Zerpet self-assigned this May 22, 2025

Zerpet requested review from MirahImage and mkuratczyk May 22, 2025 12:08

MirahImage approved these changes May 22, 2025

View reviewed changes

Zerpet mentioned this pull request May 22, 2025

Fix logging key error #992

Merged

mkuratczyk approved these changes May 23, 2025

View reviewed changes

Zerpet merged commit da9ed93 into main May 23, 2025
7 checks passed

Zerpet deleted the flake-hunting branch May 23, 2025 08:26

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Fix flakes in integration suite #991

Fix flakes in integration suite #991

Uh oh!

Zerpet commented May 22, 2025

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Fix flakes in integration suite #991

Fix flakes in integration suite #991

Uh oh!

Conversation

Zerpet commented May 22, 2025

Summary Of Changes

Additional Context

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants