Skip to content

Conversation

@Zerpet
Copy link
Member

@Zerpet Zerpet commented May 22, 2025

Note to reviewers: remember to look at the commits in this PR and consider if they can be squashed
Note to contributors: remember to re-generate client set if there are any API changes

Summary Of Changes

  • Fix deletion policy tests
  • Use go run for ginkgo CLI
  • Refactor controller tests
  • Fix vhost controller test flakes

Additional Context

The "retain" tests were had a flaky behaviour. Sometimes they were passing due to a flake. The "retain" tests were
creating and deleting their topology objects one call right after
the other, without asserting that the object was created (from etcd PoV). When the object deletion logic started
"too soon", before etcd recorded the object creation, the tests were passing because the controller has a logic
to skip deletion if the object is not found. They were also very sensible to test pollution (which was already
present, it was not introduced by the "retain" contribution). Other controllers could overwrite the retain policy
to its default delete, causing the "retain" tests to rightfuly fail.

To fix the test pollution, the refactor changed the manager setup, to use label selectors for the controller
informer cache. In addition to the existing namespace filtering. Due to this change, the manager has to start
later in the test setup, because the tests use the topology object name as label, and this value is not known
to the setup logic until its very close to the test.

Finally, the last remain of test pollution was in the vhost "limits" test. These tests left behind a variable
that is always used in the vhost object initialisation. When the "limits" tests ran before the "delete" tests,
these would cause a flake. This did not always happen because we (correctly) randomise the test execution order
with ginkgo.

Zerpet added 4 commits May 20, 2025 12:26
The deletion policy was not being set at all for tests. Why? Because
BeforeEach nodes run BEFORE the JustBeforeEach. Therefore, the
BeforeEach was setting the deletion policy in the spec, and right after
the JustAfterEach in the top container was assigning a "zero" value for
the topology type. This effectively "overrode" the value set for
deletion policy.

This was probably passing because Create/Delete were called one after
the other, and sometimes the deletion behaviour simply skipped because
the object was not found, since it was "too soon" and not returned by
the informer cache.

These tests should have been rightfuly red. The were flaking to green.
[Why]
To avoid mismatch version between Ginkgo version in Go mod vs the Ginkgo
CLI installed locally in GOBIN.

The change to the port number in testenv is to avoid using ports outside
of valid range (1-65536) when Ginkgo parallel processes was over 10.
Previously, the process number 10 would render port 81810, which is
outside of the valid range.
Controller tests are flaking way too often. One of the test flakes was
around the "retain" feature. These tests were buggy and should have been
failing consistently; they were flaking to green. The problem was a
create-delete right away, without awaiting for validation on creation.
Sometimes, the create was registered in etcd, and it was causing a
legitimate failure (due to test setup, production code was fine). When
the create did not complete and, in parallel, the deletion executed, it
skipped the deletion (because the object did not exist in etcd yet),
therefore, it skipped the HTTP calls, passing the test assertion
(incorrectly).

The other refactor is to provide better test isolation. Some flakes were
happening, rarely, due to test polution, where the manager for one test
was reconciling the object for another test. After the refactor, each
manager is constrained to its own test suite using label selectors and
namespace isolation. This adds a layer of complexity to writing tests
for controllers, however, the setup can be copy-pasted for new tests.
Hopefully that will alleviate the complexity of writing these tests.
There was test pollution from tests that set vhost limits. The vhost
limits in RMQ HTTP API are implemented as separate API endpoints. When a
vhost has a limit, the controller declares the vhost first, then
declares the limits. The flake happened when a "limits" tests ran before
the "deletion" tests, and left behind a value for `vhostLimits`
variable. This variable is taken unconditionally to initialise a vhost
variable (as it should). The flake is simply fixed by setting the
`vhostLimits` variable back to `nil` after each "limits" test.

Additionally, the "creation" tests should not delete the objects after
each test suite, because the fake client is not prepared to return
appropriate responses to the delete requests, and that would leave the
managers in an infinite reconcile loop, which can delay manager shutdown
after each test.
@Zerpet Zerpet added this to the v1.17.1 milestone May 22, 2025
@Zerpet Zerpet self-assigned this May 22, 2025
@Zerpet Zerpet requested review from MirahImage and mkuratczyk May 22, 2025 12:08
@Zerpet Zerpet mentioned this pull request May 22, 2025
@Zerpet Zerpet merged commit da9ed93 into main May 23, 2025
7 checks passed
@Zerpet Zerpet deleted the flake-hunting branch May 23, 2025 08:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants