
Conversation


@alpeb alpeb commented Dec 2, 2025

Policy tests are very flaky. Currently one of the main culprits is that service account creation sometimes isn't caught as an event by the watcher, blocking `await_service_account` until it times out after 60s. We already retry up to 3 times when calling `cargo nextest`, but these tests run sequentially, so the 60s timeouts accumulate until we hit the CI job timeout at 20min.

This change first lowers the service account creation timeout down to 15s, on the understanding that if the watcher catches that event at all, it does so quickly; otherwise it blocks indefinitely. So it's better to fail fast and trigger the test retry ASAP.
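The fail-fast reasoning can be sketched as a bounded wait. This is a hypothetical stand-in, not the actual test code: a channel plays the role of the watcher's event stream, where the creation event either arrives almost immediately or never arrives at all.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

// Hypothetical stand-in for awaiting a watcher event: the event either
// arrives quickly or never does, so a tight bound loses nothing.
fn await_event(rx: &mpsc::Receiver<String>, timeout: Duration) -> Result<String, String> {
    rx.recv_timeout(timeout)
        .map_err(|_| format!("timed out after {timeout:?}"))
}

fn main() {
    let (tx, rx) = mpsc::channel();
    // Simulate the watcher delivering the creation event quickly.
    thread::spawn(move || {
        thread::sleep(Duration::from_millis(10));
        let _ = tx.send("serviceaccount/default created".to_string());
    });
    // A 15s bound is plenty when the event comes; when it never comes,
    // failing at 15s instead of 60s lets the nextest retry kick in sooner.
    match await_event(&rx, Duration::from_secs(15)) {
        Ok(ev) => println!("got event: {ev}"),
        Err(e) => println!("fail fast: {e}"),
    }
}
```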

With this change, test-policy (v1.34, linkerd, experimental) is finally passing, taking 17m due to the large number of retries:

     Summary [ 779.409s] 151 tests run: 151 passed (15 flaky), 0 skipped
   FLAKY 2/4 [   0.069s] linkerd-policy-test::admit_network_authentication rejects_invalid_cidr
   FLAKY 3/4 [  15.019s] linkerd-policy-test::e2e_audit ns_audit
   FLAKY 2/4 [  10.964s] linkerd-policy-test::e2e_authorization_policy targets_route
   FLAKY 3/4 [   3.830s] linkerd-policy-test::e2e_egress_network default_traffic_policy_http_allow
   FLAKY 2/4 [  37.004s] linkerd-policy-test::e2e_http_local_ratelimit_policy ratelimit_total
   FLAKY 2/4 [   7.947s] linkerd-policy-test::e2e_server_authorization network
   FLAKY 2/4 [   0.142s] linkerd-policy-test::inbound_http_route_status inbound_accepted_parent
   FLAKY 2/4 [   0.167s] linkerd-policy-test::inbound_http_route_status inbound_multiple_parents
   FLAKY 2/4 [   1.681s] linkerd-policy-test::outbound_api multiple_routes
   FLAKY 2/4 [   1.013s] linkerd-policy-test::outbound_api routes_without_backends
   FLAKY 3/4 [   1.153s] linkerd-policy-test::outbound_api service_with_routes_with_cross_namespace_backend
   FLAKY 2/4 [   0.282s] linkerd-policy-test::outbound_api_failure_accrual consecutive_failure_accrual
   FLAKY 3/4 [   0.290s] linkerd-policy-test::outbound_api_failure_accrual default_failure_accrual
   FLAKY 2/4 [   0.354s] linkerd-policy-test::outbound_api_http http_route_gateway_timeouts
   FLAKY 2/4 [   0.740s] linkerd-policy-test::outbound_api_http http_route_retries_and_timeouts

After measuring this, we also added a check in `await_service_account` that bypasses the watcher logic when the SA is already in place. This brought the same tests down to 12m, with far less flakiness:

     Summary [ 517.330s] 151 tests run: 151 passed (2 flaky), 0 skipped
   FLAKY 2/4 [   0.941s] linkerd-policy-test::outbound_api routes_without_backends
   FLAKY 2/4 [   0.459s] linkerd-policy-test::outbound_api_tcp multiple_tcp_routes
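The bypass described above can be sketched as a synchronous existence check before falling back to the watch. All names here are hypothetical; the real test talks to the Kubernetes API, while this sketch uses an in-memory set for the "get" and a channel for the watch:

```rust
use std::collections::HashSet;
use std::sync::mpsc;
use std::time::Duration;

// Hypothetical stand-in for a direct API get: true if the ServiceAccount
// already exists, in which case no watch event is needed at all.
fn sa_exists(store: &HashSet<&str>, name: &str) -> bool {
    store.contains(name)
}

// Sketch of the amended await: check first, only then wait on the watcher.
fn await_service_account(
    store: &HashSet<&str>,
    name: &str,
    rx: &mpsc::Receiver<String>,
    timeout: Duration,
) -> Result<(), String> {
    if sa_exists(store, name) {
        return Ok(()); // already in place: skip the watch entirely
    }
    rx.recv_timeout(timeout)
        .map(|_| ())
        .map_err(|_| format!("no creation event for {name} within {timeout:?}"))
}

fn main() {
    let store = HashSet::from(["default"]);
    let (_tx, rx) = mpsc::channel::<String>();
    // The SA already exists, so this returns immediately even though the
    // watcher never delivers an event.
    assert!(await_service_account(&store, "default", &rx, Duration::from_secs(15)).is_ok());
    println!("bypassed watcher: ok");
}
```

The design point is that the watch is only a fallback: when the creation event was missed, the direct check still observes the object, so the test no longer depends on catching the event at all.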

@alpeb requested a review from a team as a code owner (Dec 2, 2025 00:25)
@alpeb changed the title from "test(policy): avoid timeouts due to flakiness" to "test(policy): address timeouts and flakiness" (Dec 2, 2025)
@adleong (Member) left a comment


I wonder if the reduction in flakiness we get here is just because checking for the service account with a synchronous call takes time, allowing more time for the namespace to be persisted. I.e., I wonder if the service account get is roughly equivalent to a sleep here.

I also wonder if awaiting the namespace showing up in a watch could get rid of the flakiness entirely, by guaranteeing that Kubernetes is ready for us to initiate the namespaced service account watch.
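The alternative the comment raises would order the two waits explicitly: block until the namespace itself appears in a watch, and only then start the namespaced ServiceAccount watch. A minimal sketch of that ordering, with all names hypothetical and a channel standing in for the namespace watch stream:

```rust
use std::sync::mpsc;
use std::time::Duration;

// Sketch of the suggested ordering: wait for the namespace event first, so
// the namespaced ServiceAccount watch can never race namespace persistence.
fn namespace_then_sa_watch(
    ns_events: &mpsc::Receiver<String>,
    timeout: Duration,
) -> Result<String, String> {
    let ns = ns_events
        .recv_timeout(timeout)
        .map_err(|_| "namespace never appeared in the watch".to_string())?;
    // The real code would initiate the ServiceAccount watch scoped to `ns` here.
    Ok(ns)
}

fn main() {
    let (tx, rx) = mpsc::channel();
    tx.send("policy-test-ns".to_string()).unwrap();
    let ns = namespace_then_sa_watch(&rx, Duration::from_secs(15)).unwrap();
    println!("watching service accounts in {ns}");
}
```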

