Skip to content

Conversation

@JSOD11
Copy link
Contributor

@JSOD11 JSOD11 commented Nov 11, 2025

Resolves #126: Fix flaky test

I believe the intention of the test is as follows:

  • Create a TTL map with ttl: 10ms and tick: 2ms (garbage collector runs every tick, entries expire after 5 ticks)
  • Spawn tokio tasks that perform various asynchronous insertions and deletions on the map
  • Wait until the tokio tasks complete
  • Assert that the map empties itself within 2x the TTL (20ms)

Say the final insertion of an entry into the map by an arbitrary tokio task occurs at time T. Then in an idealized setting we'd expect this entry to be removed by the ttl map garbage collector at time T + TTL, so it seems reasonable to assert that the map should be empty by T + 2 * TTL.

The issue is:

  • Our TTL here is very small, 10ms
  • The way we actually perform this check within the test is with a function called assert_eventually (defined just above the test) that performs a check every 10ms
  • The garbage collector here is running every 2ms independent of insertion time
  • Crowded CI machines will additionally suffer delays of a few ms
image

In the image, the blue lines represent the garbage collection operation, which in this configuration will occur every 2ms (size of tick). It's possible for the final entry to be inserted just after a gc cycle, meaning that when assert_eventually reaches T+10ms, the final entry has not expired. The entry is collected just afterwards, but looking at the way assert_eventually is written, it does not actually check the final bound, meaning it is only checking 1 * TTL in this configuration, leading to the flake.

Combining the above with crowded CI machines that may also arbitrarily delay operations by a few ms, this test becomes noticeably flaky.

To resolve this problem, we boost the window to 100ms, which should be more than enough to resolve flakiness and will not meaningfully increase CI time.

@JSOD11 JSOD11 force-pushed the jsod/fix-flake-11-10-25 branch from 83c1762 to af16581 Compare November 13, 2025 23:26
@JSOD11 JSOD11 changed the title Relax timing constraints fix(datafusion-distributed): relax ttl map timing constraints Nov 13, 2025
@JSOD11 JSOD11 marked this pull request as ready for review November 13, 2025 23:41
Copy link
Collaborator

@gabotechs gabotechs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Really good explanation! +1

@gabotechs gabotechs merged commit dd853a7 into datafusion-contrib:main Nov 14, 2025
4 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Flaky 'common::ttl_map::tests::test_concurrent_gc_and_access' test

2 participants