Skip to content

Conversation

Anshul-creator
Copy link

What

This PR addresses the flakiness in Tls13SelectorTest.testCloseOldestConnection, which intermittently failed to observe idle connection expiry under TLSv1.3.

Reproduction of flakiness

To reproduce the flaky behavior prior to this fix:

for i in {1..10}; do
  ./gradlew :clients:test \
    --tests 'org.apache.kafka.common.network.Tls13SelectorTest.testCloseOldestConnection' \
    -Pkafka.test.run.flaky=true \
    -PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
    --rerun-tasks --no-daemon \
  || { echo "Failed on run $i"; break; }
done

You can increase the loop count to 20 or more if needed. In my environment, running 10 iterations was enough to reproduce the failure consistently, so I used 10.

Why the flakiness

The base test in SelectorTest assumed that once a connection is established, idle time measurement could begin immediately. This works under PLAINTEXT, but under TLSv1.3:

  • Post-handshake records (e.g., NewSessionTicket) often arrive after the handshake completes.
  • If the test jumps the mock clock before draining these records, the first poll processes them and updates the channel’s last-activity timestamp to the advanced time.
  • As a result, the connection never appears idle, and the test fails.

How

This PR makes the TLSv1.3 override robust by:

  • Waiting for the channel to be READY instead of just CONNECTED.
  • Performing a short selector.poll(50) immediately after READY to drain any post-handshake records at the current time.
  • Advancing the mock time beyond the configured idle threshold (parent uses 5s, here we use 10s for margin).
  • Using TestUtils.waitForCondition with small polls until the channel is observed as ChannelState.EXPIRED.

These changes ensure idle expiry is deterministically surfaced without altering production logic.


Verification of stability

After applying the changes, the same looped command can be used to confirm the test no longer flakes:

for i in {1..20}; do
  ./gradlew :clients:test \
    --tests 'org.apache.kafka.common.network.Tls13SelectorTest.testCloseOldestConnection' \
    -Pkafka.test.run.flaky=true \
    -PmaxParallelForks=1 -Dorg.gradle.workers.max=1 -Dtest.forkEvery=1 \
    --rerun-tasks --no-daemon \
  || { echo "Failed on run $i"; break; }
done

Here the loop count is set to 20 to provide stronger assurance of stability, though fewer iterations may already be sufficient.

Scope

  • Test-only change. No production code modified.
  • Existing assertions retained (EXPIRED state observed).

JIRA

KAFKA-14249

@github-actions github-actions bot added triage PRs from the community tests Test fixes (including flaky tests) clients small Small PRs labels Oct 1, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
clients small Small PRs tests Test fixes (including flaky tests) triage PRs from the community
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant