Skip to content

Conversation

Copy link
Contributor

Copilot AI commented Nov 21, 2025

Tests using Thread.sleep() are inherently flaky on overloaded CI/CD systems. When machines are under heavy load, fixed sleep durations become insufficient, causing intermittent failures.

Changes

Replaced 8 occurrences of Thread.sleep() across 7 test files with proper synchronization:

  • DefaultDelayingQueueTest: Removed unnecessary 10s sleep post-operation
  • SharedProcessorTest: Replaced sleep with CountDownLatch.await() for interrupt testing
  • DefaultWorkQueueTest: Removed artificial delays in producer/consumer threads
  • PortForwardTest: Added CountDownLatch for thread startup coordination
  • BucketRateLimiterTest: Replaced fixed sleep with polling loop
  • KubernetesReconcilerCreatorTest: Polling with timeout for queue population
  • EventCorrelatorTest: Polling loop for timestamp deltas

Example

Before:

Thread t = new Thread(() -> {
    try {
        Thread.sleep(10 * 1000);  // Hope this is enough time
    } catch (InterruptedException e) {
        interrupted[0] = true;
    }
});

After:

CountDownLatch startedLatch = new CountDownLatch(1);
Thread t = new Thread(() -> {
    try {
        startedLatch.countDown();
        new CountDownLatch(1).await();  // Wait indefinitely until interrupted
    } catch (InterruptedException e) {
        interrupted[0] = true;
    }
});
startedLatch.await();  // Ensure thread has started

Tests now complete as soon as conditions are met rather than waiting for arbitrary durations, improving both reliability and speed.

Original prompt

This section details on the original issue you should resolve

<issue_title>[Umbrella Issue] Many tests use Thread.sleep instead of synchronization primitives</issue_title>
<issue_description>Many of our tests simply have a Thread.sleep(duration) when testing multi-threaded code. This is inherently flaky.

Instead we should use synchronization classes (generally Semaphore) to synchronized the tests which is both faster in the general case and not flaky in the extreme case.

See for instance #1218 and #1219 </issue_description>

<agent_instructions>Please deflake any test which uses Thread.Sleep by replacing the sleep with synchronization primitives like countdown latches.</agent_instructions>

Comments on the Issue (you are @copilot in this section)

@brendandburns - [ ] util/src/test/java/io/kubernetes/client/informer/cache/ReflectorRunnableTest.java - [ ] extended/src/test/java/io/kubernetes/client/extended/workqueue/DefaultDelayingQueueTest.java - [ ] util/src/test/java/io/kubernetes/client/informer/cache/ProcessorListenerTest.java - [ ] util/src/test/java/io/kubernetes/client/informer/cache/SharedProcessorTest.java - [x] util/src/test/java/io/kubernetes/client/CopyTest.java - [ ] util/src/test/java/io/kubernetes/client/informer/cache/ControllerTest.java - [ ] extended/src/test/java/io/kubernetes/client/extended/workqueue/DefaultWorkQueueTest.java - [ ] extended/src/test/java/io/kubernetes/client/extended/leaderelection/LeaderElectionTest.java - [ ] util/src/test/java/io/kubernetes/client/informer/impl/DefaultSharedIndexInformerTest.java - [ ] extended/src/test/java/io/kubernetes/client/extended/controller/LeaderElectingControllerTest.java - [ ] util/src/test/java/io/kubernetes/client/PortForwardTest.java - [ ] util/src/test/java/io/kubernetes/client/AttachTest.java - [ ] util/src/test/java/io/kubernetes/client/ExecTest.java - [ ] extended/src/test/java/io/kubernetes/client/extended/workqueue/ratelimiter/BucketRateLimiterTest.java - [ ] spring/src/test/java/io/kubernetes/client/spring/extended/controller/KubernetesInformerCreatorTest.java - [ ] spring/src/test/java/io/kubernetes/client/spring/extended/controller/KubernetesReconcilerCreatorTest.java - [ ] extended/src/test/java/io/kubernetes/client/extended/event/EventCorrelatorTest.java @yue9944882 @brendandburns thanks for sorting out these potentially flaking tests.. will fix them later!

/assign</comment_new>
<comment_new>@brendandburns
@sarveshkaushal we'd welcome the help. In most cases the Thread.sleep(foo) is to wait for some other thread to complete it's work. Instead, we should create a Semaphore, then acquire() one or more permits for the executions we're waiting for, and then release() them in the Thread, then attempt to acquire() in the test thread to implement the wait.

See an example PR #1219

The reason that Thread.sleep() is flaky is because in a CI/CD system like GitHub actions the machines can be massively overloaded, which leads to much longer delays than you might otherwise see when running tests, so the sleep(..) ends up not being long enough. You can always make the sleep longer, but that makes the tests run longer (and it's still flaky in extreme cases) so using synchronization primitives is (almost) always the right answer.</comment_new>
<comment_new>@brendandburns
/lifecycle frozen</comment_new>


💬 We'd love your input! Share your thoughts on Copilot coding agent in our 2 minute survey.

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 21, 2025
@k8s-ci-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: Copilot
Once this PR has been reviewed and has the lgtm label, please ask for approval from brendandburns. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added size/M Denotes a PR that changes 30-99 lines, ignoring generated files. and removed size/XS Denotes a PR that changes 0-9 lines, ignoring generated files. labels Nov 21, 2025
Copilot AI changed the title [WIP] Refactor tests to use synchronization instead of Thread.sleep Replace Thread.sleep with synchronization primitives in tests Nov 21, 2025
Copilot finished work on behalf of brendandburns November 21, 2025 04:06
Copilot AI requested a review from brendandburns November 21, 2025 04:06
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/M Denotes a PR that changes 30-99 lines, ignoring generated files.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Umbrella Issue] Many tests use Thread.sleep instead of synchronization primitives

3 participants