@trask
Owner

@trask trask commented Aug 5, 2025

Problem

The ./gradlew opamp-client:test task fails intermittently in GitHub Actions but cannot be reproduced locally. This PR adds detailed debugging output to help identify and fix the issue.

Changes

Enhanced Test Debugging

  • Added detailed logging to all test methods with thread names and timestamps
  • Increased timeouts from 1s to 10s with 100ms polling intervals
  • Enhanced callback implementations with detailed logging
  • Added proper resource cleanup in tearDown method
  • Added a stress test that runs 10 iterations to increase the chance of reproducing the flakiness
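
As a rough illustration of the timeout and polling change above, here is a minimal sketch of a polling wait with a 10s timeout and a 100ms poll interval. The class name PollingAwait, the method awaitCondition, and the counter onMessageCalls are hypothetical names chosen for this example; the actual test in this PR may use a different mechanism (for instance a test library) to wait on callbacks.

```java
import java.time.Duration;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.BooleanSupplier;

// Hypothetical helper, not code from this PR: polls a condition until it holds or a timeout expires.
final class PollingAwait {
  private PollingAwait() {}

  /** Returns true if the condition became true before the timeout, false otherwise. */
  static boolean awaitCondition(BooleanSupplier condition, Duration timeout, Duration pollInterval) {
    long deadlineNanos = System.nanoTime() + timeout.toNanos();
    while (true) {
      if (condition.getAsBoolean()) {
        return true;
      }
      if (System.nanoTime() - deadlineNanos >= 0) {
        return false; // timed out without the condition holding
      }
      System.out.println(
          "[DEBUG] Still waiting, Thread: "
              + Thread.currentThread().getName()
              + ", Time: "
              + System.currentTimeMillis());
      try {
        Thread.sleep(pollInterval.toMillis());
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt(); // restore the interrupt flag and give up
        return false;
      }
    }
  }

  public static void main(String[] args) {
    AtomicInteger onMessageCalls = new AtomicInteger();

    // Simulates a callback that arrives late on another thread.
    Thread callbackThread =
        new Thread(
            () -> {
              try {
                Thread.sleep(250);
              } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
              }
              onMessageCalls.incrementAndGet();
            },
            "callback-thread");
    callbackThread.start();

    boolean reached =
        awaitCondition(
            () -> onMessageCalls.get() == 1, Duration.ofSeconds(10), Duration.ofMillis(100));
    System.out.println("Condition reached: " + reached);
  }
}
```

A longer timeout only hides slow callbacks; it would not help if the assertion itself is wrong, which is why the per-poll log lines are useful for telling the two cases apart.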

GitHub Actions Debugging Workflow

  • Matrix strategy testing Java 17 & 21 across 5 iterations each
  • Sequential stress testing (20 runs)
  • Different JVM options testing (SerialGC, G1GC, ParallelGC)
  • Comprehensive system information logging
  • Test timing analysis and artifact collection on failures

Debug Output Examples

Tests now show:

  • [DEBUG] Test setUp - Thread: main, Time: 1704067200000
  • [DEBUG] TestCallbacks.onConnect() called - count: 1, Thread: pool-1-thread-1
  • [DEBUG] Current onMessageCalls: 0, Thread: main, Time: 1704067201000
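
A line in the first format shown above could be produced by a small helper along these lines. This is only a sketch: DebugLog, format, and debug are hypothetical names, and the PR's tests may simply call System.out.println inline.

```java
// Hypothetical helper, not code from this PR: builds debug lines with thread name and timestamp.
final class DebugLog {
  private DebugLog() {}

  /** Formats a message the same way as the "Test setUp" example line. */
  static String format(String message, String threadName, long timeMillis) {
    return "[DEBUG] " + message + " - Thread: " + threadName + ", Time: " + timeMillis;
  }

  /** Prints a message tagged with the current thread and wall-clock time. */
  static void debug(String message) {
    System.out.println(
        format(message, Thread.currentThread().getName(), System.currentTimeMillis()));
  }

  public static void main(String[] args) {
    debug("Test setUp"); // thread name and time depend on the run

    System.out.println(format("Test setUp", "main", 1704067200000L));
    // prints: [DEBUG] Test setUp - Thread: main, Time: 1704067200000
  }
}
```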

Expected Outcomes

  1. Reproduce flakiness - The stress test and multiple iterations should catch the race condition
  2. Identify root cause - Debug logs will show timing patterns and thread issues
  3. Environment differences - Different JVM options may reveal GC-related timing issues
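
The idea behind the stress test in outcome 1 can be sketched as a loop that runs the same scenario repeatedly and records which iterations fail. StressRunner and runRepeatedly are hypothetical names, and the failing scenario in main is simulated; the real stress test in this PR exercises the OpAMP client instead.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not code from this PR: repeats a scenario and collects failing iterations.
final class StressRunner {
  private StressRunner() {}

  /** Runs the scenario the given number of times; returns the 1-based iterations that threw. */
  static List<Integer> runRepeatedly(int iterations, Runnable scenario) {
    List<Integer> failedIterations = new ArrayList<>();
    for (int i = 1; i <= iterations; i++) {
      try {
        scenario.run();
      } catch (RuntimeException | AssertionError e) {
        failedIterations.add(i);
        System.out.println(
            "[DEBUG] Iteration " + i + " failed: " + e
                + ", Thread: " + Thread.currentThread().getName());
      }
    }
    return failedIterations;
  }

  public static void main(String[] args) {
    int[] runCount = {0};

    // Simulated intermittent failure: every fourth run throws.
    List<Integer> failed =
        runRepeatedly(
            10,
            () -> {
              if (++runCount[0] % 4 == 0) {
                throw new IllegalStateException("simulated intermittent failure");
              }
            });

    System.out.println("Failed iterations: " + failed); // Failed iterations: [4, 8]
  }
}
```

Continuing after a failure, rather than stopping at the first one, gives a failure rate per run, which is more informative when comparing JVM options or Java versions.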

Next Steps

After this PR shows the issue pattern, I'll create a follow-up PR with the actual fix.

This is specifically for debugging the flaky test issue reported in GitHub Actions.

@trask trask force-pushed the main branch 3 times, most recently from edb606e to 2c3ef47 Compare August 5, 2025 22:46
Collaborator

@trasktest trasktest left a comment

testing

@trasktest trasktest self-requested a review August 12, 2025 21:22
trasktest previously approved these changes Aug 12, 2025
trask added 6 commits August 12, 2025 14:30
- Added comprehensive logging to OpampClientImplTest to track timing issues
- Increased timeouts from 1s to 10s with detailed polling info
- Added thread names and timestamps to all debug output
- Added stress test to reproduce flakiness (10 iterations)
- Enhanced tearDown to properly clean up resources
- Added GitHub Actions workflow to run tests multiple times with different JVM options
- Tests now show detailed callback timing and state changes

This will help identify race conditions and timing issues in CI that aren't
reproducible locally.

- Added @SuppressWarnings for SystemOut, CatchingUnchecked, and InterruptedExceptionSwallowed
- This allows the debug logging code to compile in CI environments with -Werror

The whenServerProvidesNewInstanceUid_useIt test was using reference comparison
(!=) instead of content comparison (Arrays.equals) for byte arrays. This caused
the test to be flaky and sometimes fail in CI environments.

Changes:
- Fixed array comparison from '!=' to '!Arrays.equals()'
- Changed server-provided UID from {1,2,3} to {4,5,6} to ensure it differs from initial UID
- Cleaned up debug logging added during investigation
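
The difference between the two comparisons can be seen in a minimal standalone sketch. The class ByteArrayComparison and its method and variable names are hypothetical and are not taken from the test; only the != versus Arrays.equals distinction and the {1,2,3} and {4,5,6} values come from the commit message.

```java
import java.util.Arrays;

// Hypothetical illustration, not code from this PR: reference vs. content comparison of byte arrays.
final class ByteArrayComparison {
  private ByteArrayComparison() {}

  /** True when the two variables point at different array objects, regardless of content. */
  static boolean differsByReference(byte[] a, byte[] b) {
    return a != b;
  }

  /** True when the arrays differ in length or in at least one element. */
  static boolean differsByContent(byte[] a, byte[] b) {
    return !Arrays.equals(a, b);
  }

  public static void main(String[] args) {
    byte[] initialUid = {1, 2, 3};
    byte[] copyOfInitialUid = {1, 2, 3};
    byte[] serverProvidedUid = {4, 5, 6};

    // Same content, different objects: the reference check reports a difference anyway.
    System.out.println(differsByReference(initialUid, copyOfInitialUid)); // true
    System.out.println(differsByContent(initialUid, copyOfInitialUid)); // false

    // Different content: the content check reports the difference reliably.
    System.out.println(differsByContent(initialUid, serverProvidedUid)); // true
  }
}
```

Because != only checks object identity, a condition written with it says nothing about whether the UID bytes actually changed, so its result depends on which array instance happens to be observed.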

Fixes the flaky test behavior reported in CI runs.

This workflow tests the specific fixed test across:
- Java 17 and 21
- 10 iterations each (20 total runs)
- Only the whenServerProvidesNewInstanceUid_useIt test

This will prove the array comparison fix eliminates flakiness.
@trask trask force-pushed the flaky branch 2 times, most recently from e9fb107 to 09047ad Compare August 12, 2025 21:34
@trask trask force-pushed the main branch 8 times, most recently from 2a3699a to e4093f4 Compare August 14, 2025 21:18
@trask trask closed this Aug 18, 2025
@trask trask deleted the flaky branch October 20, 2025 17:25