Debug flaky OpAMP client tests in GitHub Actions #85

trask · 2025-08-05T01:27:33Z

Problem

The ./gradlew opamp-client:test is flaky in GitHub Actions but not reproducible locally. This PR adds comprehensive debugging to identify and fix the issue.

Changes

Enhanced Test Debugging

Added detailed logging to all test methods with thread names and timestamps
Increased timeouts from 1s to 10s with 100ms polling intervals
Enhanced callback implementations with detailed logging
Added proper resource cleanup in tearDown method
Added stress test that runs 10 iterations to increase flakiness reproduction

GitHub Actions Debugging Workflow

Matrix strategy testing Java 17 & 21 across 5 iterations each
Sequential stress testing (20 runs)
Different JVM options testing (SerialGC, G1GC, ParallelGC)
Comprehensive system information logging
Test timing analysis and artifact collection on failures

Debug Output Examples

Tests now show:

[DEBUG] Test setUp - Thread: main, Time: 1704067200000
[DEBUG] TestCallbacks.onConnect() called - count: 1, Thread: pool-1-thread-1
[DEBUG] Current onMessageCalls: 0, Thread: main, Time: 1704067201000

Expected Outcomes

Reproduce flakiness - The stress test and multiple iterations should catch the race condition
Identify root cause - Debug logs will show timing patterns and thread issues
Environment differences - Different JVM options may reveal GC-related timing issues

Next Steps

After this PR shows the issue pattern, I'll create a follow-up PR with the actual fix.

This is specifically for debugging the flaky test issue reported in GitHub Actions.

This reverts commit 7d195ab.

trasktest

testing

- Added comprehensive logging to OpampClientImplTest to track timing issues - Increased timeouts from 1s to 10s with detailed polling info - Added thread names and timestamps to all debug output - Added stress test to reproduce flakiness (10 iterations) - Enhanced tearDown to properly clean up resources - Added GitHub Actions workflow to run tests multiple times with different JVM options - Tests now show detailed callback timing and state changes This will help identify race conditions and timing issues in CI that aren't reproducible locally.

@SuppressWarnings

- Added @SuppressWarnings for SystemOut, CatchingUnchecked, and InterruptedExceptionSwallowed - This allows the debug logging code to compile in CI environments with -Werror

The whenServerProvidesNewInstanceUid_useIt test was using reference comparison (!=) instead of content comparison (Arrays.equals) for byte arrays. This caused the test to be flaky and sometimes fail in CI environments. Changes: - Fixed array comparison from '!=' to '!Arrays.equals()' - Changed server-provided UID from {1,2,3} to {4,5,6} to ensure it differs from initial UID - Cleaned up debug logging added during investigation Fixes the flaky test behavior reported in CI runs.

This workflow tests the specific fixed test across: - Java 17 and 21 - 10 iterations each (20 total runs) - Only the whenServerProvidesNewInstanceUid_useIt test This will prove the array comparison fix eliminates flakiness.

asdf

trask force-pushed the main branch 3 times, most recently from edb606e to 2c3ef47 Compare August 5, 2025 22:46

trask added 2 commits August 5, 2025 15:47

Increase timeout in opamp-client tests

7d195ab

Revert "Increase timeout in opamp-client tests"

15b1242

This reverts commit 7d195ab.

trask force-pushed the main branch from 2c3ef47 to 15b1242 Compare August 5, 2025 23:16

TEST

c6515cd

trasktest approved these changes Aug 12, 2025

View reviewed changes

trasktest self-requested a review August 12, 2025 21:22

trasktest previously approved these changes Aug 12, 2025

View reviewed changes

trask force-pushed the flaky branch from 04f4eee to e9fb107 Compare August 12, 2025 21:29

trask added 6 commits August 12, 2025 14:30

Fix compilation error: use getRequestUrl() instead of getPath()

9a55c1e

Fix compilation errors in debugging code

974ef77

Add @SuppressWarnings to fix ErrorProne compilation errors

abb0572

- Added @SuppressWarnings for SystemOut, CatchingUnchecked, and InterruptedExceptionSwallowed - This allows the debug logging code to compile in CI environments with -Werror

Add focused workflow to verify OpAMP client fix

09047ad

This workflow tests the specific fixed test across: - Java 17 and 21 - 10 iterations each (20 total runs) - Only the whenServerProvidesNewInstanceUid_useIt test This will prove the array comparison fix eliminates flakiness.

trask force-pushed the flaky branch from e9fb107 to 09047ad Compare August 12, 2025 21:30

trasktest approved these changes Aug 12, 2025

View reviewed changes

trask force-pushed the flaky branch 2 times, most recently from e9fb107 to 09047ad Compare August 12, 2025 21:34

trask force-pushed the main branch 8 times, most recently from 2a3699a to e4093f4 Compare August 14, 2025 21:18

trask closed this Aug 18, 2025

trask deleted the flaky branch October 20, 2025 17:25

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Uh oh!

Debug flaky OpAMP client tests in GitHub Actions #85

Debug flaky OpAMP client tests in GitHub Actions #85

Uh oh!

trask commented Aug 5, 2025

Uh oh!

trasktest left a comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Debug flaky OpAMP client tests in GitHub Actions #85

Debug flaky OpAMP client tests in GitHub Actions #85

Uh oh!

Conversation

trask commented Aug 5, 2025

Problem

Changes

Enhanced Test Debugging

GitHub Actions Debugging Workflow

Debug Output Examples

Expected Outcomes

Next Steps

Uh oh!

trasktest left a comment

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants