Fix flaky gauge tests in TelemetryMetricsEnabledSanityIT #21020

Open
andrross wants to merge 1 commit into opensearch-project:main from andrross:fix/flaky-gauge-tests
@andrross
Member

Both testGauge and testGaugeWithValueAndTagSupplier used a hardcoded metric name (test-gauge) and relied on fixed Thread.sleep durations for synchronization, making them flaky.

The shared metric name caused cross-test pollution through the shared InMemorySingletonMetricsExporter, which is the likely cause of the failure in build 73291 (expected 3.0 but was 5.0). Each test now uses a randomized metric name to isolate its metrics.
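The isolation argument above can be illustrated with a minimal, self-contained sketch. It does not use the real InMemorySingletonMetricsExporter; the `EXPORTED` map, `publish`, `maxValue`, and `uniqueName` are hypothetical stand-ins, and the UUID suffix is just one assumed way to randomize a name. The point is only that two writers sharing one name read each other's values from a singleton store, while distinct names keep them apart.

```java
import java.util.List;
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.CopyOnWriteArrayList;

class MetricNameIsolation {
    // Hypothetical stand-in for a JVM-wide singleton exporter:
    // every test in the same JVM writes into this one map.
    static final Map<String, List<Double>> EXPORTED = new ConcurrentHashMap<>();

    // Records one exported value under the given metric name.
    static void publish(String name, double value) {
        EXPORTED.computeIfAbsent(name, k -> new CopyOnWriteArrayList<>()).add(value);
    }

    // Largest value seen for a metric name, or NaN if none was published.
    static double maxValue(String name) {
        return EXPORTED.getOrDefault(name, List.of())
            .stream()
            .mapToDouble(Double::doubleValue)
            .max()
            .orElse(Double.NaN);
    }

    // Assumed randomization scheme: a fixed prefix plus a random UUID.
    static String uniqueName(String prefix) {
        return prefix + "-" + UUID.randomUUID();
    }

    public static void main(String[] args) {
        // Two "tests" sharing a hardcoded name: the first one's expectation
        // of 3.0 is polluted by the second one's 5.0.
        publish("test-gauge", 3.0);
        publish("test-gauge", 5.0);
        System.out.println("shared name max:   " + maxValue("test-gauge"));

        // The same two "tests" with randomized names stay isolated.
        String first = uniqueName("test-gauge");
        String second = uniqueName("test-gauge");
        publish(first, 3.0);
        publish(second, 5.0);
        System.out.println("isolated name max: " + maxValue(first));
    }
}
```

With a shared name the reader sees 5.0 where it expected 3.0; with randomized names the first writer still sees 3.0.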

Replace the initial Thread.sleep with assertBusy polling to wait for gauge values to be published. For the post-close assertion, use assertBusy to retry a check that snapshots the callback counter, waits longer than the publish interval, and verifies it has not changed. This handles the case where an in-flight collection that started before close() is still draining.
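The two waiting patterns described above can be sketched without the OpenSearch test framework. In this sketch `pollUntilPasses` is a hypothetical stand-in for assertBusy, a ScheduledExecutorService task stands in for the periodic gauge callback, cancelling that task stands in for close(), and the 50 ms publish interval is an assumed value. It is an illustration of the approach, not the code in this PR.

```java
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

class GaugePollingSketch {
    static final long PUBLISH_INTERVAL_MS = 50; // assumed publish interval

    static void sleep(long millis) {
        try {
            Thread.sleep(millis);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new AssertionError("interrupted while waiting", e);
        }
    }

    // Hypothetical stand-in for assertBusy: retries the check with a growing
    // delay until it stops throwing AssertionError or the timeout elapses.
    static void pollUntilPasses(Runnable check, long timeoutMs) {
        long deadline = System.nanoTime() + TimeUnit.MILLISECONDS.toNanos(timeoutMs);
        long delayMs = 1;
        while (true) {
            try {
                check.run();
                return;
            } catch (AssertionError e) {
                if (System.nanoTime() >= deadline) {
                    throw e;
                }
                sleep(delayMs);
                delayMs = Math.min(delayMs * 2, 100);
            }
        }
    }

    // Snapshot the counter, wait longer than one publish interval, and fail
    // if the counter moved (a callback ran during the window).
    static void assertQuiescent(AtomicInteger counter) {
        int before = counter.get();
        sleep(PUBLISH_INTERVAL_MS * 3);
        if (counter.get() != before) {
            throw new AssertionError("callback still running after close");
        }
    }

    // Returns how many callbacks ran after the quiescence check passed.
    static int run() {
        AtomicInteger callbacks = new AtomicInteger();
        ScheduledExecutorService scheduler = Executors.newSingleThreadScheduledExecutor();
        try {
            ScheduledFuture<?> gauge = scheduler.scheduleAtFixedRate(
                callbacks::incrementAndGet, 0, PUBLISH_INTERVAL_MS, TimeUnit.MILLISECONDS);

            // Instead of a fixed sleep, poll until at least two values were published.
            pollUntilPasses(() -> {
                if (callbacks.get() < 2) {
                    throw new AssertionError("gauge not published yet");
                }
            }, 5_000);

            gauge.cancel(false); // stand-in for close(); an in-flight run may still finish

            // Retry the quiescence check so a draining in-flight run does not fail the test.
            pollUntilPasses(() -> assertQuiescent(callbacks), 5_000);

            int settled = callbacks.get();
            sleep(PUBLISH_INTERVAL_MS * 3);
            return callbacks.get() - settled;
        } finally {
            scheduler.shutdownNow();
        }
    }

    public static void main(String[] args) {
        System.out.println("callbacks after close settled: " + run());
    }
}
```

Retrying the quiescence check, rather than asserting once right after close(), is what tolerates a collection that was already in flight: the first attempt may observe one last increment, and a later attempt then sees a stable counter.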

Resolves #19422

@andrross andrross requested a review from a team as a code owner March 27, 2026 18:13
@github-actions github-actions bot added the >test-failure (Test failure from CI, local build, etc.), autocut, and flaky-test (Random test failure that succeeds on second run) labels Mar 27, 2026
@github-actions
Contributor

Failed to generate code suggestions for PR

@github-actions
Contributor

❌ Gradle check result for bcd28b9: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@github-actions
Contributor

Failed to generate code suggestions for PR

@github-actions
Contributor

❌ Gradle check result for 9cf800a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Both testGauge and testGaugeWithValueAndTagSupplier used a hardcoded
metric name ("test-gauge") and relied on fixed Thread.sleep durations
for synchronization, making them flaky.

The shared metric name caused cross-test pollution through the shared
InMemorySingletonMetricsExporter, which is the likely cause of the
failure in build 73291 (expected 3.0 but was 5.0). Each test now uses
a randomized metric name to isolate its metrics.

Replace the initial Thread.sleep with assertBusy polling to wait for
gauge values to be published. For the post-close assertion, use
assertBusy to retry a check that snapshots the callback counter,
waits longer than the publish interval, and verifies it has not
changed. This handles the case where an in-flight collection that
started before close() is still draining.

Resolves opensearch-project#19422

Signed-off-by: Andrew Ross <andrross@amazon.com>
@andrross andrross force-pushed the fix/flaky-gauge-tests branch from 9cf800a to bba7756 March 30, 2026 20:09
@github-actions
Contributor

Failed to generate code suggestions for PR

@github-actions
Contributor

❌ Gradle check result for bba7756: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Labels

autocut, flaky-test (Random test failure that succeeds on second run), skip-changelog, >test-failure (Test failure from CI, local build, etc.)

Development

Successfully merging this pull request may close these issues.

[AUTOCUT] Gradle Check Flaky Test Report for TelemetryMetricsEnabledSanityIT

1 participant