Skip to content

Fix: Flaky Workflow Test cross polluted by SystemResource test#25789

Merged
yan-3005 merged 1 commit intomainfrom
ram/fix-workflows-flaky-test
Feb 10, 2026
Merged

Fix: Flaky Workflow Test cross polluted by SystemResource test#25789
yan-3005 merged 1 commit intomainfrom
ram/fix-workflows-flaky-test

Conversation

@yan-3005
Copy link
Contributor

@yan-3005 yan-3005 commented Feb 10, 2026

Describe your changes:

Fixes

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Fixed cross-test pollution:
    • test_assetCertificationSettings in SystemResourceIT.java now resets AssetCertificationSettings to defaults after test completion
  • Defensive setup:
    • setupCertificationTags_SDK in WorkflowDefinitionResourceIT.java ensures Certification.Gold tag exists and resets settings before workflow tests
  • Test isolation improvement:
    • Prevents flaky failures caused by shared state modifications between test classes

This will update automatically on new commits.


@gitar-bot
Copy link

gitar-bot bot commented Feb 10, 2026

🔍 CI failure analysis for 0c617ff: CI failures span backend integration tests (search indexing timeouts) and frontend Playwright tests (UI timing issues). All failures are unrelated to this PR's AssetCertificationSettings cleanup changes.

Issue

The CI failures across multiple test suites are pre-existing flaky tests unrelated to this PR's changes. This PR only modifies test cleanup logic for AssetCertificationSettings in SystemResourceIT and WorkflowDefinitionResourceIT, while failures occur in completely different areas.

Root Cause

Two Categories of Unrelated Failures:

1. Backend Integration Tests: Search Indexing Infrastructure Issues

All failing backend tests involve Elasticsearch/OpenSearch search indexing operations experiencing timing issues and race conditions:

  • AppsResourceIT.test_triggerApp_200 (Lines 382-384): SearchIndexingApplication conflict - "Job is already running" after 2-minute timeout
  • ColumnGridResourceIT.test_getColumnGrid_withServiceFilter (Line 244): Created table columns not appearing in search results
  • ColumnGridResourceIT.test_getColumnGrid_withGlossaryTermsFilter (Line 1037): Glossary-filtered columns missing from search results
  • APICollectionResourceIT.updateDescriptionAndCheckInSearch (Line 4739): Entity not appearing in OpenSearch index within 60s timeout
  • ClassificationResourceIT.updateDescriptionAndCheckInSearch (BaseEntityIT:4739): Entity not appearing in search index within 60s timeout
  • SearchResourceIT.testEntityTypeCountsWithEmptyResults (Line 794): OpenSearch client error - "Missing required property 'ShardFailure.primary'"
  • DataProductDomainMigrationIT tests (Lines 302, 1072): Assets not appearing in domain endpoint - expected 2, found 0

2. Frontend E2E Tests: Playwright Browser/Timing Issues

Playwright tests failing due to UI timing and browser stability issues:

Hard Failures:

  • Roles.spec.ts:512: Strict mode violation - 2 loader elements visible when expecting 0
  • CustomizeWidgets.spec.ts: Test timeout - browser/page closed unexpectedly while waiting for KPI API responses

Flaky Tests (8 passed on retry):

  • HyperlinkCustomProperty, Tag, Users, AutoPilot, ColumnBulkOperations, SysadminUsers specs
  • All exhibit element timing issues, navigation timeouts, or UI state inconsistencies

Details

Infrastructure Problems:

Elasticsearch/OpenSearch Shard Failures: Recurring search_phase_execution_exception: all shards failed errors throughout test runs at: 12:03, 12:05, 12:06, 12:07, 12:14

OpenSearch Mapping Exception:

mapper_parsing_exception: failed to parse field [lifeCycle] of type [keyword]
in document with id '2c1af314-c907-4f35-a902-9d052a0ecd3c'

Search index schema mismatch - expects keyword string but receives object.

Thread Interruption: Event processing threads forcefully interrupted during shutdown (LMAX Disruptor cleanup issues)

Browser Instability: Playwright tests experiencing unexpected page closures and element rendering delays in CI environment

Why Unrelated to This PR:

  • This PR modifies: Only test cleanup in SystemResourceIT.java and WorkflowDefinitionResourceIT.java for AssetCertificationSettings
  • Failing tests involve: Search indexing (AppsResource, ColumnGrid, Classification, APICollection, SearchResource, DataProductDomainMigration) and UI components (Roles, Widgets, Users, Tags, AutoPilot)
  • No functional overlap: PR adds defensive test cleanup code; does not touch search infrastructure, UI components, or any failing test areas
  • High pass rates: Backend ~99.97% (10,619/10,622 tests), Frontend ~93-96% (981/1032 tests) - indicates environmental flakiness, not systematic bugs
  • Consistent pattern across environments: Same failures in both MySQL+Elasticsearch and PostgreSQL+OpenSearch configurations
Code Review 👍 Approved with suggestions 0 resolved / 1 findings

Good fix for cross-test pollution with a two-pronged approach (source cleanup + defensive initialization). One minor suggestion to use try/finally for more robust cleanup.

💡 Edge Case: Reset in SystemResourceIT skipped if assertions fail

📄 openmetadata-integration-tests/src/test/java/org/openmetadata/it/tests/SystemResourceIT.java:1031

The cleanup code that resets AssetCertificationSettings back to defaults (lines 1031-1044) runs after the assertions on lines 1028-1029. If either assertEquals fails (e.g., due to a regression in the settings update API), the reset code is never reached, and the cross-test pollution persists.

A more robust pattern would wrap the test logic in a try/finally block:

try {
    // ... update settings and assert ...
    assertEquals("NewCertification", updatedCertificationConfig.getAllowedClassification());
    assertEquals("P60D", updatedCertificationConfig.getValidityPeriod());
} finally {
    // Reset to original values to avoid cross-test pollution
    certificationConfig.setAllowedClassification("Certification");
    certificationConfig.setValidityPeriod("P30D");
    // ... PUT reset settings ...
}

This is minor because the defensive setup in WorkflowDefinitionResourceIT.setupCertificationTags_SDK() already guards against this scenario from the consumer side. However, the try/finally pattern would make the source-side cleanup more robust and protect any other tests that might depend on these settings in the future.

Tip

Comment Gitar fix CI or enable auto-apply: gitar auto-apply:on

Options

Auto-apply is off → Gitar will not commit updates to this branch.
Display: compact → Showing less information.

Comment with these commands to change:

Auto-apply Compact
gitar auto-apply:on         
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

@yan-3005 yan-3005 merged commit d8265b7 into main Feb 10, 2026
31 of 51 checks passed
@yan-3005 yan-3005 deleted the ram/fix-workflows-flaky-test branch February 10, 2026 15:36
@github-actions
Copy link
Contributor

Failed to cherry-pick changes to the 1.11.9 branch.
Please cherry-pick the changes manually.
You can find more details here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs To release Will cherry-pick this PR into the release branch

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants