Fix Processing Partition picked up without active workers #25989

Merged
mohityadav766 merged 10 commits into main from fix-indexing on Feb 20, 2026

mohityadav766 (Member) commented Feb 19, 2026

Describe your changes:

Fix indexing getting stuck because entities stay in the processing state.

I worked on ... because ...

Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

  • Fixed processing partitions stuck in distributed indexing: Added active partition tracking and exception handling to prevent partitions from getting permanently stuck in PROCESSING state when workers fail
  • Database reconciliation for missed entity completions: Implemented EntityCompletionTracker.reconcileFromDatabase() to catch entity promotions missed by in-memory tracking (from participant servers or stale partition reclamation)
  • Improved stale partition detection: Updated heartbeat updates to only refresh for actively processing partitions, enabling proper detection and reclamation of abandoned partitions
  • Fixed promotion logic race conditions: Changed promotion state management to mark entities before attempting index promotion, with automatic rollback if staged index lookup fails
  • Orphaned index cleanup safety: Added 30-minute age check to prevent premature deletion of recently created rebuild indices; improved distributed job recovery with lock acquisition
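
The heartbeat change described in the third bullet can be pictured with a minimal sketch. This is not the PR's code: the class, field, and method names are hypothetical, and an in-memory map stands in for the heartbeat timestamp stored with each PROCESSING partition. It only illustrates why refreshing heartbeats for actively processed partitions lets an abandoned partition age out and become reclaimable.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical sketch; none of these names are taken from the PR's code.
class PartitionHeartbeatSketch {
  // Partitions that a worker on this server is processing right now.
  private final Set<String> activePartitions = ConcurrentHashMap.newKeySet();
  // Stand-in for the heartbeat timestamp persisted with each PROCESSING partition.
  private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();

  void startProcessing(String partitionId, Instant now) {
    activePartitions.add(partitionId);
    lastHeartbeat.put(partitionId, now);
  }

  // Normal completion: the partition leaves the PROCESSING state entirely.
  void complete(String partitionId) {
    activePartitions.remove(partitionId);
    lastHeartbeat.remove(partitionId);
  }

  // Worker died without updating the partition: only the in-memory
  // registration disappears, the stored heartbeat stays as it was.
  void abandon(String partitionId) {
    activePartitions.remove(partitionId);
  }

  // Refresh heartbeats only for partitions that still have a live worker.
  void refreshHeartbeats(Instant now) {
    for (String id : activePartitions) {
      lastHeartbeat.put(id, now);
    }
  }

  // Partitions whose heartbeat is older than the timeout can be reclaimed.
  List<String> findStalePartitions(Instant now, Duration timeout) {
    List<String> stale = new ArrayList<>();
    for (Map.Entry<String, Instant> e : lastHeartbeat.entrySet()) {
      if (Duration.between(e.getValue(), now).compareTo(timeout) > 0) {
        stale.add(e.getKey());
      }
    }
    return stale;
  }
}
```

If the refresh loop instead touched every PROCESSING partition owned by the server, the abandoned partition's heartbeat would never age and it would never be reported as stale.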

pmbrull previously approved these changes Feb 19, 2026

gitar-bot bot commented Feb 20, 2026

🔍 CI failure analysis for 51ea34b: Multiple Playwright shards failing with E2E test flakiness (2 total failures across 1,298 tests) - UI timing assertions in Metric/Tag/Entity pages, unrelated to backend changes.

Issue

Multiple playwright-ci-postgresql jobs failed with E2E test flakiness:

Shard 5/6 (job 64252594684): 1 failure out of 667 tests

  • Entity Metric Tag test: expect(locator).toContainText(expected) failed
  • Entity DataConsumer test: expect(locator).toContainText(expected) failed
  • Entity DataSteward test: expect(locator).toContainText(expected) failed
  • Multiple test artifacts generated

Shard 6/6 (job 64252594683): 1 failure out of 632 tests

  • Glossary/Tag test: Multiple toHaveText and toBeVisible assertion timeouts (5 seconds)

Combined: 2 failures out of 1,298 total tests (0.15% failure rate)

Root Cause

These are frontend E2E test flakiness issues with common patterns:

  • Text content assertions timing out (toContainText, toHaveText)
  • Element visibility assertions failing
  • 5-second timeout threshold exceeded
  • Classic race conditions in UI rendering

Playwright E2E tests are inherently prone to flakiness due to:

  • Browser rendering timing variations
  • Asynchronous UI state updates
  • Test environment resource contention
  • Network request delays

Details

The PR modifies Java backend code for distributed search indexing:

  • DistributedSearchIndexExecutor.java
  • OpenSearchAggregationManager.java
  • OsUtils.java
  • Other backend search indexing classes

The failures are in frontend Playwright tests for Metric, Tag, Entity, Glossary UI pages. These tests validate the React frontend application behavior, not backend search indexing logic.

Overall Context

  • integration-tests-mysql-elasticsearch: Elasticsearch cluster shard failures
  • maven-collate-ci: External workflow timeout (3rd occurrence)
  • playwright-ci: Frontend E2E flakiness (2 failures across 1,298 tests)

Code Review ✅ Approved 4 resolved / 4 findings

Well-structured fix for distributed search index deadlock. The active partition tracking, DB reconciliation, volatile sink pattern, and orphaned index age check are all sound. Previous duplicate getJobWithFreshStats() call is resolved; the unreachable failPartition() finding is no longer applicable after the code restructuring.
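
As an illustration of the orphaned index age check, here is a minimal sketch. The class, method, and constant names are hypothetical; only the 30-minute threshold comes from the PR summary. The actual cleanup code may obtain the creation time and decide orphan status differently.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; names are illustrative, not taken from the PR.
class OrphanedIndexAgeCheck {
  // The PR summary mentions a 30-minute age check.
  static final Duration MIN_AGE = Duration.ofMinutes(30);

  // An unreferenced rebuild index is deleted only once it is at least MIN_AGE old,
  // so an index created moments ago by a running job is left alone.
  static boolean isSafeToDelete(Instant createdAt, Instant now) {
    return Duration.between(createdAt, now).compareTo(MIN_AGE) >= 0;
  }
}
```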

✅ 4 resolved
Performance: Duplicate getJobWithFreshStats() call in handleCompletion

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexApp.java:891
📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexApp.java:906
The handleCompletion method calls distributedExecutor.getJobWithFreshStats() twice: once at lines 891-892 (the newly fixed null-safe call) and again at line 906 (outside the diff, in the existing code). The second call at line 906 lacks the same null-safety treatment and, more importantly, both calls likely hit the database. The distributedJob variable from the first call should be reused, or at minimum the second call should be null-checked consistently.
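
The suggested shape (fetch once, null-check once, reuse the result) might look roughly like the sketch below. JobStats and the Supplier are hypothetical stand-ins for the real job type and for distributedExecutor.getJobWithFreshStats(); the returned messages are invented for illustration.

```java
import java.util.function.Supplier;

// Hypothetical sketch; JobStats and the supplier stand in for the real job type
// and for distributedExecutor.getJobWithFreshStats().
class HandleCompletionSketch {
  static final class JobStats {
    final long successRecords;
    final long failedRecords;

    JobStats(long successRecords, long failedRecords) {
      this.successRecords = successRecords;
      this.failedRecords = failedRecords;
    }
  }

  private final Supplier<JobStats> freshStatsLookup;

  HandleCompletionSketch(Supplier<JobStats> freshStatsLookup) {
    this.freshStatsLookup = freshStatsLookup;
  }

  String handleCompletion() {
    // One lookup, assumed to hit the database; the result is reused below.
    JobStats job = freshStatsLookup.get();
    if (job == null) {
      return "no distributed job found";
    }
    String outcome = job.failedRecords > 0 ? "completed with failures" : "completed";
    return outcome + ": " + job.successRecords + " ok, " + job.failedRecords + " failed";
  }
}
```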

Quality: Dead code: old single-arg promoteIfReady is never called

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/EntityCompletionTracker.java:111
The old promoteIfReady(String entityType) method (lines 111-131) is now dead code. Both call sites at lines 107 and 198 use the new two-argument promoteIfReady(String entityType, boolean hasFailed) method introduced at line 203. The old method should be removed to avoid confusion. Its logic also differs subtly (it independently reads the failed count from the map, rather than accepting a pre-computed hasFailed flag), which could cause issues if someone accidentally calls the wrong overload in the future.
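
For context, the PR summary says promotion now marks an entity before attempting index promotion and rolls back if the staged index lookup fails. A minimal sketch of that pattern with a single two-argument entry point follows. Only the promoteIfReady(String, boolean) signature comes from the review; the set, the lookup function, and the choice to skip promotion when hasFailed is true are assumptions made for illustration.

```java
import java.util.Optional;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical sketch; only the promoteIfReady(String, boolean) signature
// is taken from the review text.
class PromotionSketch {
  private final Set<String> promoted = ConcurrentHashMap.newKeySet();
  // Stand-in for looking up the staged (rebuilt) index of an entity type.
  private final Function<String, Optional<String>> stagedIndexLookup;

  PromotionSketch(Function<String, Optional<String>> stagedIndexLookup) {
    this.stagedIndexLookup = stagedIndexLookup;
  }

  boolean promoteIfReady(String entityType, boolean hasFailed) {
    // Assumption for this sketch: an entity with failures is not promoted.
    if (hasFailed) {
      return false;
    }
    // Mark first: add() returns false if the entity was already claimed,
    // so two concurrent callers cannot both promote it.
    if (!promoted.add(entityType)) {
      return false;
    }
    Optional<String> staged = stagedIndexLookup.apply(entityType);
    if (!staged.isPresent()) {
      // Roll back the mark so a later attempt can retry.
      promoted.remove(entityType);
      return false;
    }
    // The real code would switch the live index to staged.get() here.
    return true;
  }
}
```

Keeping one overload that receives a pre-computed hasFailed flag means there is a single place where the mark and the rollback happen.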

Bug: Double failPartition() call can double-increment retryCount

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java:644
PartitionWorker.processPartition() already catches all exceptions internally (PartitionWorker.java:261) and calls coordinator.failPartition() before returning a result — it does NOT re-throw. The outer catch block at line 644 can only be reached if an exception escapes from failPartition() or waitForSinkOperations() inside processPartition.

In that scenario, failPartition may have already executed successfully (the DB update committed), and the outer catch at line 653 calls it again. Since failPartition() is NOT idempotent — it unconditionally increments retryCount and can transition a FAILED partition back to PENDING — a double call will:

  • Double-increment retryCount, reducing effective retry attempts by 1
  • Potentially transition an already-FAILED partition back to PENDING if the first call exhausted retries

Suggested fix: Remove the coordinator.failPartition() call from the outer catch, since processPartition already handles failure transitions. The catch block should only log the error — it serves as a safety net for truly unexpected situations where the inner exception handling itself fails.
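
The retry-count effect can be reproduced with a small sketch. Everything here is hypothetical (the Partition class, MAX_RETRIES, the ordering inside the inner catch); it models only the double increment described above, not the FAILED to PENDING transition, and shows the suggested fix where the outer catch logs without calling failPartition() again.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; names and retry limit are illustrative only.
class FailPartitionSketch {
  enum Status { PENDING, PROCESSING, FAILED }

  static final int MAX_RETRIES = 3; // assumed limit for this sketch

  static final class Partition {
    Status status = Status.PROCESSING;
    int retryCount = 0;
  }

  final List<String> log = new ArrayList<>();

  // Not idempotent: every call consumes one retry attempt.
  static void failPartition(Partition p) {
    p.retryCount++;
    p.status = p.retryCount < MAX_RETRIES ? Status.PENDING : Status.FAILED;
  }

  // Inner handler: catches the failure, fails the partition once, does not
  // re-throw. An exception can still escape from the follow-up sink wait.
  boolean processPartition(Partition p, Runnable work, Runnable waitForSink) {
    try {
      work.run();
      return true;
    } catch (RuntimeException e) {
      failPartition(p);  // the state change is already applied here
      waitForSink.run(); // may itself throw and reach the caller
      return false;
    }
  }

  // Outer safety net following the suggested fix: log only.
  void runWorker(Partition p, Runnable work, Runnable waitForSink) {
    try {
      processPartition(p, work, waitForSink);
    } catch (RuntimeException e) {
      log.add("unexpected error: " + e.getMessage());
    }
  }
}
```

Had the outer catch called failPartition() as well, the same failure would have raised retryCount to 2 and cost the partition an extra retry attempt.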

  • ✅ Unreachable failPartition() in DistributedJobParticipant catch

mohityadav766 merged commit e1b1f4b into main on Feb 20, 2026
30 of 35 checks passed
mohityadav766 deleted the fix-indexing branch February 20, 2026 07:59

github-actions (Contributor)

Failed to cherry-pick changes to the 1.11.11 branch.
Please cherry-pick the changes manually.
You can find more details here.

mohityadav766 added a commit that referenced this pull request Feb 20, 2026
* Fix Processing Partition picked up without active workers

* For Active Fixes

* fix: remove dead code and duplicate failPartition call in search index

Co-authored-by: mohityadav766 <mohityadav766@users.noreply.github.com>

* Fix Processing Partition picked up without active workers (#26001)

* Initial plan

* Apply spotless formatting to fix Java checkstyle failures

Co-authored-by: mohityadav766 <105265192+mohityadav766@users.noreply.github.com>

---------

Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
Co-authored-by: mohityadav766 <105265192+mohityadav766@users.noreply.github.com>

* Fix OpenSerach toJsonstring

* Apply Review Comments

---------

Co-authored-by: Gitar <noreply@gitar.ai>
Co-authored-by: mohityadav766 <mohityadav766@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
(cherry picked from commit e1b1f4b)

Labels

  • backend
  • safe to test: Add this label to run secure Github workflows on PRs
  • To release: Will cherry-pick this PR into the release branch

4 participants