Fix Processing Partition picked up without active workers by mohityadav766 · Pull Request #25989 · open-metadata/OpenMetadata

mohityadav766 · 2026-02-19T11:55:08Z

Describe your changes:

Fix indexing getting stuck because entities remain in the processing state.

I worked on ... because ...

Type of change:

Checklist:

I have read the CONTRIBUTING document.
My PR title is Fixes <issue-number>: <short explanation>
I have commented on my code, particularly in hard-to-understand areas.
For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

Summary by Gitar

Fixed processing partitions stuck in distributed indexing: Added active partition tracking and exception handling to prevent partitions from getting permanently stuck in PROCESSING state when workers fail
Database reconciliation for missed entity completions: Implemented EntityCompletionTracker.reconcileFromDatabase() to catch entity promotions missed by in-memory tracking (from participant servers or stale partition reclamation)
Improved stale partition detection: Updated heartbeat updates to only refresh for actively processing partitions, enabling proper detection and reclamation of abandoned partitions
Fixed promotion logic race conditions: Changed promotion state management to mark entities before attempting index promotion, with automatic rollback if staged index lookup fails
Orphaned index cleanup safety: Added 30-minute age check to prevent premature deletion of recently created rebuild indices; improved distributed job recovery with lock acquisition
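The active partition tracking and heartbeat behavior summarized above can be illustrated with a minimal sketch. It is not the PR's code: the class `ActivePartitionSketch`, its method names, and the in-memory map standing in for the partition table are all assumptions made for illustration. The idea shown is that the heartbeat refreshes only partitions that still have a live worker, so a partition whose worker died stops receiving heartbeats and becomes detectable as stale.

```java
import java.time.Duration;
import java.time.Instant;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;
import java.util.stream.Collectors;

// Hypothetical sketch; none of these names come from the OpenMetadata code base.
class ActivePartitionSketch {
  // Partition id -> last heartbeat; stands in for the persisted partition rows.
  private final Map<String, Instant> lastHeartbeat = new ConcurrentHashMap<>();
  // Partitions that a worker on this server is processing right now.
  private final Set<String> active = ConcurrentHashMap.newKeySet();
  private final Duration staleAfter;

  ActivePartitionSketch(Duration staleAfter) {
    this.staleAfter = staleAfter;
  }

  // A worker claims a partition and starts processing it.
  void claim(String partitionId, Instant now) {
    active.add(partitionId);
    lastHeartbeat.put(partitionId, now);
  }

  // Normal completion: the partition leaves the PROCESSING set entirely.
  void complete(String partitionId) {
    active.remove(partitionId);
    lastHeartbeat.remove(partitionId);
  }

  // Simulates a worker that died: it is no longer active, but the
  // partition row is still marked as being processed.
  void abandon(String partitionId) {
    active.remove(partitionId);
  }

  // Periodic heartbeat: refresh only partitions that still have a live worker.
  void heartbeat(Instant now) {
    for (String id : active) {
      lastHeartbeat.put(id, now);
    }
  }

  // Partitions whose last heartbeat is older than the threshold can be reclaimed.
  List<String> findStale(Instant now) {
    return lastHeartbeat.entrySet().stream()
        .filter(e -> Duration.between(e.getValue(), now).compareTo(staleAfter) > 0)
        .map(Map.Entry::getKey)
        .sorted()
        .collect(Collectors.toList());
  }

  public static void main(String[] args) {
    ActivePartitionSketch sketch = new ActivePartitionSketch(Duration.ofMinutes(5));
    Instant start = Instant.now();
    sketch.claim("p1", start);
    sketch.claim("p2", start);
    sketch.abandon("p2"); // the worker for p2 fails
    sketch.heartbeat(start.plus(Duration.ofMinutes(4)));
    System.out.println(sketch.findStale(start.plus(Duration.ofMinutes(6))));
  }
}
```

If the heartbeat refreshed every partition in PROCESSING state regardless of whether a worker was attached, the abandoned partition would never age past the threshold, which matches the stuck behavior this PR describes.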

...penmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java

gitar-bot · 2026-02-20T06:17:11Z

🔍 CI failure analysis for 51ea34b: Multiple Playwright shards failing with E2E test flakiness (2 total failures across 1,298 tests) - UI timing assertions in Metric/Tag/Entity pages, unrelated to backend changes.

Issue

Multiple playwright-ci-postgresql jobs failed with E2E test flakiness:

Shard 5/6 (job 64252594684): 1 failure out of 667 tests

Entity Metric Tag test: expect(locator).toContainText(expected) failed
Entity DataConsumer test: expect(locator).toContainText(expected) failed
Entity DataSteward test: expect(locator).toContainText(expected) failed
Multiple test artifacts generated

Shard 6/6 (job 64252594683): 1 failure out of 632 tests

Glossary/Tag test: Multiple toHaveText and toBeVisible assertion timeouts (5 seconds)

Combined: 2 failures out of 1,298 total tests (0.15% failure rate)

Root Cause

These are frontend E2E test flakiness issues with common patterns:

Text content assertions timing out (toContainText, toHaveText)
Element visibility assertions failing
5-second timeout threshold exceeded
Classic race conditions in UI rendering

Playwright E2E tests are inherently prone to flakiness due to:

Browser rendering timing variations
Asynchronous UI state updates
Test environment resource contention
Network request delays

Details

The PR modifies Java backend code for distributed search indexing:

DistributedSearchIndexExecutor.java
OpenSearchAggregationManager.java
OsUtils.java
Other backend search indexing classes

The failures are in frontend Playwright tests for Metric, Tag, Entity, Glossary UI pages. These tests validate the React frontend application behavior, not backend search indexing logic.

Overall Context

integration-tests-mysql-elasticsearch: Elasticsearch cluster shard failures
maven-collate-ci: External workflow timeout (3rd occurrence)
playwright-ci: Frontend E2E flakiness (2 failures across 1,298 tests)

Code Review ✅ Approved 4 resolved / 4 findings

Well-structured fix for distributed search index deadlock. The active partition tracking, DB reconciliation, volatile sink pattern, and orphaned index age check are all sound. Previous duplicate getJobWithFreshStats() call is resolved; the unreachable failPartition() finding is no longer applicable after the code restructuring.
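The orphaned index age check mentioned above can be sketched as a small predicate. This is an illustration only: the class name, the `isSafeToDelete` signature, and the idea of passing the creation time in directly are assumptions, since the page does not show how the PR determines an index's age. Only the 30-minute threshold comes from the summary.

```java
import java.time.Duration;
import java.time.Instant;

// Hypothetical sketch; names and signature are not taken from the PR.
class OrphanIndexAgeCheckSketch {
  // Threshold stated in the PR summary.
  static final Duration MIN_AGE = Duration.ofMinutes(30);

  // An index is a cleanup candidate only if no active job references it
  // and it has existed for at least MIN_AGE.
  static boolean isSafeToDelete(Instant createdAt, boolean referencedByActiveJob, Instant now) {
    if (referencedByActiveJob) {
      return false;
    }
    return Duration.between(createdAt, now).compareTo(MIN_AGE) >= 0;
  }

  public static void main(String[] args) {
    Instant now = Instant.now();
    // A rebuild index created ten minutes ago is left alone even if it looks orphaned.
    System.out.println(isSafeToDelete(now.minus(Duration.ofMinutes(10)), false, now));
    // An unreferenced index older than the threshold can be removed.
    System.out.println(isSafeToDelete(now.minus(Duration.ofMinutes(45)), false, now));
  }
}
```

The age guard covers the window in which a rebuild index has been created but the job that owns it is not yet visible to the cleanup pass.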

✅ 4 resolved

✅ Performance: Duplicate getJobWithFreshStats() call in handleCompletion

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexApp.java:891 📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/SearchIndexApp.java:906
The handleCompletion method calls distributedExecutor.getJobWithFreshStats() twice: once at lines 891-892 (the newly fixed null-safe call) and again at line 906 (outside the diff, in the existing code). The second call lacks the same null-safety treatment and, more importantly, both calls likely hit the database. The distributedJob variable from the first call should be reused, or at minimum the second call should be null-checked consistently.
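A minimal sketch of the suggested shape, fetching once and reusing the result behind a single null check. The `Job` class, the `Supplier` standing in for `distributedExecutor.getJobWithFreshStats()`, and the returned strings are hypothetical; the real method's return type and surrounding logic are not shown on this page.

```java
import java.util.function.Supplier;

// Hypothetical sketch; only the method name getJobWithFreshStats comes from the review.
class HandleCompletionSketch {
  // Stand-in for the distributed job object returned by the executor.
  static final class Job {
    final String id;
    final long successCount;

    Job(String id, long successCount) {
      this.id = id;
      this.successCount = successCount;
    }
  }

  // The supplier stands in for distributedExecutor.getJobWithFreshStats().
  static String handleCompletion(Supplier<Job> getJobWithFreshStats) {
    Job distributedJob = getJobWithFreshStats.get(); // single lookup
    if (distributedJob == null) {
      return "no distributed job";
    }
    // Every later step reuses distributedJob instead of fetching again.
    return distributedJob.id + ":" + distributedJob.successCount;
  }

  public static void main(String[] args) {
    System.out.println(handleCompletion(() -> new Job("job-1", 42L)));
    System.out.println(handleCompletion(() -> null));
  }
}
```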

✅ Quality: Dead code: old single-arg promoteIfReady is never called

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/EntityCompletionTracker.java:111
The old promoteIfReady(String entityType) method (lines 111-131) is now dead code. Both call sites, at lines 107 and 198, use the new two-argument promoteIfReady(String entityType, boolean hasFailed) method introduced at line 203. The old method should be removed to avoid confusion. Its logic also differs subtly (it independently reads the failed count from the map rather than accepting a pre-computed hasFailed flag), which could cause issues if someone accidentally calls the wrong overload in the future.

✅ Bug: Double failPartition() call can double-increment retryCount

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java:644
PartitionWorker.processPartition() already catches all exceptions internally (PartitionWorker.java:261) and calls coordinator.failPartition() before returning a result — it does NOT re-throw. The outer catch block at line 644 can only be reached if an exception escapes from failPartition() or waitForSinkOperations() inside processPartition.

In that scenario, failPartition may have already executed successfully (the DB update committed), and the outer catch at line 653 calls it again. Since failPartition() is NOT idempotent — it unconditionally increments retryCount and can transition a FAILED partition back to PENDING — a double call will:

Double-increment retryCount, reducing effective retry attempts by 1

Potentially transition an already-FAILED partition back to PENDING if the first call exhausted retries

Suggested fix: Remove the coordinator.failPartition() call from the outer catch, since processPartition already handles failure transitions. The catch block should only log the error — it serves as a safety net for truly unexpected situations where the inner exception handling itself fails.
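The double-increment scenario can be reproduced with a small model. Everything here is a hypothetical stand-in: the `Partition` class, the `MAX_RETRIES` value, and the simulated failures are assumptions, and only the names `failPartition` and `processPartition` echo the review. The sketch models the retryCount double increment only, not the FAILED to PENDING transition.

```java
// Hypothetical model of the finding; not the actual coordinator or worker code.
class FailPartitionSketch {
  enum Status { PENDING, PROCESSING, FAILED }

  static final class Partition {
    Status status = Status.PROCESSING;
    int retryCount = 0;
  }

  static final int MAX_RETRIES = 3; // assumed value for illustration

  // Non-idempotent: every call increments retryCount and recomputes the status.
  static void failPartition(Partition p) {
    p.retryCount++;
    p.status = p.retryCount >= MAX_RETRIES ? Status.FAILED : Status.PENDING;
  }

  // Stand-in for the worker: it handles its own failure by calling
  // failPartition once, then a later step (a simulated sink wait) throws.
  static void processPartition(Partition p) {
    try {
      throw new IllegalStateException("simulated read failure");
    } catch (RuntimeException e) {
      failPartition(p);
    }
    throw new IllegalStateException("simulated sink wait failure");
  }

  // Shape flagged by the review: the outer catch fails the partition again.
  static void runWithOuterFail(Partition p) {
    try {
      processPartition(p);
    } catch (RuntimeException e) {
      failPartition(p); // second transition for the same failure
    }
  }

  // Suggested shape: the outer catch only logs.
  static void runWithOuterLogOnly(Partition p) {
    try {
      processPartition(p);
    } catch (RuntimeException e) {
      System.err.println("Unexpected error after failure handling: " + e.getMessage());
    }
  }

  public static void main(String[] args) {
    Partition flagged = new Partition();
    runWithOuterFail(flagged);
    Partition suggested = new Partition();
    runWithOuterLogOnly(suggested);
    System.out.println(flagged.retryCount + " vs " + suggested.retryCount);
  }
}
```

In this model one logical failure consumes two retries under the flagged shape and one under the suggested shape, which is the reduction in effective retry attempts the finding describes.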

✅ Unreachable failPartition() in DistributedJobParticipant catch

sonarqubecloud · 2026-02-20T07:15:09Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

github-actions · 2026-02-20T08:00:37Z

Failed to cherry-pick changes to the 1.11.11 branch.
Please cherry-pick the changes manually.

* Fix Processing Partition picked up without active workers
* For Active Fixes
* fix: remove dead code and duplicate failPartition call in search index
  Co-authored-by: mohityadav766 <mohityadav766@users.noreply.github.com>
* Fix Processing Partition picked up without active workers (#26001)
  * Initial plan
  * Apply spotless formatting to fix Java checkstyle failures
    Co-authored-by: mohityadav766 <105265192+mohityadav766@users.noreply.github.com>
  ---------
  Co-authored-by: copilot-swe-agent[bot] <198982749+Copilot@users.noreply.github.com>
  Co-authored-by: mohityadav766 <105265192+mohityadav766@users.noreply.github.com>
* Fix OpenSerach toJsonstring
* Apply Review Comments
---------
Co-authored-by: Gitar <noreply@gitar.ai>
Co-authored-by: mohityadav766 <mohityadav766@users.noreply.github.com>
Co-authored-by: Copilot <198982749+Copilot@users.noreply.github.com>
(cherry picked from commit e1b1f4b)

Fix Processing Partition picked up without active workers

c00cbbf

mohityadav766 self-assigned this Feb 19, 2026

mohityadav766 had a problem deploying to test February 19, 2026 11:55 — with GitHub Actions Error

github-actions bot added the "backend" and "safe to test" (Add this label to run secure Github workflows on PRs) labels Feb 19, 2026

mohityadav766 added the "To release" (Will cherry-pick this PR into the release branch) label Feb 19, 2026

mohityadav766 had a problem deploying to test February 19, 2026 11:56 — with GitHub Actions Error

Merge branch 'main' into fix-indexing

7b17f73

mohityadav766 temporarily deployed to test February 19, 2026 12:00 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test February 19, 2026 12:00 — with GitHub Actions Error

mohityadav766 temporarily deployed to test February 19, 2026 12:00 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test February 19, 2026 12:00 — with GitHub Actions Error

mohityadav766 temporarily deployed to test February 19, 2026 12:00 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test February 19, 2026 12:00 — with GitHub Actions Error

pmbrull previously approved these changes Feb 19, 2026

gitar-bot bot reviewed Feb 19, 2026

...penmetadata/service/apps/bundles/searchIndex/distributed/DistributedSearchIndexExecutor.java

For Active Fixes

b877643

mohityadav766 dismissed pmbrull’s stale review via b877643 February 19, 2026 13:14

mohityadav766 temporarily deployed to test February 19, 2026 13:14 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test February 19, 2026 13:14 — with GitHub Actions Failure

Merge branch 'main' into fix-indexing

2e55e54

mohityadav766 temporarily deployed to test February 19, 2026 16:31 — with GitHub Actions Inactive

pmbrull previously approved these changes Feb 19, 2026

mohityadav766 added 2 commits February 20, 2026 11:21

Fix OpenSerach toJsonstring

52acc7a

Merge remote-tracking branch 'origin/fix-indexing' into fix-indexing

9ebe2dc

mohityadav766 dismissed pmbrull’s stale review via 9ebe2dc February 20, 2026 05:51

mohityadav766 had a problem deploying to test February 20, 2026 05:51 — with GitHub Actions Error

Apply Review Comments

51ea34b

mohityadav766 temporarily deployed to test February 20, 2026 06:12 — with GitHub Actions Inactive

mohityadav766 had a problem deploying to test February 20, 2026 06:12 — with GitHub Actions Failure

mohityadav766 temporarily deployed to test February 20, 2026 06:12 — with GitHub Actions Inactive

mohityadav766 merged commit e1b1f4b into main Feb 20, 2026
30 of 35 checks passed

mohityadav766 deleted the fix-indexing branch February 20, 2026 07:59

Fix Processing Partition picked up without active workers #25989

mohityadav766 merged 10 commits into main from fix-indexing

4 participants
