
feat: Parallelize DataAssetsWorkflow with virtual threads (#25808) #25817

Open

manerow wants to merge 5 commits into main from feat/parallel-data-assets-workflow-25808

Conversation

@manerow (Contributor) commented Feb 11, 2026

Fixes #25808

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads, cutting wall-clock time by ~2.6x on a dataset of 8,292 entities.

I worked on improving the performance of the Data Insights pipeline because the DataAssetsWorkflow was executing sequentially and spending significant time in blocking database calls during entity enrichment. Since enrichment is heavily I/O-bound (multiple DB round-trips per entity), virtual threads allow efficient concurrency without exhausting platform threads or the DB connection pool.


What changed

  • DataAssetsWorkflow now processes entities concurrently using a virtual-thread-per-task executor with a semaphore-based concurrency budget (see the sketch after this list):

    Math.max(4, Math.min(cores * 2, poolSize / 2))
    
    • Primary signal: cores × 2
    • Hard cap: poolSize / 2
    • Minimum: 4

    This scales with machine capacity while keeping half of the DB pool free for REST/API traffic and other jobs.

  • Added enrichSingle() to DataInsightsEntityEnricherProcessor so individual entities can be enriched independently on virtual threads.

  • Enriched results are collected in a ConcurrentLinkedQueue and bulk-flushed to the search index after each batch.

  • Made updateStats() methods synchronized to ensure thread-safe stat accumulation across:

    • DataInsightsElasticSearchProcessor
    • DataInsightsOpenSearchProcessor
    • DataInsightsEntityEnricherProcessor
    • ElasticSearchIndexSink
    • OpenSearchIndexSink
  • Added graceful stop support: DataInsightsApp.stop() now propagates to the active DataAssetsWorkflow, which shuts down its executor and sets a stopped flag.
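
For reference, the shape of the new loop as a condensed sketch — not the exact diff. flushToIndex() is a placeholder for the real sink wiring, and enrichSingle() is assumed here to return the documents produced for one entity:

import java.util.*;
import java.util.concurrent.*;

class ParallelDataAssetsSketch {
  private volatile ExecutorService executor;
  private volatile boolean stopped = false;
  private int successRecords, failedRecords;

  // Stats are accumulated from many virtual threads, hence synchronized.
  synchronized void updateStats(int success, int failed) {
    successRecords += success;
    failedRecords += failed;
  }

  void processBatch(List<EntityInterface> batch, int budget) throws Exception {
    Semaphore permits = new Semaphore(budget);
    Queue<Map<String, Object>> results = new ConcurrentLinkedQueue<>();
    List<Future<?>> futures = new ArrayList<>();
    try (ExecutorService sourceExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
      this.executor = sourceExecutor;
      for (EntityInterface entity : batch) {
        if (stopped) break;
        permits.acquire(); // concurrency budget: at most `budget` in-flight enrichments
        futures.add(sourceExecutor.submit(() -> {
          try {
            results.addAll(enrichSingle(entity)); // 1:N daily snapshot documents
          } finally {
            permits.release();
          }
        }));
      }
      for (Future<?> f : futures) f.get(); // surface enrichment failures
    } finally {
      this.executor = null; // nulled in finally, per the review note below
    }
    flushToIndex(results); // bulk-flush after each batch
  }

  public void stop() {
    stopped = true;
    ExecutorService current = executor;
    if (current != null) current.shutdownNow();
  }

  // Placeholders standing in for the real enricher/sink calls.
  List<Map<String, Object>> enrichSingle(EntityInterface entity) { return List.of(); }
  void flushToIndex(Queue<Map<String, Object>> docs) {}
}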


Why virtual threads instead of reusing SearchIndexApp’s producer-consumer model

SearchIndexExecutor is optimized for:

Read → Index (1 entity → 1 document)

Its bottleneck is Elasticsearch I/O, and it uses platform thread pools, blocking queues, adaptive batching, and async bulk sinks.

DataAssetsWorkflow differs:

  1. I/O-bound enrichment per entity
    Each entity performs 3–5+ blocking DB calls (version history + owner/team resolution).

  2. 1:N data amplification
    One entity can produce 30+ daily snapshot documents, making fixed queue sizing awkward.

  3. 4-stage pipeline

    Read → Enrich → Process → Sink
    The bottleneck is enrichment (middle stage), not read or sink.

  4. Less complexity
    Virtual threads + semaphore add ~100 LOC with no queue tuning, no adaptive batching, and no new configuration surface.


Concurrency Budget Design (Brief Rationale)

The budget is intentionally based on CPU cores, not just DB pool size.

Formula:

Math.max(4, Math.min(cores * 2, poolSize / 2))

Why cores × 2 is the primary driver:

  • On MySQL, virtual threads pin to carrier OS threads during blocking JDBC I/O.
  • Effective parallelism is therefore bounded by available carrier threads (≈ CPU cores), not by the number of DB connections.
  • Increasing permits beyond cores × 2 does not increase real throughput.

Why poolSize / 2 is a cap, not the signal:

  • JDBI onDemand acquires/releases connections per call.

  • Connections are typically held for only 1–5ms.

  • Pool exhaustion is not the limiting factor in practice.

  • poolSize / 2 acts as a safety belt, reserving capacity for:

    • REST API traffic
    • Other background jobs

Example budgets:

  • 4 cores → 8 threads
  • 8 cores → 16 threads
  • 16 cores → 32 threads
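
Spelled out with a hypothetical helper (the 100-connection pool is illustrative; it keeps the poolSize / 2 cap from binding):

// Illustrative only: evaluates the budget formula for a few machine sizes.
static int computeBudget(int cores, int poolSize) {
  return Math.max(4, Math.min(cores * 2, poolSize / 2));
}

// computeBudget(4, 100)  -> max(4, min(8, 50))  = 8
// computeBudget(8, 100)  -> max(4, min(16, 50)) = 16
// computeBudget(16, 100) -> max(4, min(32, 50)) = 32
// computeBudget(1, 100)  -> max(4, min(2, 50))  = 4   (floor kicks in)
// computeBudget(16, 40)  -> max(4, min(32, 20)) = 20  (cap kicks in)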

Benchmark confirmation:

  • 75 virtual threads → ~39s
  • 16 virtual threads (cores × 2) → ~36s

Equivalent performance confirms that carrier thread pinning (CPU-bound parallelism), not pool size, is the true concurrency limit.


Performance Results

Dataset: 8,292 entities (load-test-data.sh --quick)
Environment: Clean Docker, identical dataset and config.

| Metric | main (sequential) | feature (parallel) |
| --- | --- | --- |
| DataAssetsWorkflow duration | ~94 seconds | ~36 seconds |
| DI documents indexed | 8,368 | 8,368 |
| Job status | success (0 failed) | success (0 failed) |
| Concurrency budget | N/A | 16 virtual threads (cores × 2) |
| Speedup | baseline | ~2.6x faster |

Both runs produced identical results with zero failures.


How did you test your changes?

  • Full A/B test in a clean Docker environment.

  • Ran both main and feature branch on the same dataset (8,292 entities).

  • Triggered Data Insights pipeline.

  • Compared:

    • Log timestamps
    • DI document counts
    • Job stats
  • Verified identical indexed document counts and zero failures.

  • Verified that the updated concurrency budget (16 threads, down from 75) produces identical results and equivalent performance, confirming that carrier thread pinning on MySQL was the actual concurrency limit.


Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the [CONTRIBUTING](https://docs.open-metadata.org/developers/contribute) document.
  • My PR title is Fixes #25808: Parallelize DataAssetsWorkflow using Java 21 virtual threads
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

@manerow manerow self-assigned this Feb 11, 2026
@manerow manerow requested a review from a team as a code owner February 11, 2026 12:25
@manerow manerow added the safe to test, To release, and backend labels Feb 11, 2026
@manerow manerow force-pushed the feat/parallel-data-assets-workflow-25808 branch from f491a98 to 5778830 on February 11, 2026 13:44
@TeddyCr TeddyCr removed the To release label Feb 11, 2026
@TeddyCr TeddyCr requested a review from Copilot February 11, 2026 15:41
TeddyCr previously approved these changes Feb 11, 2026
Copilot AI left a comment

Pull request overview

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads to improve performance. The workflow processes 8,292 entities with a ~2.6x speedup (from ~94 seconds to ~36 seconds) by converting sequential entity enrichment into concurrent processing with semaphore-based concurrency control.

Changes:

  • Introduced parallel entity processing using Executors.newVirtualThreadPerTaskExecutor() with a concurrency budget calculated as Math.max(4, Math.min(cores * 2, poolSize / 2)) to balance CPU parallelism with database connection pool capacity
  • Added enrichSingle() method to DataInsightsEntityEnricherProcessor for independent single-entity enrichment in parallel contexts
  • Made updateStats() methods synchronized across sink and processor classes to ensure thread-safe statistics accumulation during concurrent processing
  • Implemented graceful shutdown support with stop() methods that propagate stop signals to active workflows and shut down executors

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| DataAssetsWorkflow.java | Core parallelization logic with virtual thread executor, semaphore-based concurrency control, ConcurrentLinkedQueue for bulk operations, and graceful shutdown support |
| DataInsightsEntityEnricherProcessor.java | New enrichSingle() method for per-entity enrichment without batch error wrapping, and synchronized updateStats() for thread safety |
| DataInsightsElasticSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsOpenSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| ElasticSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| OpenSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsApp.java | Override stop() method to propagate shutdown signals to active DataAssetsWorkflow instance |

@github-actions (Contributor) commented

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@harshach (Collaborator) commented

@manerow are you looking at the recent changes to search indexing? It uses Quartz to distribute across OM servers too, which would give you more leverage in making indexing truly distributed.
Secondly, not every indexing run should fully delete and re-index; we should be able to specify the past few days and only index the data from there.

gitar-bot commented Feb 18, 2026


Code Review: ✅ Approved — 2 findings, 2 resolved

Well-structured parallelization using virtual threads with proper concurrency control. Both previous findings addressed: executor race condition fixed in finally block, Future.get() timeout intentionally omitted per author's design rationale (stop/shutdownNow provides recovery). No new issues found.

✅ 2 resolved
Bug: Executor null-ed before try-with-resources close

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/insights/workflows/dataAssets/DataAssetsWorkflow.java:322
At line 322, this.executor = null is set inside the try-with-resources block but before the implicit close() call on sourceExecutor. This creates a small window where stop() cannot reach the executor via shutdownNow() because the field is already null, but the executor hasn't actually been closed yet.

While the stopped flag provides a secondary check, the canonical pattern would be to let the try-with-resources handle cleanup and null the field in a finally block or after the try-with-resources block:

try (ExecutorService sourceExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
    this.executor = sourceExecutor;
    // ... processing loop ...
} finally {
    this.executor = null;
}

This also ensures the field is nulled even if close() throws (though unlikely for virtual thread executors).

  • Future.get() without timeout risks indefinite hang

@manerow (Contributor, Author) commented Feb 18, 2026

@harshach Thanks for the pointers.

Distributed indexing with Quartz: I've looked at the DistributedSearchIndexExecutor and the partition-based coordination model. The reason I didn't reuse it here is that the two pipelines work differently. In search reindexing, each entity produces one document and the bottleneck is ES/OS bulk I/O; partitioning by offset ranges maps cleanly, and distributing across servers helps because the sink is the constraint. In the Data Assets workflow, the bottleneck is entity enrichment (3-5 DB round-trips per entity, fanning out into 30+ daily snapshots each), not the read or the sink, so parallelizing that I/O-bound enrichment with virtual threads on a single server is what gives us the speedup here.

Virtual threads with a semaphore parallelize that enrichment within a single server for ~100 lines of code and no new config. This isn't a replacement for distributed processing: distribution would decide which entities each server handles, while virtual threads speed up the work within each node. The two layers are complementary, and adapting the Quartz coordination to split entity types across OM instances would be a natural follow-up to this PR.

Incremental indexing: Agreed, already tracked in #25809. The plan is to filter entities by updatedAt > lastSuccessfulRun and switch report data to upsert instead of delete-reinsert. For a 100K-entity deployment with ~1% daily change, that drops processed entities from 100K to ~1K per run. Also complementary to this PR.
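
Roughly, the shape of that change (a hypothetical sketch — the helper names are placeholders, not existing OpenMetadata APIs):

// Hypothetical sketch of the #25809 plan; helper names are placeholders.
void runIncremental() {
  long lastRun = loadLastSuccessfulRunTimestamp();                    // persisted job state
  List<EntityInterface> changed = readEntitiesUpdatedAfter(lastRun);  // updatedAt > lastRun
  for (EntityInterface entity : changed) {
    upsertReportData(enrich(entity));                                 // upsert, not delete + reinsert
  }
}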


@harshach (Collaborator) commented

@manerow if you are planning on doing the distributed job in another PR, that works for me. Even with virtual threads, if it's doing a long enough lookback of days, we will lock those tables for a while. Here it's better to distribute based on the number of days the user wants to reindex from.

@manerow (Contributor, Author) commented Feb 18, 2026

@harshach Sounds good. I'll create a task for the distributed approach with date-range partitioning for backfills and tackle it in a separate PR.


Labels

backend, safe to test

Development

Successfully merging this pull request may close these issues.

Data Insight: Migrate reindexing to distributed search indexing