
feat: Parallelize DataAssetsWorkflow with virtual threads (#25808) #25817

Open

manerow wants to merge 5 commits into main from feat/parallel-data-assets-workflow-25808

Conversation

@manerow (Contributor) commented Feb 11, 2026

Fixes #25808

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads, cutting wall-clock time by ~2.6x on a dataset of 8,292 entities.

I worked on improving the performance of the Data Insights pipeline because the DataAssetsWorkflow was executing sequentially and spending significant time in blocking database calls during entity enrichment. Since enrichment is heavily I/O-bound (multiple DB round-trips per entity), virtual threads allow efficient concurrency without exhausting platform threads or the DB connection pool.


What changed

  • DataAssetsWorkflow now processes entities concurrently using a virtual-thread-per-task executor with a semaphore-based concurrency budget (see the sketch after this list):

    Math.max(4, Math.min(cores * 2, poolSize / 2))
    
    • Primary signal: cores × 2
    • Hard cap: poolSize / 2
    • Minimum: 4

    This scales with machine capacity while keeping half of the DB pool free for REST/API traffic and other jobs.

  • Added enrichSingle() to DataInsightsEntityEnricherProcessor so individual entities can be enriched independently on virtual threads.

  • Enriched results are collected in a ConcurrentLinkedQueue and bulk-flushed to the search index after each batch.

  • Made updateStats() methods synchronized to ensure thread-safe stat accumulation across:

    • DataInsightsElasticSearchProcessor
    • DataInsightsOpenSearchProcessor
    • DataInsightsEntityEnricherProcessor
    • ElasticSearchIndexSink
    • OpenSearchIndexSink
  • Added graceful stop support: DataInsightsApp.stop() now propagates to the active DataAssetsWorkflow, which shuts down its executor and sets a stopped flag.
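
For reference, the shape of the new loop as a condensed sketch — not the exact diff. flushToIndex() is a placeholder for the real sink wiring, and enrichSingle() is assumed here to return the documents produced for one entity:

import java.util.*;
import java.util.concurrent.*;

class ParallelDataAssetsSketch {
  private volatile ExecutorService executor;
  private volatile boolean stopped = false;
  private int successRecords, failedRecords;

  // Stats are accumulated from many virtual threads, hence synchronized.
  synchronized void updateStats(int success, int failed) {
    successRecords += success;
    failedRecords += failed;
  }

  void processBatch(List<EntityInterface> batch, int budget) throws Exception {
    Semaphore permits = new Semaphore(budget);
    Queue<Map<String, Object>> results = new ConcurrentLinkedQueue<>();
    List<Future<?>> futures = new ArrayList<>();
    try (ExecutorService sourceExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
      this.executor = sourceExecutor;
      for (EntityInterface entity : batch) {
        if (stopped) break;
        permits.acquire(); // concurrency budget: at most `budget` in-flight enrichments
        futures.add(sourceExecutor.submit(() -> {
          try {
            results.addAll(enrichSingle(entity)); // 1:N daily snapshot documents
          } finally {
            permits.release();
          }
        }));
      }
      for (Future<?> f : futures) f.get(); // surface enrichment failures
    } finally {
      this.executor = null; // nulled in finally, per the review note below
    }
    flushToIndex(results); // bulk-flush after each batch
  }

  public void stop() {
    stopped = true;
    ExecutorService current = executor;
    if (current != null) current.shutdownNow();
  }

  // Placeholders standing in for the real enricher/sink calls.
  List<Map<String, Object>> enrichSingle(EntityInterface entity) { return List.of(); }
  void flushToIndex(Queue<Map<String, Object>> docs) {}
}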


Why virtual threads instead of reusing SearchIndexApp’s producer-consumer model

SearchIndexExecutor is optimized for:

Read → Index (1 entity → 1 document)

Its bottleneck is Elasticsearch I/O, and it uses platform thread pools, blocking queues, adaptive batching, and async bulk sinks.

DataAssetsWorkflow differs:

  1. I/O-bound enrichment per entity
    Each entity performs 3–5+ blocking DB calls (version history + owner/team resolution).

  2. 1:N data amplification
    One entity can produce 30+ daily snapshot documents, making fixed queue sizing awkward.

  3. 4-stage pipeline

    Read → Enrich → Process → Sink
    The bottleneck is enrichment (middle stage), not read or sink.

  4. Less complexity
    Virtual threads + semaphore add ~100 LOC with no queue tuning, no adaptive batching, and no new configuration surface.


Concurrency Budget Design (Brief Rationale)

The budget is intentionally based on CPU cores, not just DB pool size.

Formula:

Math.max(4, Math.min(cores * 2, poolSize / 2))

Why cores × 2 is the primary driver:

  • On MySQL, virtual threads pin to carrier OS threads during blocking JDBC I/O.
  • Effective parallelism is therefore bounded by available carrier threads (≈ CPU cores), not by the number of DB connections.
  • Increasing permits beyond cores × 2 does not increase real throughput.

Why poolSize / 2 is a cap, not the signal:

  • JDBI onDemand acquires/releases connections per call.

  • Connections are typically held for only 1–5ms.

  • Pool exhaustion is not the limiting factor in practice.

  • poolSize / 2 acts as a safety belt, reserving capacity for:

    • REST API traffic
    • Other background jobs

Example budgets:

  • 4 cores → 8 threads
  • 8 cores → 16 threads
  • 16 cores → 32 threads
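
Spelled out with a hypothetical helper (the 100-connection pool is illustrative; it keeps the poolSize / 2 cap from binding):

// Illustrative only: evaluates the budget formula for a few machine sizes.
static int computeBudget(int cores, int poolSize) {
  return Math.max(4, Math.min(cores * 2, poolSize / 2));
}

// computeBudget(4, 100)  -> max(4, min(8, 50))  = 8
// computeBudget(8, 100)  -> max(4, min(16, 50)) = 16
// computeBudget(16, 100) -> max(4, min(32, 50)) = 32
// computeBudget(1, 100)  -> max(4, min(2, 50))  = 4   (floor kicks in)
// computeBudget(16, 40)  -> max(4, min(32, 20)) = 20  (cap kicks in)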

Benchmark confirmation:

  • 75 virtual threads → ~39s
  • 16 virtual threads (cores × 2) → ~36s

Equivalent performance confirms that carrier thread pinning (CPU-bound parallelism), not pool size, is the true concurrency limit.


Performance Results

Dataset: 8,292 entities (load-test-data.sh --quick)
Environment: Clean Docker, identical dataset and config.

| Metric | main (sequential) | feature (parallel) |
| --- | --- | --- |
| DataAssetsWorkflow duration | ~94 seconds | ~36 seconds |
| DI documents indexed | 8,368 | 8,368 |
| Job status | success (0 failed) | success (0 failed) |
| Concurrency budget | N/A | 16 virtual threads (cores × 2) |
| Speedup | baseline | ~2.6x faster |

Both runs produced identical results with zero failures.


How did you test your changes?

  • Full A/B test in a clean Docker environment.

  • Ran both main and feature branch on the same dataset (8,292 entities).

  • Triggered Data Insights pipeline.

  • Compared:

    • Log timestamps
    • DI document counts
    • Job stats
  • Verified identical indexed document counts and zero failures.

  • Verified that the updated concurrency budget (16 threads, down from 75) produces identical results and equivalent performance, confirming that carrier thread pinning on MySQL was the actual concurrency limit.


Type of change:

  • Bug fix
  • Improvement
  • New feature
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Documentation

Checklist:

  • I have read the [CONTRIBUTING](https://docs.open-metadata.org/developers/contribute) document.
  • My PR title is Fixes #25808: Parallelize DataAssetsWorkflow using Java 21 virtual threads
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

@manerow manerow self-assigned this Feb 11, 2026
@manerow manerow requested a review from a team as a code owner February 11, 2026 12:25
@manerow manerow added the safe to test, To release, and backend labels Feb 11, 2026
@manerow manerow force-pushed the feat/parallel-data-assets-workflow-25808 branch from f491a98 to 5778830 on February 11, 2026 13:44
@TeddyCr TeddyCr removed the To release label Feb 11, 2026
@TeddyCr TeddyCr requested a review from Copilot February 11, 2026 15:41
TeddyCr previously approved these changes Feb 11, 2026
Copilot AI left a comment

Pull request overview

This PR parallelizes the DataAssetsWorkflow in the Data Insights pipeline using Java 21 virtual threads to improve performance. The workflow processes 8,292 entities with a ~2.6x speedup (from ~94 seconds to ~36 seconds) by converting sequential entity enrichment into concurrent processing with semaphore-based concurrency control.

Changes:

  • Introduced parallel entity processing using Executors.newVirtualThreadPerTaskExecutor() with a concurrency budget calculated as Math.max(4, Math.min(cores * 2, poolSize / 2)) to balance CPU parallelism with database connection pool capacity
  • Added enrichSingle() method to DataInsightsEntityEnricherProcessor for independent single-entity enrichment in parallel contexts
  • Made updateStats() methods synchronized across sink and processor classes to ensure thread-safe statistics accumulation during concurrent processing
  • Implemented graceful shutdown support with stop() methods that propagate stop signals to active workflows and shut down executors

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 11 comments.

| File | Description |
| --- | --- |
| DataAssetsWorkflow.java | Core parallelization logic with virtual thread executor, semaphore-based concurrency control, ConcurrentLinkedQueue for bulk operations, and graceful shutdown support |
| DataInsightsEntityEnricherProcessor.java | New enrichSingle() method for per-entity enrichment without batch error wrapping, and synchronized updateStats() for thread safety |
| DataInsightsElasticSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsOpenSearchProcessor.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| ElasticSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| OpenSearchIndexSink.java | Synchronized updateStats() method to prevent race conditions in concurrent stats updates |
| DataInsightsApp.java | Override stop() method to propagate shutdown signals to active DataAssetsWorkflow instance |

@github-actions (Contributor) commented

Hi there 👋 Thanks for your contribution!

The OpenMetadata team will review the PR shortly! Once it has been labeled as safe to test, the CI workflows
will start executing and we'll be able to make sure everything is working as expected.

Let us know if you need any help!

@harshach (Collaborator) commented

@manerow are you looking at the recent changes to search indexing? It uses Quartz to distribute across OM servers too, which would give you more leverage in making indexing truly distributed.
Secondly, not every indexing run should fully delete and re-index; we should be able to specify the past few days and only index the data from there.

gitar-bot commented Feb 18, 2026


Code Review: ✅ Approved — 2 findings, 2 resolved

Well-structured parallelization using virtual threads with proper concurrency control. Both previous findings addressed: executor race condition fixed in finally block, Future.get() timeout intentionally omitted per author's design rationale (stop/shutdownNow provides recovery). No new issues found.

✅ 2 resolved
Bug: Executor null-ed before try-with-resources close

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/insights/workflows/dataAssets/DataAssetsWorkflow.java:322
At line 322, this.executor = null is set inside the try-with-resources block but before the implicit close() call on sourceExecutor. This creates a small window where stop() cannot reach the executor via shutdownNow() because the field is already null, but the executor hasn't actually been closed yet.

While the stopped flag provides a secondary check, the canonical pattern would be to let the try-with-resources handle cleanup and null the field in a finally block or after the try-with-resources block:

try (ExecutorService sourceExecutor = Executors.newVirtualThreadPerTaskExecutor()) {
    this.executor = sourceExecutor;
    // ... processing loop ...
} finally {
    this.executor = null;
}

This also ensures the field is nulled even if close() throws (though unlikely for virtual thread executors).

  • Future.get() without timeout risks indefinite hang

@manerow (Contributor, Author) commented Feb 18, 2026

@harshach Thanks for the pointers.

Distributed indexing with Quartz: I've looked at the DistributedSearchIndexExecutor and the partition-based coordination model. The reason I didn't reuse it here is that the two pipelines work differently. In search reindexing, each entity produces one document and the bottleneck is ES/OS bulk I/O; partitioning by offset ranges maps cleanly, and distributing across servers helps because the sink is the constraint. In the Data Assets workflow, the bottleneck is entity enrichment (3-5 DB round-trips per entity, fanning out into 30+ daily snapshots each), not the read or the sink, so parallelizing that I/O-bound enrichment with virtual threads on a single server is what gives us the speedup here.

Virtual threads with a semaphore parallelize that enrichment within a single server for ~100 lines of code and no new config. This isn't a replacement for distributed processing: distribution would decide which entities each server handles, while virtual threads speed up the work within each node. The two layers are complementary, and adapting the Quartz coordination to split entity types across OM instances would be a natural follow-up to this PR.

Incremental indexing: Agreed, already tracked in #25809. The plan is to filter entities by updatedAt > lastSuccessfulRun and switch report data to upsert instead of delete-reinsert. For a 100K-entity deployment with ~1% daily change, that drops processed entities from 100K to ~1K per run. Also complementary to this PR.
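
Roughly, the shape of that change (a hypothetical sketch — the helper names are placeholders, not existing OpenMetadata APIs):

// Hypothetical sketch of the #25809 plan; helper names are placeholders.
void runIncremental() {
  long lastRun = loadLastSuccessfulRunTimestamp();                    // persisted job state
  List<EntityInterface> changed = readEntitiesUpdatedAfter(lastRun);  // updatedAt > lastRun
  for (EntityInterface entity : changed) {
    upsertReportData(enrich(entity));                                 // upsert, not delete + reinsert
  }
}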


@harshach (Collaborator) commented

@manerow if you are planning on doing the distributed job in another PR, that works for me. Even with virtual threads, if it's doing a long enough lookback of days, we will lock those tables for a while. Here it's better to distribute based on the number of days the user wants to reindex from.

@manerow (Contributor, Author) commented Feb 18, 2026

@harshach Sounds good. I'll create a task for the distributed approach with date-range partitioning for backfills and tackle it in a separate PR.


Labels

backend, safe to test

Development

Successfully merging this pull request may close these issues.

Data Insight: Migrate reindexing to distributed search indexing