Add ingest pipeline support for pull-based ingestion #20873
imRishN wants to merge 13 commits into opensearch-project:main
…based ingestion Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
…tion Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
❌ Gradle check result for 6771f6e: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
Codecov Report
❌ Patch coverage is
@@             Coverage Diff              @@
##               main   #20873      +/-   ##
============================================
- Coverage     73.31%   73.22%   -0.09%
+ Complexity    72248    72165      -83
============================================
  Files          5795     5796       +1
  Lines        330044   330134      +90
  Branches      47641    47648       +7
============================================
- Hits         241975   241748     -227
- Misses        68609    68984     +375
+ Partials      19460    19402      -58
}

/**
 * Resolves pipeline names from index settings. Called lazily on first document and cached.
Are we resolving the pipelines lazily on the first document? It looks like we resolve them only when the MessageProcessor is initialized. I think what we do now (resolving on initialization) is better.
Yeah, fixing the comment.
// Block until pipeline execution completes (with timeout)
try {
    future.get(PIPELINE_EXECUTION_TIMEOUT_SECONDS, TimeUnit.SECONDS);
I'm wondering if it would be better to add synchronous execution support in IngestService, something like:
executePipelineSync(..) {
    CountDownLatch latch = new CountDownLatch(1);
    // execute the pipeline (ex: innerExecute(..)); the completion callback calls latch.countDown()
    latch.await();
}
If this is possible, we could execute the pipelines on the same thread and avoid the thread handoff. For async pipelines, it would still wait for the result to become available.
What do you think? Have we already explored this path and run into any other challenges?
Hmm, executing on the same thread would avoid the handoff overhead. But there are a few things I considered:
- IngestService is a core class used by all push-based indexing. Adding a sync execution path might modify a stable interface and would need a deeper review, which could be beyond the scope of this PR. Additionally, runBulkRequestInBatch() handles batching, metrics tracking, pipeline chaining, index change detection, and slot management. We could expose a sync path through all of that, but it looks non-trivial. Wdyt?
- For async processors, we'd still need a latch/future to block for those cases. The internal Pipeline.execute() -> IngestDocument.executePipeline() chain is fundamentally callback-based.
- There seems to be no practical impact for most processors (the simpler, lightweight ones), as execution time dominates the context-switch cost.
Even with the above nuances, this could be a valid optimization and can be taken up as a follow-up when we benchmark our changes. I can create a tracking issue for this. Let me know your thoughts.
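For readers following this thread, here is a minimal sketch of the two blocking strategies being compared: completing a CompletableFuture from the callback versus counting down a CountDownLatch. AsyncPipeline, SyncBridgeSketch, runWithFuture and runWithLatch are hypothetical names invented for illustration; they are not the IngestService API and not the code in this PR. Neither variant decides by itself which thread runs the processors; that depends on how the underlying callback-based API dispatches the work.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.BiConsumer;

// Hypothetical stand-in for a callback-based pipeline API (assumption, not IngestService).
interface AsyncPipeline {
    void execute(String doc, BiConsumer<String, Exception> onDone);
}

final class SyncBridgeSketch {
    // Future-based bridge: the callback completes the future, the caller blocks with a timeout.
    static String runWithFuture(AsyncPipeline p, String doc, long timeoutSeconds)
            throws InterruptedException, ExecutionException, TimeoutException {
        CompletableFuture<String> future = new CompletableFuture<>();
        p.execute(doc, (result, error) -> {
            if (error != null) {
                future.completeExceptionally(error);
            } else {
                future.complete(result);
            }
        });
        return future.get(timeoutSeconds, TimeUnit.SECONDS);
    }

    // Latch-based bridge: the callback records the outcome and releases the waiting caller.
    static String runWithLatch(AsyncPipeline p, String doc, long timeoutSeconds) throws Exception {
        CountDownLatch latch = new CountDownLatch(1);
        AtomicReference<String> result = new AtomicReference<>();
        AtomicReference<Exception> failure = new AtomicReference<>();
        p.execute(doc, (r, e) -> {
            result.set(r);
            failure.set(e);
            latch.countDown();
        });
        if (!latch.await(timeoutSeconds, TimeUnit.SECONDS)) {
            throw new TimeoutException("pipeline did not complete in time");
        }
        if (failure.get() != null) {
            throw failure.get();
        }
        return result.get();
    }

    public static void main(String[] args) throws Exception {
        // Synchronous processor: the callback fires on the calling thread, so the wait returns at once.
        AsyncPipeline upper = (doc, onDone) -> onDone.accept(doc.toUpperCase(), null);
        // Asynchronous processor: the callback fires on another thread, so the caller really blocks.
        AsyncPipeline async = (doc, onDone) ->
            CompletableFuture.runAsync(() -> onDone.accept(doc + "!", null));
        System.out.println(runWithFuture(upper, "a", 5));
        System.out.println(runWithLatch(async, "b", 5));
    }
}
```

In this sketch both variants behave the same for a synchronous processor; the difference the thread discusses is where the pipeline is dispatched, which this toy API leaves to the AsyncPipeline implementation.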
updateFinalPipeline(IndexSettings.FINAL_PIPELINE.get(indexSettings.getSettings()));

// Register dynamic settings listener for final_pipeline updates
indexSettings.getScopedSettings().addSettingsUpdateConsumer(IndexSettings.FINAL_PIPELINE, this::updateFinalPipeline);
Should we register the listener in the constructor, before resolvePipelineNames can be called, so we don't miss any updates?
Yes, good point. This also helped remove some cluttered code.
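The ordering concern in this thread can be sketched as follows, under stated assumptions: DynamicSetting and PipelineNameHolder are hypothetical classes written for illustration and do not mirror the OpenSearch settings API. The constructor registers the update consumer first and reads the current value second, so a change that lands between the two steps is still delivered. This narrows the window for a missed update; it does not claim to cover every interleaving of concurrent updates.

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;
import java.util.function.Consumer;

// Hypothetical stand-in for a dynamic index setting (assumption, not the OpenSearch API).
final class DynamicSetting {
    private volatile String value;
    private final List<Consumer<String>> consumers = new CopyOnWriteArrayList<>();

    DynamicSetting(String initial) {
        this.value = initial;
    }

    String get() {
        return value;
    }

    void addUpdateConsumer(Consumer<String> consumer) {
        consumers.add(consumer);
    }

    // Stores the new value first, then notifies every registered consumer.
    void update(String newValue) {
        value = newValue;
        consumers.forEach(c -> c.accept(newValue));
    }
}

// Hypothetical holder that keeps the resolved final pipeline name up to date.
final class PipelineNameHolder {
    private volatile String finalPipeline;

    PipelineNameHolder(DynamicSetting setting) {
        // Step 1: register first, so updates from this point on reach this holder.
        setting.addUpdateConsumer(this::updateFinalPipeline);
        // Step 2: read the current value to pick up whatever was set before registration.
        updateFinalPipeline(setting.get());
    }

    private void updateFinalPipeline(String name) {
        this.finalPipeline = name;
    }

    String finalPipeline() {
        return finalPipeline;
    }
}
```

Doing both steps inside the constructor also means callers never see a holder whose name has not been resolved yet.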
String indexName = engine.config().getIndexSettings().getIndex().getName();
this.engine = engine;
this.index = indexName;
this.pipelineExecutor = new IngestPipelineExecutor(ingestService, indexName);
Is it possible to create a single instance of IngestPipelineExecutor at the IngestionEngineFactory layer and pass it all the way through? I'm wondering if that would help us avoid duplicate settings update consumer registrations across writer threads.
Good suggestion. Refactored to create a single IngestPipelineExecutor instance in IngestionEngine and pass it through the chain instead of IngestService. This has also eliminated the duplicate listener registrations.
Also, as we discussed in prior PRs, this has cleaned up the wiring: the intermediate layers no longer know about IngestService at all. The wiring looks like this now:
IngestionEngine (creates single IngestPipelineExecutor)
→ DefaultStreamPoller (passes executor through)
→ PartitionedBlockingQueueContainer (passes executor through)
→ N × MessageProcessorRunnable (all share the same executor)
IngestService is now only referenced in IngestionEngine and IngestPipelineExecutor itself. The rest of the pull-based path is decoupled from it.
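The wiring described above can be pictured with a small sketch. Every class here (ExecutorStub, ProcessorStub, QueueContainerStub, PollerStub, EngineStub) is a hypothetical stand-in named after the role it plays in the chain; none of them is the real OpenSearch class, and the counter only simulates a settings listener registration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for the shared pipeline executor.
final class ExecutorStub {
    ExecutorStub(AtomicInteger listenerRegistrations) {
        // Simulates registering the settings update consumer; happens once per instance.
        listenerRegistrations.incrementAndGet();
    }

    String execute(String doc) {
        return doc.trim();
    }
}

// Hypothetical writer-thread runnable; it only receives the executor, never the ingest service.
final class ProcessorStub implements Runnable {
    final ExecutorStub executor;

    ProcessorStub(ExecutorStub executor) {
        this.executor = executor;
    }

    @Override
    public void run() {
        executor.execute(" doc ");
    }
}

// Hypothetical queue container: passes the same executor to each of its N processors.
final class QueueContainerStub {
    final List<ProcessorStub> processors = new ArrayList<>();

    QueueContainerStub(ExecutorStub executor, int writers) {
        for (int i = 0; i < writers; i++) {
            processors.add(new ProcessorStub(executor));
        }
    }
}

// Hypothetical poller: passes the executor through without using it.
final class PollerStub {
    final QueueContainerStub container;

    PollerStub(ExecutorStub executor, int writers) {
        this.container = new QueueContainerStub(executor, writers);
    }
}

// Hypothetical engine: the only place where the executor is created.
final class EngineStub {
    final AtomicInteger listenerRegistrations = new AtomicInteger();
    final ExecutorStub executor = new ExecutorStub(listenerRegistrations);
    final PollerStub poller;

    EngineStub(int writers) {
        this.poller = new PollerStub(executor, writers);
    }
}
```

With this shape, the number of simulated listener registrations stays at one regardless of how many writer threads are configured, which is the effect the refactoring is after.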
this.engine = engine;
this.index = indexName;
this.pipelineExecutor = new IngestPipelineExecutor(ingestService, indexName);
this.pipelineExecutor.resolvePipelineNames(engine.config().getIndexSettings());
Is it possible to call resolvePipelineNames inside the IngestPipelineExecutor constructor instead of exposing it outside?
Done. Resolution is now fully encapsulated in the constructor.
 * synchronously by bridging IngestService's async callback API with CompletableFuture.
 * Only {@code final_pipeline} is supported.
 */
public class IngestPipelineExecutor {
We can highlight in the Javadoc that ingest pipelines/processors in the pull-based ingestion flow do not require the INGEST role and execute the transformations on the current node (the request is not forwarded to ingest nodes).
Yeah, I missed adding that. Added now.
  POLLING,
- PROCESSING
+ PROCESSING,
+ PIPELINE
Are we using the new error stage anywhere? Pipeline execution is part of the PROCESSING stage, so maybe we can avoid adding a new error stage?
Added this for fine-grained tracking, but removing it for now in the context of this PR. Will revisit later if it is needed.
Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
…estion Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
❌ Gradle check result for be0f220: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
❌ Gradle check result for b771d73: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
…ionEngine Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
❌ Gradle check result for 28c9752: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
❌ Gradle check result for e0dc3f4: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Rishab Nahata <rishabnahata07@gmail.com>
❌ Gradle check result for 9a50135: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Adds final_pipeline execution support to the pull-based ingestion path. Documents are transformed by configured ingest pipelines before being written to Lucene.
Related Issues
Resolves -
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.