Is your feature request related to a problem? Please describe
Summary
OpenSearch today relies exclusively on Apache Lucene for data storage and query execution. While Lucene excels at full-text search and near real-time indexing, this tight coupling limits our ability to support workloads optimized for different storage formats (covered in more detail in #18416). Users running analytical queries on time-series data, for example, would benefit from columnar formats like Parquet that offer superior compression and scan performance. Similarly, vector similarity search and graph analytics rely on specialized data structures that may need to be modeled outside the Lucene codecs.
This document details the proposed OpenSearch Execution Engine: a pluggable architecture that enables multiple storage formats and execution engines to coexist within OpenSearch. The design maintains backward compatibility with existing Lucene-based indices while providing clear extension points for new formats such as Parquet, ORC, and Lance. We focus on three key areas: (1) the plugin interfaces that enable format extensibility, and how they are bundled; (2) how these plugins integrate with OpenSearch's indexing and data availability flows; and (3) how the architecture composes with existing engines such as InternalEngine, IngestionEngine, and NRTReplicationEngine.
Goals
- Enable pluggable storage formats beyond Lucene (Parquet, ORC, Arrow, custom formats), plus any auxiliary data structures that speed up search, without increasing memory overhead during document indexing.
- Enable ingesting documents into the chosen index type (multi-format or Lucene) within the same distribution to facilitate migration. We can choose to apply artificial blocks to restrict the type of indexing supported on a cluster.
- Cross-version compatibility: The design should ensure that any changes in data format remain readable across the next N OpenSearch versions. Additionally, during a rolling/blue-green deployment, transport communication between nodes should continue to work without downtime [transport is out of scope for the discussion below and will be covered later]. Existing indices may require re-indexing to the implementation proposed in this document during migration.
- Provide clear plugin interfaces with minimal implementation overhead. Reuse existing OpenSearch constructs and integration touch-points wherever possible so that development on peripheral/common features applies to both new and existing engines.
Current State: Lucene Integration Layers
OpenSearch's architecture is deeply integrated with Lucene across three primary layers. Understanding these integration points is essential for designing the abstraction layer.
Layer 1: Engine Integration
The InternalEngine class orchestrates indexing and search operations using Lucene's IndexWriter and IndexSearcher. All document operations flow through these Lucene-specific APIs:
- Current Dependencies:
  - IndexWriter: document indexing and lifecycle management
  - IndexReader/IndexSearcher: query execution
  - MergePolicy: segment merging strategies
  - LuceneDocument: document representation
- Challenge: Alternative engines (Arrow-rs & DataFusion) have their own indexing and query APIs. We need an abstraction that allows IndexShard to delegate to different execution engines without knowing their implementation details.
Layer 2: Store Integration
OpenSearch replicates Lucene segments to remote storage (Amazon S3, Azure Blob) for durability and disaster recovery. The RemoteSegmentStoreDirectory directly manipulates Lucene-specific files:
- Current Dependencies:
  - SegmentInfos: Lucene's segment metadata structure
  - Segment files (.si, .cfs, .cfe, .fnm): Lucene-specific formats
  - IndexInput/IndexOutput: Lucene's I/O abstractions
- Challenge: Non-Lucene formats (Parquet, ORC) have different file structures and metadata. The remote storage layer needs format-agnostic abstractions to handle arbitrary file types. The metadata can also extend to the Apache Iceberg spec, custom commit formats, etc., and the remote store layer should be able to handle those changes.
Layer 3: Query and Aggregation Framework
OpenSearch's Query DSL translates to Lucene Query objects, and aggregations directly access Lucene's DocValues and LeafReaderContext:
- Current Dependencies:
  - Query/Weight: Lucene query representation and execution
  - DocValues: columnar field access within Lucene
  - LeafReaderContext: segment-level data access
- Challenge: Alternative query engines have different query representations and execution models. We need a logical query representation that can be translated to engine-specific physical plans.
Interface Proposal
Indexing and reads currently happen through the Engine abstraction, which is responsible for all Lucene-related operations. Because the Engine abstraction is tightly coupled to Lucene, we propose decoupling it through a new composable layer [the placement of these components within the repository is discussed in a later section]:
DataFormat:
The DataFormat interface exposes the data format along with its name and version. The name and version are persisted in the commit metadata for each file/segment created. The version is consulted at read time, while writes always target the latest version shipped with the distribution. Since the read-side engine is decoupled from data writes, we have not introduced a full-fledged codec here, only a version per data format, which should allow engines like DataFusion to understand what metadata/structure to expect with the data. This format version is decoupled from the actual OpenSearch version and can evolve independently while preserving read backward compatibility. Some engines (e.g. Lucene, see apache/lucene#13797) limit how much read compatibility they can support, and OpenSearch's backward compatibility will be bounded by the same limits.
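A minimal sketch of what this interface could look like; the names and signatures below are illustrative assumptions, not the final OpenSearch API:

```java
// Hypothetical sketch of the proposed DataFormat interface; names and
// signatures are assumptions for illustration.
interface DataFormat {
    // Stable identifier persisted in the commit metadata for every
    // file/segment created (e.g. "lucene", "parquet").
    String name();

    // Format version, decoupled from the OpenSearch version; writers always
    // emit the latest version bundled with the distribution.
    int writeVersion();

    // Oldest version this distribution can still read; engines like Lucene
    // bound how far back this can go.
    int minimumReadVersion();
}

// Example: a Parquet-backed format advertising read compatibility one
// version back (versions here are invented for the example).
final class ParquetDataFormat implements DataFormat {
    public String name() { return "parquet"; }
    public int writeVersion() { return 2; }
    public int minimumReadVersion() { return 1; }
}
```

Keeping only a name and version pair, rather than a full codec, is what lets a read-side engine such as DataFusion decide how to interpret the files without being coupled to the writer.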
Engine
Indexer [Pluggable]
The Indexer interface serves as the primary abstraction for all document indexing operations in OpenSearch. It defines the contract for index, delete, and update operations, along with lifecycle methods like refresh, flush, and merge that are essential for making documents searchable and ensuring durability. By abstracting these operations behind a common interface, the Indexer enables OpenSearch to work with different storage formats without modifying higher-level code like IndexShard or RemoteStore. This interface is implemented by both the traditional Lucene-based InternalEngine and the new CompositeEngine, ensuring backward compatibility while enabling multi-format support. The Indexer also provides methods for acquiring catalog snapshots, which are critical for coordinating operations across multiple formats. This would be the main layer to facilitate how OpenSearch code can decouple from Lucene.
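To make the contract concrete, here is a toy sketch of the Indexer interface with an in-memory implementation; all names are illustrative assumptions, and the real interface would carry Engine-level concerns (versioning, translog, etc.) omitted here:

```java
import java.io.IOException;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the Indexer contract; method names are assumptions,
// not the actual interface. Both InternalEngine and CompositeEngine would
// implement this so IndexShard can stay format-agnostic.
interface Indexer extends AutoCloseable {
    void index(String id, Map<String, Object> doc) throws IOException;
    void delete(String id) throws IOException;
    void refresh(String source) throws IOException; // make docs searchable
    long acquireCatalogGeneration();                // point-in-time view
}

// Toy in-memory implementation used only to illustrate the contract:
// documents become visible (and the generation advances) on refresh.
final class InMemoryIndexer implements Indexer {
    private final Map<String, Object> buffered = new HashMap<>();
    private final Map<String, Object> visible = new HashMap<>();
    private long generation = 0;

    public void index(String id, Map<String, Object> doc) { buffered.put(id, doc); }
    public void delete(String id) { buffered.remove(id); visible.remove(id); }
    public void refresh(String source) {
        visible.putAll(buffered);
        buffered.clear();
        generation++;
    }
    public long acquireCatalogGeneration() { return generation; }
    public int visibleCount() { return visible.size(); }
    public void close() {}
}
```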
CompositeEngine
CompositeEngine is the orchestrator that coordinates operations across multiple storage formats within a single index. It implements the Indexer interface, making it a drop-in replacement for the traditional Lucene-based engine while internally delegating to multiple IndexingExecutionEngine instances—one for each configured format. When a document is indexed, CompositeEngine distributes it to all registered execution engines (e.g., Lucene and Parquet), ensuring that each format receives the same document with a common row ID for cross-format correlation. During refresh operations, CompositeEngine coordinates the flush of all formats and aggregates their file metadata into a unified CatalogSnapshot, maintaining consistency across formats. The engine also manages resource allocation, ensuring that each format operates within its allocated memory and thread quotas to prevent interference. The components that follow from here are integrated into the CompositeEngine to facilitate its orchestration:
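The fan-out behavior described above can be sketched as follows, assuming a hypothetical FormatWriter per configured format (names are illustrative):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Sketch of CompositeEngine's fan-out: each document gets a single row ID
// and is delivered to every format, so results can be correlated across
// formats at query time. FormatWriter is an invented stand-in for the
// format-specific engines.
interface FormatWriter {
    void write(long rowId, Map<String, Object> doc);
}

final class CompositeEngineSketch {
    private final List<FormatWriter> formats;
    private long nextRowId = 0;

    CompositeEngineSketch(List<FormatWriter> formats) { this.formats = formats; }

    // Assign one row ID, then hand the same document to all formats.
    long index(Map<String, Object> doc) {
        long rowId = nextRowId++;
        for (FormatWriter f : formats) {
            f.write(rowId, doc);
        }
        return rowId;
    }
}
```

The shared row ID is the key design point: it is what allows, say, a filter evaluated in Lucene to select rows in the Parquet copy of the same data.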
IndexingExecutionEngine [Pluggable]
IndexingExecutionEngine represents a format-specific execution engine responsible for transforming OpenSearch documents into a particular storage format and managing the lifecycle of that format's data structures. Each execution engine encapsulates the logic for writing documents to its format (e.g., Lucene segments or Parquet row groups), flushing buffered data to disk, and providing metadata about the files it creates. The engine maintains a pool of Writer instances that handle the low-level serialization of documents to the format-specific representation, reusing writers across operations to minimize allocation overhead. During refresh, the execution engine flushes its writers and returns metadata that describes the files created, including their names, sizes, and format-specific properties like row group boundaries or bloom filter statistics. This abstraction allows new formats to be added by implementing a new IndexingExecutionEngine without modifying the CompositeEngine coordination logic.
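A hypothetical shape for this interface and the metadata it reports on refresh; field and method names are assumptions for illustration, and the toy engine below only mimics the buffering/flush cycle:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Invented metadata describing one file produced during refresh.
final class FileMetadata {
    final String fileName;
    final long sizeBytes;
    final Map<String, String> formatProps; // e.g. row-group boundaries, bloom filter stats
    FileMetadata(String fileName, long sizeBytes, Map<String, String> formatProps) {
        this.fileName = fileName; this.sizeBytes = sizeBytes; this.formatProps = formatProps;
    }
}

// Hypothetical format-specific engine contract.
interface IndexingExecutionEngine {
    String formatName();
    void addDocument(long rowId, Map<String, Object> doc);
    // Flush buffered documents and describe the files created.
    List<FileMetadata> refresh();
}

// Toy engine that "writes" one file per refresh, to show the contract.
final class ParquetLikeEngine implements IndexingExecutionEngine {
    private final List<Map<String, Object>> buffer = new ArrayList<>();
    private int fileCounter = 0;

    public String formatName() { return "parquet"; }
    public void addDocument(long rowId, Map<String, Object> doc) { buffer.add(doc); }
    public List<FileMetadata> refresh() {
        if (buffer.isEmpty()) return List.of();
        FileMetadata meta = new FileMetadata(
            "part-" + fileCounter++ + ".parquet", buffer.size() * 128L,
            Map.of("rowCount", String.valueOf(buffer.size())));
        buffer.clear();
        return List.of(meta);
    }
}
```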
Committer [Pluggable]
The Committer component is responsible for ensuring durability by persisting all indexed data and metadata to disk in a crash-recoverable manner. It coordinates the commit operation across all formats, ensuring that either all formats commit successfully or none do, maintaining atomicity. The Committer first triggers a flush on all IndexingExecutionEngine instances to ensure all buffered data is written to disk, then syncs all files to ensure they are durable, and finally writes the CatalogSnapshot metadata that describes the committed state. In the current implementation, LuceneBasedCommitter leverages Lucene's existing commit infrastructure (SegmentInfos) as the authoritative source of truth, extending it with additional metadata to track non-Lucene formats. This approach ensures that recovery can reconstruct the state of all formats from the last committed catalog snapshot plus any operations recorded in the translog. In future, this can be extended to support different specifications to commit data such as Iceberg.
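The all-or-nothing sequence can be sketched as below, assuming a hypothetical per-format flush hook; the point is ordering, i.e. the catalog generation is only published after every format has flushed and synced, so a crash mid-commit leaves the previous committed state intact:

```java
import java.io.IOException;
import java.util.List;

// Invented stand-in for a format's flush+fsync step.
interface FlushableFormat {
    void flushAndSync() throws IOException;
}

// Illustrative commit sequence, not the actual LuceneBasedCommitter logic.
final class CommitterSketch {
    private final List<FlushableFormat> formats;
    private long lastCommittedGeneration = 0;

    CommitterSketch(List<FlushableFormat> formats) { this.formats = formats; }

    long commit() throws IOException {
        // Steps 1-2: flush and sync every format; any failure aborts the
        // commit before the catalog snapshot metadata is written.
        for (FlushableFormat f : formats) {
            f.flushAndSync();
        }
        // Step 3: writing the snapshot metadata last makes the commit
        // atomic from the reader's perspective.
        return ++lastCommittedGeneration;
    }

    long lastCommittedGeneration() { return lastCommittedGeneration; }
}
```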
CatalogSnapshot and CatalogSnapshotManager
CatalogSnapshotManager maintains an immutable, point-in-time view of all files across all formats in an index (CatalogSnapshot), enabling consistent queries and recovery. Each refresh operation creates a new CatalogSnapshot that includes metadata from all formats: Lucene's SegmentInfos, Parquet file metadata, and any custom format metadata. The manager ensures that snapshots are created atomically during refresh, preventing partial visibility of documents across formats. It also handles the lifecycle of snapshots, tracking which snapshots are in use by active queries and cleaning up old snapshots when they are no longer needed. The CatalogSnapshot includes generation numbers for ordering, file checksums for integrity verification, and format-specific metadata like row counts and column statistics that can be used for query optimization. By providing a unified view of all format files at a specific point in time, the CatalogSnapshotManager enables features like point-in-time queries and consistent remote store/snapshot/restore operations.
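The snapshot-lifecycle part (tracking which snapshots are in use and cleaning up old ones) can be illustrated with a simple refcounting sketch; names and the exact cleanup policy are assumptions:

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Illustrative refcounting in a CatalogSnapshotManager: queries acquire the
// latest snapshot generation, and an old generation's files become
// deletable only once no query holds it and a newer generation exists.
final class SnapshotManagerSketch {
    private long latestGeneration = 0;
    private final Map<Long, Integer> refCounts = new HashMap<>();
    private final Set<Long> deletable = new HashSet<>();

    // A refresh publishes a new immutable snapshot generation.
    long publishNewSnapshot() { return ++latestGeneration; }

    long acquireLatest() {
        refCounts.merge(latestGeneration, 1, Integer::sum);
        return latestGeneration;
    }

    void release(long generation) {
        int remaining = refCounts.merge(generation, -1, Integer::sum);
        if (remaining == 0 && generation < latestGeneration) {
            refCounts.remove(generation);
            deletable.add(generation); // files now safe to clean up
        }
    }

    boolean isDeletable(long generation) { return deletable.contains(generation); }
}
```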
MergeHandler [Pluggable]
MergeHandler manages the background merging of data files within each format to optimize storage and query performance. In Lucene, merging combines multiple small segments into larger ones to reduce the number of files and improve query efficiency; similarly, in columnar formats like Parquet, merging can consolidate small row groups and rewrite data with better compression or encoding. The CompositeMergeHandler coordinates merge operations across all formats, ensuring that merges are scheduled appropriately and don't interfere with ongoing indexing or query operations.
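As a rough illustration of what a format's merge selection might look like, the sketch below consolidates files under a size threshold into one larger file; the policy and names are invented for the example and are not the proposal's actual merge logic:

```java
import java.util.ArrayList;
import java.util.List;

// Toy merge selection: combine all "small" files into one consolidated
// file, analogous to Lucene segment merges or Parquet row-group rewrites.
final class MergeHandlerSketch {
    private final long smallFileThresholdBytes;

    MergeHandlerSketch(long smallFileThresholdBytes) {
        this.smallFileThresholdBytes = smallFileThresholdBytes;
    }

    // Given current file sizes, return the sizes after merging all
    // files below the threshold into a single file.
    List<Long> merge(List<Long> fileSizes) {
        List<Long> result = new ArrayList<>();
        long consolidated = 0;
        for (long size : fileSizes) {
            if (size < smallFileThresholdBytes) consolidated += size;
            else result.add(size);
        }
        if (consolidated > 0) result.add(consolidated);
        return result;
    }
}
```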
Searcher [Pluggable]
Given CompositeEngine would be the layer responsible for managing the data across supported formats, the engine through which data can be queried (e.g. existing Lucene, DataFusion, etc) would be within a Searcher interface. This would be responsible for executing multiple phases for the query over a point in time context using the implementation provided by the underlying engine.
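A possible shape for this contract, emphasizing that all phases of a query run against one pinned point-in-time generation; the names and the string-based query spec are simplifying assumptions:

```java
import java.util.List;
import java.util.Map;

// Hypothetical Searcher contract: each query phase executes against a fixed
// catalog generation, delegating to the underlying engine (Lucene,
// DataFusion, ...).
interface Searcher extends AutoCloseable {
    // The snapshot generation this searcher is pinned to; every phase of a
    // query runs against the same generation for consistency.
    long catalogGeneration();

    // Execute one phase (e.g. "query", "fetch") and return engine-agnostic rows.
    List<Map<String, Object>> executePhase(String phase, String querySpec);
}

// Trivial implementation used only to demonstrate the contract.
final class FixedResultSearcher implements Searcher {
    private final long generation;
    FixedResultSearcher(long generation) { this.generation = generation; }
    public long catalogGeneration() { return generation; }
    public List<Map<String, Object>> executePhase(String phase, String querySpec) {
        return List.of(Map.of("phase", phase, "generation", generation));
    }
    public void close() {}
}
```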
More details on each of these components will be covered in the associated PR, so the idea can be reviewed with the community.
EnginePlugin
The pluggable components mentioned above would be exposed through the EnginePlugin.
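One possible shape for these extension points is sketched below; the method names, factory types, and Optional-based contract are assumptions for illustration, not the final EnginePlugin API:

```java
import java.util.Optional;

// Invented marker factories standing in for the real pluggable components.
interface IndexingExecutionEngineFactory {}
interface CommitterFactory {}
interface MergeHandlerFactory {}
interface SearcherFactory {}

// Hypothetical sketch of how EnginePlugin could expose the components:
// a plugin contributes a data format name plus optional factories for each
// pluggable piece tied to that format.
interface EnginePluginSketch {
    String dataFormatName();
    Optional<IndexingExecutionEngineFactory> indexingEngineFactory();
    Optional<CommitterFactory> committerFactory();
    Optional<MergeHandlerFactory> mergeHandlerFactory();
    Optional<SearcherFactory> searcherFactory();
}

// Example plugin that only provides the indexing side of a Parquet format.
final class ParquetEnginePlugin implements EnginePluginSketch {
    public String dataFormatName() { return "parquet"; }
    public Optional<IndexingExecutionEngineFactory> indexingEngineFactory() {
        return Optional.of(new IndexingExecutionEngineFactory() {});
    }
    public Optional<CommitterFactory> committerFactory() { return Optional.empty(); }
    public Optional<MergeHandlerFactory> mergeHandlerFactory() { return Optional.empty(); }
    public Optional<SearcherFactory> searcherFactory() { return Optional.empty(); }
}
```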
Bundling Proposal
- opensearch-project
  - opensearch
    - server
      - dataformat-definition
      - composite-dataformat
      - lucene-composable-dataformat [Lucene implementation composable with other formats]
      - [dependencies]
        - arrow
    - libs
      - native-framework
      - dataformat-framework
      - query-aggs-framework
    - sandbox
      - plugins
        - parquet-dataformat [Parquet-specific implementation]
        - engine-datafusion
The following components would be needed to achieve the goal of supporting multiple data formats while using different native dependencies:
- Native Framework: Holds the interfaces and contracts defining a common convention for the high-level interaction between native engines (indexing/search) and OpenSearch. This may also include native-language components reusable across implementations, such as tokio runtime management.
- DataFormat Framework: Holds the interface and registry components for different data formats, including a test framework for testing composability and the basic sanity tests required for any data format to work with OpenSearch.
- Composite Engine/DataFormat: Code responsible for orchestrating the various data formats within a single engine.
- Parquet DataFormat: Code related to writing parquet data
- Lucene DataFormat: Code related to writing Lucene data in a way that composes with other formats.
- Search components: Interfaces introducing new components for the query/aggregations framework and engine interactions around DataFusion, Velox, etc.
Key Decision Drivers:
- Native dependencies in core: To support an Arrow-like intermediate representation as the standard for buffered documents and query results, Apache Arrow is the preferred choice due to its vectorization-friendly structure. While native dependencies are sparse in core (zstd, via a lib, for HTTP compression), OpenSearch already uses various native dependencies in modules/plugins such as netty, k-NN, and custom-codecs. With the modern engine implementation, we can extend core's functionality by introducing Arrow as a direct core dependency and let it naturally extend to flows like ingestion and translog; Arrow also supports compression-like features, which will benefit users in the long term.
- Usability of fundamental components in core flows and plugins: Where possible, reusable components are placed in lib to keep the core light, but in some cases (e.g. the dataformat definition) they depend on existing components in core and are therefore placed directly in core.
- Convergence of implementation: Lucene, as a composable format, could reside in a plugin while the existing InternalEngine implementation continues to live in core. But this would create two different paths for anyone making changes on the Lucene side, or pulling in improvements from upstream Lucene. To ensure we converge on a single implementation over time, the proposal puts the lucene-composable-dataformat code directly in core. As the new Lucene implementation hardens, we can propose deprecating the InternalEngine implementation across major versions, while also providing building blocks that let plugin developers migrate easily.