Add foundational classes for storage support for multi-format data support #20943

Draft
ask-kamal-nayan wants to merge 1 commit into opensearch-project:main from ask-kamal-nayan:composite-data-format-storage

@ask-kamal-nayan commented Mar 20, 2026

Description

This PR introduces the foundational storage-layer classes and interfaces required for composite data format support. It establishes the directory abstractions, metadata structures, and remote store extensions that allow files from multiple data formats to coexist within a single shard, with each format stored in its own subdirectory and routed through format-aware directory implementations.

This is the first in a series of PRs building out multi-format storage support. Subsequent PRs will add the composite indexing engine, data format plugin integration, and end-to-end read/write paths.

Changes

New Classes

  • FileMetadata: Encapsulates a file's data format and name, enabling format-aware file identification across the storage layer.
  • CompositeStoreDirectory: Format-aware local directory that delegates path routing to SubdirectoryAwareDirectory and adds format-specific checksum calculation (CodecUtil for Lucene, CRC32 for other formats).
  • CompositeRemoteDirectory: Extends RemoteDirectory with per-format BlobContainer routing for remote segment uploads and downloads.
  • CompositeRemoteSegmentStoreDirectory: Extends RemoteSegmentStoreDirectory to handle composite format metadata and format-aware remote segment operations.
  • SubdirectoryAwareDirectory: Lucene FilterDirectory that routes file operations across subdirectories within the shard data path (extracted from an inner class of SubdirectoryAwareStore into server for reuse).
  • CompositeEngineCatalogSnapshot / SegmentInfosCatalogSnapshot: Catalog snapshot implementations for the composite engine and standard Lucene segments, respectively.
  • MetadataFilenameUtils: Extracted from an inner class of RemoteSegmentStoreDirectory into a top-level utility class.
  • UploadedSegmentMetadata: Extracted from an inner class of RemoteSegmentStoreDirectory into a standalone class.
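
To make the routing and checksum behaviour described above concrete, here is a minimal, dependency-free sketch. It is not the PR's code: `FileMetadataSketch` and `FormatRoutingSketch` are hypothetical names, the `<shard path>/<format>/<file>` layout is an assumption based on the description, and the `parquet` format in the usage note is only an illustrative placeholder. The Lucene branch is left out because it relies on Lucene's CodecUtil, which is not available in a standalone snippet.

```java
import java.nio.file.Path;
import java.util.zip.CRC32;

// Hypothetical stand-in for the PR's FileMetadata: a (dataFormat, fileName) pair.
final class FileMetadataSketch {
    final String dataFormat;
    final String fileName;

    FileMetadataSketch(String dataFormat, String fileName) {
        this.dataFormat = dataFormat;
        this.fileName = fileName;
    }
}

final class FormatRoutingSketch {
    static final String LUCENE = "lucene";

    // Assumed layout: every format gets a subdirectory named after the format.
    static Path resolve(Path shardDataPath, FileMetadataSketch file) {
        return shardDataPath.resolve(file.dataFormat).resolve(file.fileName);
    }

    // CRC32 for non-Lucene formats, as the PR description states.
    static long checksum(FileMetadataSketch file, byte[] content) {
        if (LUCENE.equals(file.dataFormat)) {
            // The PR uses Lucene's CodecUtil here; omitted to keep the sketch self-contained.
            throw new UnsupportedOperationException("Lucene checksums need CodecUtil");
        }
        CRC32 crc = new CRC32();
        crc.update(content, 0, content.length);
        return crc.getValue();
    }
}
```

Under these assumptions, `FormatRoutingSketch.resolve(Paths.get("shard0"), new FileMetadataSketch("parquet", "_0.parquet"))` resolves to `shard0/parquet/_0.parquet`, and the checksum of a non-Lucene file is the plain CRC32 of its bytes.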

Modified Classes

  • CatalogSnapshot: Added Writeable and Cloneable support for serialization.
  • Segment: Added getDFGroupedSearchableFiles(), getGeneration(), and writeTo() to support composite catalog snapshots.
  • StoreFileMetadata: Added a dataFormat field to track which format a file belongs to (defaults to "lucene").
  • RemoteSegmentStoreDirectory: Refactored to extract the inner classes MetadataFilenameUtils and UploadedSegmentMetadata into top-level classes. Added format-aware imports.
  • RemoteDirectory: Added a deleteFile(UploadedSegmentMetadata) overload.
  • RemoteSegmentMetadata: Updated to support composite format metadata.
  • SubdirectoryAwareStore: Removed the inner SubdirectoryAwareDirectory class, which is now imported from server.
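
As a rough illustration of the dataFormat default and of grouping files by format, the sketch below may help. It is an assumption-laden stand-in, not the PR's implementation: `StoreFileSketch` and `groupByFormat` are hypothetical names, and the actual signature and return type of getDFGroupedSearchableFiles() are not shown in this description.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-in for a file entry carrying a dataFormat field.
final class StoreFileSketch {
    static final String DEFAULT_FORMAT = "lucene";

    final String name;
    final String dataFormat;

    // Existing callers that know nothing about formats get the "lucene" default.
    StoreFileSketch(String name) {
        this(name, DEFAULT_FORMAT);
    }

    StoreFileSketch(String name, String dataFormat) {
        this.name = name;
        this.dataFormat = dataFormat == null ? DEFAULT_FORMAT : dataFormat;
    }

    // Groups file names by data format, preserving first-seen format order.
    static Map<String, List<String>> groupByFormat(List<StoreFileSketch> files) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (StoreFileSketch file : files) {
            grouped.computeIfAbsent(file.dataFormat, k -> new ArrayList<>()).add(file.name);
        }
        return grouped;
    }
}
```

Defaulting in the constructor keeps format-unaware call sites unchanged, which is presumably why the field defaults to "lucene" in the PR.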

Testing

This PR introduces foundational classes and interfaces. Tests will follow in subsequent PRs as the composite engine integration is built out.

Related Issues

Resolves #[Issue number to be closed when this PR is merged]

Check List

  • Functionality includes testing.
  • API changes companion pull request created, if applicable.
  • Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions (Contributor)

PR Code Analyzer ❗

AI-powered 'Code-Diff-Analyzer' found issues on commit 5151b2f.

| Path | Line | Severity | Description |
| --- | --- | --- | --- |
| server/src/main/java/org/opensearch/index/store/UploadedSegmentMetadata.java | 85 | medium | Lucene major version validation is intentionally commented out in fromString(). The call `metadata.setWrittenByMajor(Integer.parseInt(values[4]))` is disabled, bypassing the version compatibility check that prevents loading segments written by incompatible Lucene versions. This could allow corrupt or incompatible segment data to be accepted silently. |
| server/src/main/java/org/opensearch/index/engine/exec/FileMetadata.java | 11 | low | Imports `reactor.util.annotation.NonNull` from Project Reactor, which is an unusual dependency for the OpenSearch server module core engine package. This is only used for a `@NonNull` annotation on `toString()` and could introduce an unvetted transitive dependency into a critical code path. |
| server/src/main/java/org/opensearch/index/store/UploadedSegmentMetadata.java | 73 | low | fromString() constructs a `java.io.File` object from the input string and calls `getName()` to strip path components before parsing. This silently discards directory path information from the stored filename, which could cause mismatches between stored and retrieved metadata if filenames include directory prefixes, or mask path traversal attempts in file identifiers. |
| server/src/main/java/org/opensearch/index/store/CompositeRemoteSegmentStoreDirectory.java | 167 | low | The constructor taking a `RemoteDirectory` explicitly sets `this.compositeRemoteDirectory = null`, yet later methods such as `delete()` include the comment 'Always call compositeRemoteDirectory - no null checks' and dereference it unconditionally. This inconsistency between the null assignment and the stated invariant could mask the actual code path taken during deletion, making it difficult to audit whether the correct remote data is being cleaned up. |

The table above displays the top 10 most important findings.

Total: 4 | Critical: 0 | High: 0 | Medium: 1 | Low: 3
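
The two UploadedSegmentMetadata findings can be illustrated with a small standalone sketch. Everything here is an assumption for illustration only: `SegmentMetadataParseSketch`, the five-field `::`-separated layout, and the `maxSupportedMajor` parameter are hypothetical, and the range check is a simple stand-in for Lucene's own version validation. It shows what `new File(name).getName()` discards and one possible shape for a parser that keeps the version check and the full filename.

```java
import java.io.File;

// Hypothetical stand-in for UploadedSegmentMetadata; the field layout and "::" delimiter are assumptions.
final class SegmentMetadataParseSketch {
    static final String SEPARATOR = "::";

    final String originalFilename;
    final String uploadedFilename;
    final String checksum;
    final long length;
    final int writtenByMajor;

    private SegmentMetadataParseSketch(String originalFilename, String uploadedFilename,
                                       String checksum, long length, int writtenByMajor) {
        this.originalFilename = originalFilename;
        this.uploadedFilename = uploadedFilename;
        this.checksum = checksum;
        this.length = length;
        this.writtenByMajor = writtenByMajor;
    }

    // Demonstrates the flagged behaviour: everything before the last separator is dropped.
    static String stripDirectories(String name) {
        return new File(name).getName();
    }

    static SegmentMetadataParseSketch fromString(String serialized, int maxSupportedMajor) {
        String[] values = serialized.split(SEPARATOR);
        if (values.length != 5) {
            throw new IllegalArgumentException("expected 5 fields but got " + values.length);
        }
        // Version check stays enabled; the range test stands in for Lucene's real validation.
        int major = Integer.parseInt(values[4]);
        if (major < 1 || major > maxSupportedMajor) {
            throw new IllegalArgumentException("unsupported Lucene major version: " + major);
        }
        // The original filename is kept verbatim, including any format subdirectory prefix.
        return new SegmentMetadataParseSketch(values[0], values[1], values[2],
            Long.parseLong(values[3]), major);
    }
}
```

With these assumptions, `stripDirectories("parquet/_0.parquet")` yields `_0.parquet`, which is the information loss the analyzer describes, while `fromString` preserves the prefix and rejects entries whose version field is missing or out of range.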


Pull Requests Author(s): Please update your Pull Request according to the report above.

Repository Maintainer(s): You can bypass diff analyzer by adding label skip-diff-analyzer after reviewing the changes carefully, then re-run failed actions. To re-enable the analyzer, remove the label, then re-run all actions.


⚠️ Note: The Code-Diff-Analyzer helps protect against potentially harmful code patterns. Please ensure you have thoroughly reviewed the changes beforehand.

Thanks.
