
Conversation

@clintropolis
Member

Description

This PR introduces a new segment format, taking learnings from years of experience with the v9 format and designed to allow partial segment downloads, which should greatly improve the efficiency and responsiveness of the virtual storage fabric functionality introduced in #18176 (partial segment downloads themselves are not part of this PR). Overall the changes are more of a remix than a major departure from v9. To streamline partial fetches, the base segment contents are combined into a single file, currently named druid.segment in this PR (thoughts on the name are welcome; I'm not terribly attached to this one).

Set druid.indexer.task.buildV10=true to make segments in the new format.

Layout

| version (byte) | meta compression (byte) | meta length (int) | meta json | chunk 0 | chunk 1 | ... | chunk n |

version: equivalent to version.bin in the v9 format, a byte that indicates the segment version
meta compression, length, json: the unified segment metadata, the newly added SegmentFileMetadata
chunks: equivalent to the smoosh chunks (containers) of the v9 format (e.g. 00000.smoosh etc.), but concatenated together; ranges of the file are mapped based on offsets stored in the unified metadata
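
To make the layout concrete, here is a minimal sketch of how a reader could walk the header; the class, method, and variable names are hypothetical, the byte order is an assumption, and the actual reader code in this PR may differ.

```java
// Hypothetical sketch of walking the v10 header described above. Class,
// method, and variable names are illustrative, and big-endian byte order is
// an assumption; the actual reader code in this PR may differ.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SegmentHeaderSketch
{
  static void readHeader(Path segmentFile) throws IOException
  {
    try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
      // | version (byte) | meta compression (byte) | meta length (int) | meta json | chunk 0 | ... |
      ByteBuffer fixed = ByteBuffer.allocate(1 + 1 + Integer.BYTES).order(ByteOrder.BIG_ENDIAN);
      while (fixed.hasRemaining() && channel.read(fixed) >= 0) {
        // keep reading until the fixed-size prefix is fully buffered
      }
      fixed.flip();
      final byte version = fixed.get();          // segment format version, like version.bin in v9
      final byte metaCompression = fixed.get();  // how the metadata blob is compressed
      final int metaLength = fixed.getInt();     // length of the metadata blob that follows

      ByteBuffer metaBytes = ByteBuffer.allocate(metaLength);
      while (metaBytes.hasRemaining() && channel.read(metaBytes) >= 0) {
        // read the metadata blob
      }
      metaBytes.flip();
      // assuming metaCompression means "uncompressed" here for simplicity; a real reader would
      // decompress first, then deserialize into SegmentFileMetadata. The chunk data follows,
      // and ranges of it are mapped using offsets recorded in that metadata.
      final String metaJson = StandardCharsets.UTF_8.decode(metaBytes).toString();
      System.out.println("version=" + version + " compression=" + metaCompression + " meta=" + metaJson);
    }
  }
}
```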

SegmentFileMetadata

One of the bigger changes compared to the v9 format is the consolidation of all the various metadata stored in the segment into a single JSON blob, SegmentFileMetadata. In the v9 segment format, metadata is split across a variety of places:
* meta.smoosh: metadata about which internal files are present and their offsets within the smoosh containers
* index.drd: the list of non-null columns, the list of non-null dimensions, the interval, the bitmap factory, the list of all columns including nulls, and the list of all dimensions including null-only dimensions
* metadata.drd: aggregators, timestampSpec, query granularity, rollup flag, ordering, and the list of projections
* ColumnDescriptor: scattered across the internal files of the smoosh, containing type information and how to load a column supplier

This metadata has all been consolidated into a single place to make it easy to retrieve information about both schema and layout, which is the key to how v10 will be able to support partial downloads. Schema information is expressed as a set of projections (including modeling the base table as a projection), and the ColumnDescriptors are pulled out of the column files and instead live in the metadata. In virtual storage mode, this metadata will be fetched on segment load; since it contains both where in the file the data is located and how to read it, the server will be able to fetch only the data which is actually required to complete the query.
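
As a rough illustration of what gets consolidated, a simplified and hypothetical shape of the unified metadata might look like the following; all names below are illustrative, not the actual fields of SegmentFileMetadata in this PR. The idea is that schema (projections and column descriptors) and layout (per-file offsets) live together.

```java
// Simplified, hypothetical sketch of the kind of information the unified metadata
// consolidates; all names below are illustrative, not the actual fields of
// SegmentFileMetadata in this PR.
import java.util.List;
import java.util.Map;

class SegmentFileMetadataSketch
{
  // layout: where each internal file lives inside druid.segment
  static class InternalFile
  {
    String name;        // e.g. a column's data file
    int chunk;          // which concatenated chunk it lives in
    long startOffset;   // byte range within that chunk
    long endOffset;
  }

  // schema: the base table and any aggregate projections, each with its columns
  static class Projection
  {
    String name;                            // the base table is modeled as a projection too
    int rowCount;
    List<String> dimensions;
    Map<String, Object> columnDescriptors;  // per-column type info and how to load its column supplier
  }

  List<Projection> projections;
  List<InternalFile> files;
  // plus what v9 kept in index.drd / metadata.drd: interval, bitmap factory,
  // aggregators, timestampSpec, query granularity, rollup flag, ordering, etc.
}
```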

External files

The v10 format also supports the concept of 'external' segment containers, which can be 'attached' to the base segment to augment it with additional/optional data; this PR has very rudimentary support for them. This is a very experimental feature; our initial thinking is to support use cases like optional indexes that can be downloaded separately (or even constructed at load time/on the fly). In the current implementation provided in this PR, column serializers can specify additional 'external' segment files to write contents to during segment creation, and readers can refer to these files during segment mapping.

In its current form this is more of an organizational feature; if used, the external segment files will just be included and pushed to deep storage as part of publishing, and downloaded on fetch, but no actual column implementations use this at this time. Future work will expand on this functionality to realize the ideas suggested above.
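
For a sense of the shape of the idea, here is a purely illustrative sketch; the interface and method names are hypothetical and not the actual API added in this PR.

```java
// Purely illustrative sketch of the external-file idea; the interface and method
// names here are hypothetical, not the actual API added in this PR.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

interface ExternalFileWriter
{
  // a serializer asks for a named external container instead of writing into the base druid.segment
  WritableByteChannel openExternal(String externalFileName) throws IOException;
}

class OptionalIndexSerializerSketch
{
  void writeOptionalIndex(ExternalFileWriter writer, ByteBuffer indexData) throws IOException
  {
    // the optional index goes to its own container, so it could be pushed, downloaded,
    // or even rebuilt independently of the base segment file
    try (WritableByteChannel channel = writer.openExternal("hypothetical_optional_index")) {
      channel.write(indexData);
    }
  }
}
```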

Release note

todo


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Segment Format and Ser/De Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Dec 31, 2025
@jtuglu1
Contributor

jtuglu1 commented Jan 1, 2026

👍 A few questions:

  • Will this support reading a single column from a segment (instead of needing to download + scan entire segment)? I guess mapping offset ranges of a segment file is analogous to the row-group concept in Parquet.
  • Are there any thoughts to make Druid formats Arrow-compatible? This would open up many more integrations with existing big data ecosystem externally, as well as making intra-cluster data transfer potentially much faster (send everything as RecordBatch).

…ooshFileMetadata to SegmentInternalFileMetadata, improve some tests to have an IndexMergerV10Factory available to avoid future debt if things want to test with v10
Contributor

@gianm gianm left a comment

We've been running this segment format in production for some time and have had positive experiences with it, so my review focuses mostly on smaller things. Big picture, I am happy with the overall design.

ProjectionMetadata.forBaseTable(indexMergeResult.rowCount, mergedDimensionsWithTime, finalMetadata)
);
// convert v9 projections to v10 projections
for (AggregateProjectionMetadata aggMeta : finalMetadata.getProjections()) {
Contributor

Up above there is some handling for what to do if segmentMetadata is null, but if it ever is actually null then this line will blow up. Is it meant to be required to be nonnull? If so remove the @Nullable and put a defensive null check earlier in the method.

Member Author

Added a defensive check to ensure that it is not null; the interface allows it because the v9 merger allows it, but I don't think it should be able to happen when merging v10 segments. Also shuffled stuff around a bit to clean up.
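
For reference, the kind of defensive check being discussed is roughly the following sketch (a generic helper, not the actual code added in the PR):

```java
// Sketch of the kind of defensive check discussed above; generic helper rather
// than the actual code added in the PR.
import java.util.Objects;

class NullCheckSketch
{
  static <T> T requireSegmentMetadata(T segmentMetadata)
  {
    // v9 tolerates a null metadata object, but the v10 merge path should never see one,
    // so fail fast with a clear message instead of an NPE deep in projection handling
    return Objects.requireNonNull(segmentMetadata, "segmentMetadata must be non-null when merging v10 segments");
  }
}
```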

final byte[] metadataBytes = jsonMapper.writeValueAsBytes(segmentFileMetadata);

try (final FileOutputStream outputStream = new FileOutputStream(new File(baseDir, outputFileName))) {
// still need to make compression work... probably need to store both compressed and uncompressed lengths? no harm
Contributor

It seems fine to leave this for later but I'd prefer the comment to be worded more firmly as to what needs to be done if we ever want to add support for other compressions.

Member Author

Just went ahead and added support for compression, which seems pretty nice to have, at least in the extreme case of tons of projections in CursorFactoryProjectionTest:

$ ls -lh "/var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8917904562844009022/testIndex-247023712/druid.segment"
-rw-r--r--  1 clint  staff    30K Jan  8 03:03 /var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8917904562844009022/testIndex-247023712/druid.segment
$ ls -lh "/var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8829079505499972074/testIndex-770267032/druid.segment"
-rw-r--r--  1 clint  staff   8.6K Jan  8 03:04 /var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8829079505499972074/testIndex-770267032/druid.segment
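
For illustration, a minimal sketch of the "store both lengths" approach, assuming a GZIP codec and a hypothetical header layout; the codec and layout actually used in the PR may differ.

```java
// Illustrative sketch only: record both compressed and uncompressed lengths so a
// reader can size its buffers before decompressing. GZIP and the header layout
// here are assumptions; the codec and layout actually used in the PR may differ.
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPOutputStream;

class MetadataCompressionSketch
{
  static void writeCompressedMetadata(FileOutputStream outputStream, byte[] metadataBytes) throws IOException
  {
    ByteArrayOutputStream compressedBuffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(compressedBuffer)) {
      gzip.write(metadataBytes);
    }
    final byte[] compressed = compressedBuffer.toByteArray();

    ByteBuffer lengths = ByteBuffer.allocate(2 * Integer.BYTES);
    lengths.putInt(compressed.length);      // compressed length: how many bytes to read from the file
    lengths.putInt(metadataBytes.length);   // uncompressed length: how big the decompression buffer must be
    outputStream.write(lengths.array());
    outputStream.write(compressed);
  }
}
```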

}
}
// delete all the old 00000.smoosh
f.delete();
Contributor

delete() can fail without throwing an exception (it just returns false). Would be good to check for this.

Member Author

Added a defensive check for this.
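
The pattern being described is roughly this sketch (illustrative names; the Druid code would likely use its own exception types):

```java
// Sketch of checking the result of File#delete, since it signals failure by
// returning false rather than throwing; names are illustrative.
import java.io.File;
import java.io.IOException;

class DeleteCheckSketch
{
  static void deleteOldSmooshFile(File f) throws IOException
  {
    if (!f.delete() && f.exists()) {
      throw new IOException("Failed to delete intermediary smoosh file: " + f);
    }
    // alternatively, java.nio.file.Files.delete(f.toPath()) throws a descriptive exception on failure
  }
}
```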

v10Smoosher.addProjections(projections);

progress.progress();
v10Smoosher.close();
Contributor

I think this won't be closed on exception (since it's not in the closer). Is that ok?

Member Author

IndexMergerV9 is like this too (I started this thing by copying it, heh); it does look like it perhaps leaves an open file stream/channel for currOut of the v9 smoosher (which the v10 file builder currently uses to build intermediary files to concatenate into a v10 file).

Calling close looks like it would try to finish writing whatever segment it had built up so far, which seems kind of useless, so I have added an abort method to the SegmentFileBuilder interface to give implementations a chance to clean up resources without finishing the segment file.
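
The general shape of the close-on-success / abort-on-failure handling being described, as an illustrative sketch rather than the actual SegmentFileBuilder API:

```java
// Illustrative sketch of close-on-success / abort-on-failure; the interface here
// only mirrors the idea of the abort() method described above, not its exact API.
import java.io.Closeable;
import java.io.IOException;

interface SegmentFileBuilderSketch extends Closeable
{
  void addProjections(Object projections) throws IOException;

  // release intermediary files/channels without finishing (or writing) the segment file
  void abort();
}

class MergeCleanupSketch
{
  static void merge(SegmentFileBuilderSketch builder, Object projections) throws IOException
  {
    boolean success = false;
    try {
      builder.addProjections(projections);
      builder.close();  // finishes writing the segment file
      success = true;
    }
    finally {
      if (!success) {
        builder.abort();  // an exception mid-merge should not leak open resources
      }
    }
  }
}
```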

@gianm
Contributor

gianm commented Jan 8, 2026

> 👍 A few questions:
>
> * Will this support reading a single column from a segment (instead of needing to download + scan entire segment)? I guess mapping offset ranges of a segment file is analogous to the row-group concept in Parquet.

Definitely supporting partial downloads at the level of columns and/or projections is a goal of this format, and something it would enable doing.

> * Are there any thoughts to make Druid formats Arrow-compatible? This would open up many more integrations with existing big data ecosystem externally, as well as making intra-cluster data transfer potentially much faster (send everything as RecordBatch).

For intra-cluster data transfer, the MSQ query paths (which to me are the ones I want to focus on 😄) are using Frames, which are similar to Arrow in efficiency. For integrating with the big data ecosystem in ways that require actually using Arrow, there is a question about whether we're doing something for data in flight (RPC) or for data at rest (in object storage). For RPC I think an API that returns Arrow streams can make sense in theory. It wouldn't be related to the segment format, it would be more related to the query side. For data at rest, I don't know how much sense that makes. I haven't heard much of people using Arrow for data at rest.

@jtuglu1
Contributor

jtuglu1 commented Jan 8, 2026


> I haven't heard much of people using Arrow for data at rest.
Check out https://github.com/lancedb/lance!

@clintropolis
Member Author

> Check out https://github.com/lancedb/lance!

Interesting, but unless I am missing something this looks like it just has its own 'at rest' format inspired by some Arrow and Parquet ideas (see https://lance.org/format/file/ and https://lance.org/format/file/encoding/), and can convert to actual Arrow format for interop. Reading through those, there is a lot of overlap with how we do things in our format; we just have not formalized/genericized the various 'structural encodings', as they call them, which remain internal implementation details of our column serializers/deserializers, and there are some differences in how metadata about the contents is stored.

@jtuglu1
Contributor

jtuglu1 commented Jan 8, 2026


Yeah, they have their own file format (at-rest storage), but they use Arrow for all IPC transfer. The readers/writers of the file data also all return Arrow buffers. This makes it super easy to read/write Lance files with other things that speak Arrow (pandas, DataFusion, other DB engines, etc.).

@cryptoe cryptoe merged commit 7fc0515 into apache:master Jan 9, 2026
71 of 72 checks passed
@clintropolis clintropolis deleted the v10-segment-format branch January 9, 2026 09:23
