
Conversation

@clintropolis
Member

Description

This PR introduces a new segment format, taking learnings from years of experience with the v9 format and designed to allow partial segment downloads, which should greatly improve the efficiency and responsiveness of the virtual storage fabric functionality introduced in #18176 (partial segment downloads themselves are not part of this PR). Overall the changes are more of a remix than a major departure from v9. To streamline partial fetches, the base segment contents are combined into a single file, currently named druid.segment in this PR (thoughts on the name are welcome; I'm not terribly attached to this one).

Set druid.indexer.task.buildV10=true to make segments in the new format.

Layout

| version (byte) | meta compression (byte) | meta length (int) | meta json | chunk 0 | chunk 1 | ... | chunk n |

version: equivalent to version.bin in the v9 format, a byte that indicates the segment version
meta compression, length, json: the unified segment metadata, the newly added SegmentFileMetadata
chunks: equivalent to the smoosh chunks (containers) of the v9 format (e.g. 00000.smoosh etc.), but concatenated together; ranges of the file are mapped based on offsets stored in the unified metadata
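
To make the layout concrete, here is a minimal sketch of how a reader could walk the header; the class, method, and variable names are hypothetical, the byte order is an assumption, and the actual reader code in this PR may differ.

```java
// Hypothetical sketch of walking the v10 header described above. Class,
// method, and variable names are illustrative, and big-endian byte order is
// an assumption; the actual reader code in this PR may differ.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class SegmentHeaderSketch
{
  static void readHeader(Path segmentFile) throws IOException
  {
    try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
      // | version (byte) | meta compression (byte) | meta length (int) | meta json | chunk 0 | ... |
      ByteBuffer fixed = ByteBuffer.allocate(1 + 1 + Integer.BYTES).order(ByteOrder.BIG_ENDIAN);
      while (fixed.hasRemaining() && channel.read(fixed) >= 0) {
        // keep reading until the fixed-size prefix is fully buffered
      }
      fixed.flip();
      final byte version = fixed.get();          // segment format version, like version.bin in v9
      final byte metaCompression = fixed.get();  // how the metadata blob is compressed
      final int metaLength = fixed.getInt();     // length of the metadata blob that follows

      ByteBuffer metaBytes = ByteBuffer.allocate(metaLength);
      while (metaBytes.hasRemaining() && channel.read(metaBytes) >= 0) {
        // read the metadata blob
      }
      metaBytes.flip();
      // assuming metaCompression means "uncompressed" here for simplicity; a real reader would
      // decompress first, then deserialize into SegmentFileMetadata. The chunk data follows,
      // and ranges of it are mapped using offsets recorded in that metadata.
      final String metaJson = StandardCharsets.UTF_8.decode(metaBytes).toString();
      System.out.println("version=" + version + " compression=" + metaCompression + " meta=" + metaJson);
    }
  }
}
```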

SegmentFileMetadata

One of the bigger changes compared to the v9 format is the consolidation of all the various metadata stored in the segment into a single JSON blob, SegmentFileMetadata. In the v9 segment format, metadata is split across a variety of places:
* meta.smoosh: metadata about which internal files are present and their offsets within the smoosh containers
* index.drd: the list of non-null columns, the list of non-null dimensions, the interval, the bitmap factory, the list of all columns including nulls, and the list of all dimensions including null-only dimensions
* metadata.drd: aggregators, timestampSpec, query granularity, rollup flag, ordering, and the list of projections
* ColumnDescriptor: scattered across the internal files of the smoosh, containing type information and how to load a column supplier

This metadata has all been consolidated into a single place to make it easy to retrieve information about both schema and layout, which is the key to how v10 will be able to support partial downloads. Schema information is expressed as a set of projections (including modeling the base table as a projection), and the ColumnDescriptors are pulled out of the column files and instead live in the metadata. In virtual storage mode, this metadata will be fetched on segment load; since it contains both where in the file the data is located and how to read it, the server will be able to fetch only the data which is actually required to complete the query.
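
As a rough illustration of what gets consolidated, a simplified and hypothetical shape of the unified metadata might look like the following; all names below are illustrative, not the actual fields of SegmentFileMetadata in this PR. The idea is that schema (projections and column descriptors) and layout (per-file offsets) live together.

```java
// Simplified, hypothetical sketch of the kind of information the unified metadata
// consolidates; all names below are illustrative, not the actual fields of
// SegmentFileMetadata in this PR.
import java.util.List;
import java.util.Map;

class SegmentFileMetadataSketch
{
  // layout: where each internal file lives inside druid.segment
  static class InternalFile
  {
    String name;        // e.g. a column's data file
    int chunk;          // which concatenated chunk it lives in
    long startOffset;   // byte range within that chunk
    long endOffset;
  }

  // schema: the base table and any aggregate projections, each with its columns
  static class Projection
  {
    String name;                            // the base table is modeled as a projection too
    int rowCount;
    List<String> dimensions;
    Map<String, Object> columnDescriptors;  // per-column type info and how to load its column supplier
  }

  List<Projection> projections;
  List<InternalFile> files;
  // plus what v9 kept in index.drd / metadata.drd: interval, bitmap factory,
  // aggregators, timestampSpec, query granularity, rollup flag, ordering, etc.
}
```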

External files

The v10 format also supports the concept of 'external' segment containers, which can be 'attached' to the base segment to augment it with additional/optional data; this PR has very rudimentary support for them. This is a very experimental feature; our initial thinking is to support use cases like optional indexes that can be downloaded separately (or even constructed at load time/on the fly). In the current implementation provided in this PR, column serializers can specify additional 'external' segment files to write contents to during segment creation, and readers can refer to these files during segment mapping.

In its current form this is more of an organizational feature; if used, the external segment files will just be included and pushed to deep storage as part of publishing, and downloaded on fetch, but no actual column implementations use this at this time. Future work will expand on this functionality to realize the ideas suggested above.
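
For a sense of the shape of the idea, here is a purely illustrative sketch; the interface and method names are hypothetical and not the actual API added in this PR.

```java
// Purely illustrative sketch of the external-file idea; the interface and method
// names here are hypothetical, not the actual API added in this PR.
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.WritableByteChannel;

interface ExternalFileWriter
{
  // a serializer asks for a named external container instead of writing into the base druid.segment
  WritableByteChannel openExternal(String externalFileName) throws IOException;
}

class OptionalIndexSerializerSketch
{
  void writeOptionalIndex(ExternalFileWriter writer, ByteBuffer indexData) throws IOException
  {
    // the optional index goes to its own container, so it could be pushed, downloaded,
    // or even rebuilt independently of the base segment file
    try (WritableByteChannel channel = writer.openExternal("hypothetical_optional_index")) {
      channel.write(indexData);
    }
  }
}
```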

Release note

todo


This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@github-actions github-actions bot added Area - Batch Ingestion Area - Segment Format and Ser/De Area - Ingestion Area - MSQ For multi stage queries - https://github.com/apache/druid/issues/12262 labels Dec 31, 2025
@jtuglu1
Contributor

jtuglu1 commented Jan 1, 2026

👍 A few questions:

  • Will this support reading a single column from a segment (instead of needing to download + scan entire segment)? I guess mapping offset ranges of a segment file is analogous to the row-group concept in Parquet.
  • Are there any thoughts to make Druid formats Arrow-compatible? This would open up many more integrations with existing big data ecosystem externally, as well as making intra-cluster data transfer potentially much faster (send everything as RecordBatch).

…ooshFileMetadata to SegmentInternalFileMetadata, improve some tests to have an IndexMergerV10Factory available to avoid future debt if things want to test with v10
Contributor

@gianm gianm left a comment

We've been running this segment format in production for some time and have had positive experiences with it, so my review focuses mostly on smaller things. Big picture, I am happy with the overall design.

ProjectionMetadata.forBaseTable(indexMergeResult.rowCount, mergedDimensionsWithTime, finalMetadata)
);
// convert v9 projections to v10 projections
for (AggregateProjectionMetadata aggMeta : finalMetadata.getProjections()) {
Contributor

Up above there is some handling for what to do if segmentMetadata is null, but if it ever is actually null then this line will blow up. Is it meant to be required to be nonnull? If so remove the @Nullable and put a defensive null check earlier in the method.

Member Author

Added a defensive check to ensure that it is not null; the interface allows it because the v9 merger allows it, but I don't think it should be able to happen when merging v10 segments. Also shuffled stuff around a bit to clean up.
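
For reference, the kind of defensive check being discussed is roughly the following sketch (a generic helper, not the actual code added in the PR):

```java
// Sketch of the kind of defensive check discussed above; generic helper rather
// than the actual code added in the PR.
import java.util.Objects;

class NullCheckSketch
{
  static <T> T requireSegmentMetadata(T segmentMetadata)
  {
    // v9 tolerates a null metadata object, but the v10 merge path should never see one,
    // so fail fast with a clear message instead of an NPE deep in projection handling
    return Objects.requireNonNull(segmentMetadata, "segmentMetadata must be non-null when merging v10 segments");
  }
}
```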

final byte[] metadataBytes = jsonMapper.writeValueAsBytes(segmentFileMetadata);

try (final FileOutputStream outputStream = new FileOutputStream(new File(baseDir, outputFileName))) {
// still need to make compression work... probably need to store both compressed and uncompressed lengths? no harm
Contributor

It seems fine to leave this for later but I'd prefer the comment to be worded more firmly as to what needs to be done if we ever want to add support for other compressions.

Member Author

Just went ahead and added support for compression, which seems pretty nice to have, at least in the extreme case of tons of projections in CursorFactoryProjectionTest:

$ ls -lh "/var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8917904562844009022/testIndex-247023712/druid.segment"
-rw-r--r--  1 clint  staff    30K Jan  8 03:03 /var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8917904562844009022/testIndex-247023712/druid.segment
$ ls -lh "/var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8829079505499972074/testIndex-770267032/druid.segment"
-rw-r--r--  1 clint  staff   8.6K Jan  8 03:04 /var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8829079505499972074/testIndex-770267032/druid.segment
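
For illustration, a minimal sketch of the "store both lengths" approach, assuming a GZIP codec and a hypothetical header layout; the codec and layout actually used in the PR may differ.

```java
// Illustrative sketch only: record both compressed and uncompressed lengths so a
// reader can size its buffers before decompressing. GZIP and the header layout
// here are assumptions; the codec and layout actually used in the PR may differ.
import java.io.ByteArrayOutputStream;
import java.io.FileOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.util.zip.GZIPOutputStream;

class MetadataCompressionSketch
{
  static void writeCompressedMetadata(FileOutputStream outputStream, byte[] metadataBytes) throws IOException
  {
    ByteArrayOutputStream compressedBuffer = new ByteArrayOutputStream();
    try (GZIPOutputStream gzip = new GZIPOutputStream(compressedBuffer)) {
      gzip.write(metadataBytes);
    }
    final byte[] compressed = compressedBuffer.toByteArray();

    ByteBuffer lengths = ByteBuffer.allocate(2 * Integer.BYTES);
    lengths.putInt(compressed.length);      // compressed length: how many bytes to read from the file
    lengths.putInt(metadataBytes.length);   // uncompressed length: how big the decompression buffer must be
    outputStream.write(lengths.array());
    outputStream.write(compressed);
  }
}
```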

}
}
// delete all the old 00000.smoosh
f.delete();
Contributor

delete() can fail without throwing an exception (it just returns false). Would be good to check for this.

Member Author

Added a defensive check for this.
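
The pattern being described is roughly this sketch (illustrative names; the Druid code would likely use its own exception types):

```java
// Sketch of checking the result of File#delete, since it signals failure by
// returning false rather than throwing; names are illustrative.
import java.io.File;
import java.io.IOException;

class DeleteCheckSketch
{
  static void deleteOldSmooshFile(File f) throws IOException
  {
    if (!f.delete() && f.exists()) {
      throw new IOException("Failed to delete intermediary smoosh file: " + f);
    }
    // alternatively, java.nio.file.Files.delete(f.toPath()) throws a descriptive exception on failure
  }
}
```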

v10Smoosher.addProjections(projections);

progress.progress();
v10Smoosher.close();
Contributor

I think this won't be closed on exception (since it's not in the closer). Is that ok?

Member Author

IndexMergerV9 is like this too (I started this thing by copying it, heh); it does look like it perhaps leaves an open file stream/channel for currOut of the v9 smoosher (which the v10 file builder currently uses to build intermediary files to concatenate into a v10 file).

Calling close looks like it would try to finish writing whatever segment it had built up so far, which seems kind of useless, so I have added an abort method to the SegmentFileBuilder interface to give implementations a chance to clean up resources without finishing the segment file.
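
The general shape of the close-on-success / abort-on-failure handling being described, as an illustrative sketch rather than the actual SegmentFileBuilder API:

```java
// Illustrative sketch of close-on-success / abort-on-failure; the interface here
// only mirrors the idea of the abort() method described above, not its exact API.
import java.io.Closeable;
import java.io.IOException;

interface SegmentFileBuilderSketch extends Closeable
{
  void addProjections(Object projections) throws IOException;

  // release intermediary files/channels without finishing (or writing) the segment file
  void abort();
}

class MergeCleanupSketch
{
  static void merge(SegmentFileBuilderSketch builder, Object projections) throws IOException
  {
    boolean success = false;
    try {
      builder.addProjections(projections);
      builder.close();  // finishes writing the segment file
      success = true;
    }
    finally {
      if (!success) {
        builder.abort();  // an exception mid-merge should not leak open resources
      }
    }
  }
}
```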

@gianm
Contributor

gianm commented Jan 8, 2026

> 👍 A few questions:
>
> * Will this support reading a single column from a segment (instead of needing to download + scan entire segment)? I guess mapping offset ranges of a segment file is analogous to the row-group concept in Parquet.

Definitely supporting partial downloads at the level of columns and/or projections is a goal of this format, and something it would enable doing.

> * Are there any thoughts to make Druid formats Arrow-compatible? This would open up many more integrations with existing big data ecosystem externally, as well as making intra-cluster data transfer potentially much faster (send everything as RecordBatch).

For intra-cluster data transfer, the MSQ query paths (which to me are the ones I want to focus on 😄) are using Frames, which are similar to Arrow in efficiency. For integrating with the big data ecosystem in ways that require actually using Arrow, there is a question about whether we're doing something for data in flight (RPC) or for data at rest (in object storage). For RPC I think an API that returns Arrow streams can make sense in theory. It wouldn't be related to the segment format, it would be more related to the query side. For data at rest, I don't know how much sense that makes. I haven't heard much of people using Arrow for data at rest.

@jtuglu1
Contributor

jtuglu1 commented Jan 8, 2026


> I haven't heard much of people using Arrow for data at rest.
Check out https://github.com/lancedb/lance!

@clintropolis
Member Author

> Check out https://github.com/lancedb/lance!

Interesting, but unless I am missing something this looks like it just has its own 'at rest' format inspired by some Arrow and Parquet ideas (see https://lance.org/format/file/ and https://lance.org/format/file/encoding/), and can convert to actual Arrow format for interop. Reading through those, there is a lot of overlap with how we do things in our format; we just have not formalized/genericized the various 'structural encodings', as they call them, which remain internal implementation details of our column serializers/deserializers, and there are some differences in how metadata about the contents is stored.

@jtuglu1
Contributor

jtuglu1 commented Jan 8, 2026


Yeah, they have their own file format (at-rest storage), but they use Arrow for all IPC transfer. The readers/writers of the file data also all return Arrow buffers. This makes it super easy to read/write Lance files with other things that speak Arrow (pandas, DataFusion, other DB engines, etc.).

@cryptoe cryptoe merged commit 7fc0515 into apache:master Jan 9, 2026
71 of 72 checks passed
@clintropolis clintropolis deleted the v10-segment-format branch January 9, 2026 09:23
