segment format v10 #18880
Conversation
👍 A few questions:
gianm left a comment:
We've been running this segment format in production for some time and have had positive experiences with it, so my review focuses mostly on smaller things. Big picture, I am happy with the overall design.
```java
  ProjectionMetadata.forBaseTable(indexMergeResult.rowCount, mergedDimensionsWithTime, finalMetadata)
);
// convert v9 projections to v10 projections
for (AggregateProjectionMetadata aggMeta : finalMetadata.getProjections()) {
```
gianm:
Up above there is some handling for what to do if `segmentMetadata` is null, but if it ever is actually null then this line will blow up. Is it meant to be required to be nonnull? If so, remove the `@Nullable` and put a defensive null check earlier in the method.
Author reply:
Added a defensive check to ensure that it is not null. The interface allows it because the v9 merger allows it, but I don't think it should be able to happen when merging v10 segments. Also shuffled things around a bit to clean up.
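For illustration, a minimal sketch of that kind of defensive check (the method shape and names here are hypothetical, not the PR's actual code):

```java
import javax.annotation.Nullable;

import org.apache.druid.segment.Metadata;

// Hypothetical sketch: the shared merger interface permits a null metadata
// because the v9 path allows it, so the v10 path validates eagerly instead
// of blowing up later when projections are read.
void mergeV10(@Nullable Metadata segmentMetadata)
{
  if (segmentMetadata == null) {
    throw new IllegalStateException("segmentMetadata must be nonnull when merging v10 segments");
  }
  // ... safe to iterate segmentMetadata.getProjections() from here ...
}
```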
```java
final byte[] metadataBytes = jsonMapper.writeValueAsBytes(segmentFileMetadata);

try (final FileOutputStream outputStream = new FileOutputStream(new File(baseDir, outputFileName))) {
  // still need to make compression work... probably need to store both compressed and uncompressed lengths? no harm
```
gianm:
It seems fine to leave this for later, but I'd prefer the comment to be worded more firmly about what needs to be done if we ever want to add support for other compressions.
Author reply:
Just went ahead and added support for compression, which seems pretty nice to have, at least in the extreme case of tons of projections in CursorFactoryProjectionTest:
```
$ ls -lh "/var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8917904562844009022/testIndex-247023712/druid.segment"
-rw-r--r--  1 clint  staff    30K Jan  8 03:03 /var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8917904562844009022/testIndex-247023712/druid.segment
$ ls -lh "/var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8829079505499972074/testIndex-770267032/druid.segment"
-rw-r--r--  1 clint  staff   8.6K Jan  8 03:04 /var/folders/8y/mhfmxp391pl9m2h103s_kn200000gn/T/druid8829079505499972074/testIndex-770267032/druid.segment
```
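As a rough sketch of what storing both lengths enables (illustrative only: the marker byte, field widths, and codec here are assumptions, not the PR's actual on-disk layout), a reader can size its decompression buffer up front:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.zip.DeflaterOutputStream;

// Hypothetical length-prefixed, optionally compressed metadata block:
// [compression marker][stored length][uncompressed length][payload]
static byte[] writeMetadataBlock(byte[] metadataBytes, boolean compress) throws IOException
{
  byte[] payload = metadataBytes;
  if (compress) {
    final ByteArrayOutputStream compressedOut = new ByteArrayOutputStream();
    try (DeflaterOutputStream deflater = new DeflaterOutputStream(compressedOut)) {
      deflater.write(metadataBytes);
    }
    payload = compressedOut.toByteArray();
  }

  final ByteArrayOutputStream out = new ByteArrayOutputStream();
  try (DataOutputStream data = new DataOutputStream(out)) {
    data.writeByte(compress ? 1 : 0);     // compression marker
    data.writeInt(payload.length);        // stored (possibly compressed) length
    data.writeInt(metadataBytes.length);  // uncompressed length, for buffer sizing
    data.write(payload);
  }
  return out.toByteArray();
}
```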
```java
  }
}
// delete all the old 00000.smoosh
f.delete();
```
gianm:
delete() can fail without throwing an exception (it just returns false). It would be good to check for this.
Author reply:
Added a defensive check for this.
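The fix amounts to something like this (illustrative; the real code may use a utility method instead):

```java
import java.io.File;
import java.io.IOException;

// File.delete() reports failure by returning false rather than throwing,
// so the result has to be checked explicitly.
static void deleteOrThrow(File f) throws IOException
{
  if (!f.delete()) {
    throw new IOException("Failed to delete old smoosh file [" + f + "]");
  }
}
```

An alternative is `java.nio.file.Files.delete(f.toPath())`, which throws a descriptive `IOException` on failure.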
```java
v10Smoosher.addProjections(projections);

progress.progress();
v10Smoosher.close();
```
gianm:
I think this won't be closed on exception (since it's not in the closer). Is that ok?
Author reply:
IndexMergerV9 is like this too (I started this thing by copying it, heh); it does look like it perhaps leaves an open file stream/channel for currOut of the v9 smoosher (which the v10 file builder currently uses to build intermediary files to concatenate into a v10 file).
Calling close looks like it would try to finish writing whatever segment it had built up so far, which seems kind of useless, so I have added an abort method to the SegmentFileBuilder interface to give implementations a chance to clean up resources without finishing the segment file.
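A minimal sketch of the close-versus-abort split described here (signatures are assumptions; the actual SegmentFileBuilder interface in the PR may differ):

```java
import java.io.IOException;

interface SegmentFileBuilder extends AutoCloseable
{
  // finish writing and flush the segment file
  @Override
  void close() throws IOException;

  // release open streams/channels without finishing the segment file
  void abort() throws IOException;
}

static void buildSegment(SegmentFileBuilder builder) throws IOException
{
  boolean success = false;
  try {
    // ... add columns, projections, and metadata here ...
    builder.close();
    success = true;
  }
  finally {
    if (!success) {
      builder.abort();  // clean up without writing a half-built segment
    }
  }
}
```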
Definitely, supporting partial downloads at the level of columns and/or projections is a goal of this format, and something it would enable.
For intra-cluster data transfer, the MSQ query paths (which to me are the ones I want to focus on 😄) are using Frames, which are similar to Arrow in efficiency. For integrating with the big data ecosystem in ways that require actually using Arrow, there is a question about whether we're doing something for data in flight (RPC) or for data at rest (in object storage). For RPC I think an API that returns Arrow streams can make sense in theory. It wouldn't be related to the segment format, it would be more related to the query side. For data at rest, I don't know how much sense that makes. I haven't heard much of people using Arrow for data at rest.
Interesting, but unless I am missing something this looks like it just has its own 'at rest' format inspired by some Arrow and Parquet stuff (see https://lance.org/format/file/ and https://lance.org/format/file/encoding/), and can convert to the actual Arrow format for interop. Reading through those pages, there is a lot of overlap in how we do things in our format; we just have not formalized/genericized the various 'structural encodings', as they call them, which are internal implementation details of our column serializers/deserializers, and there are some differences in how metadata about the contents is stored.
Yeah, they have their own file format (at-rest storage) but they use Arrow for all IPC transfer. The readers/writers of the file data also all return Arrow buffers. This makes it super easy to read/write Lance files with other things that speak Arrow (pandas, DataFusion, other DB engines, etc.).
Description
This PR introduces a new segment format, applying lessons learned from years of experience with the v9 format and designed to allow partial segment downloads, to greatly improve the efficiency and responsiveness of the virtual storage fabric functionality introduced in #18176 (partial segment downloads are not part of this PR). Overall the changes are more of a remix than any major departure from v9. To streamline partial fetches, the base segment contents are combined into a single file, currently named `druid.segment` in this PR (thoughts on the name welcome; I'm not terribly attached to this one).

Set `druid.indexer.task.buildV10=true` to make segments in the new format.

Layout
* version: equivalent to `version.bin` in the v9 format, a byte that indicates the segment version
* meta compression, length, blob: the unified segment metadata, the newly added `SegmentFileMetadata`
* containers: equivalent to the smoosh chunks of the v9 format (e.g. `00000.smoosh` etc.), but concatenated together in favor of mapping ranges of the file based on offsets stored in the unified metadata
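For illustration, a hypothetical reader for this layout (the field widths, ordering, and marker byte are assumptions for the sketch, not the actual encoding):

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;

static void readHeader(String path) throws IOException
{
  try (DataInputStream in = new DataInputStream(new FileInputStream(path))) {
    final byte version = in.readByte();          // like version.bin in v9
    final byte metaCompression = in.readByte();  // how the metadata blob is stored
    final int metaLength = in.readInt();         // length of the metadata blob
    final byte[] metaBlob = new byte[metaLength];
    in.readFully(metaBlob);                      // SegmentFileMetadata (JSON)
    // container bytes follow; readers map ranges of the file using offsets
    // recorded in the metadata rather than consulting separate smoosh files
  }
}
```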
SegmentFileMetadata
One of the bigger changes compared to the v9 format is the consolidation of all the various metadata stored in the segment into a single JSON blob, `SegmentFileMetadata`. In the v9 segment format, metadata is split across a variety of places:

* meta.smoosh: metadata about which internal files are present, and their offsets within the smoosh containers
* index.drd: list of non-null columns, list of non-null dimensions, interval, bitmap factory, list of all columns including nulls, list of all dimensions including null-only columns
* metadata.drd: `Metadata` containing aggregators, timestampSpec, query granularity, rollup flag, ordering, and the list of projections
* `ColumnDescriptor`s scattered across the internal files of the smoosh, which contain type information and describe how to load a column supplier

This metadata has all been consolidated into a single place to make it easy to retrieve information about both schema and layout, which is the key to how v10 will be able to support partial downloads. Schema information is expressed as a set of projections (including modeling the base table as a projection), and the `ColumnDescriptor`s are pulled out of the column files and instead live in the metadata. In virtual storage mode, this metadata will be fetched on segment load; since it contains both where in the file the data is located and how to read it, the reader will be able to fetch only the data actually required to complete the query.
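For example, under the assumption that the metadata records an offset and length for each internal file (the actual metadata shape may differ), a partial fetch reduces to mapping just the needed byte range:

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Illustrative only: given a column's offset and length from
// SegmentFileMetadata, map just that slice of druid.segment instead of
// mapping (or downloading) the whole file.
static MappedByteBuffer mapColumn(Path segmentFile, long offset, long length) throws IOException
{
  try (FileChannel channel = FileChannel.open(segmentFile, StandardOpenOption.READ)) {
    // the mapping stays valid after the channel is closed
    return channel.map(FileChannel.MapMode.READ_ONLY, offset, length);
  }
}
```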
External files

The v10 format also supports the concept of 'external' segment containers, which can be 'attached' to the base segment to augment it with additional/optional data; this PR has very rudimentary support for them. This is a very experimental feature: our initial thinking is to support use cases like optional indexes that can be downloaded separately (or even constructed at load time/on the fly). In the current implementation provided in this PR, column serializers can specify additional 'external' segment files to write contents to during segment creation, and readers can refer to these files during segment mapping.
In its current form this is more of an organizational feature: if used, the external segment files will just be included and pushed to deep storage as part of publishing, and downloaded on fetch, but no actual column implementations use this at this time. Future work will expand on this functionality to realize the ideas suggested above.
Release note
todo