Scan Delete Support Part 3: ArrowReader::build_deletes_row_selection implementation #951
Merged
liurenjie1024 merged 4 commits into apache:main on Apr 7, 2025
ArrowReader::build_deletes_row_selection implementation
This was referenced Feb 21, 2025
jonathanc-n (Contributor) reviewed on Mar 4, 2025 and left a comment:
Overall looks good, need more eyes, might've missed something
liurenjie1024 pushed a commit that referenced this pull request on Mar 20, 2025:
…se in `ArrowReader` (#950)

Second part of delete file read support. See #630.

This PR provides the basis for delete file support within `ArrowReader`.

`DeleteFileManager` is introduced, in skeleton form. Full implementation of its behaviour will be submitted in follow-up PRs.

`DeleteFileManager` is responsible for loading and parsing positional and equality delete files from `FileIO`. Once delete files for a task have been loaded and parsed, `ArrowReader::process_file_scan_task` uses the resulting `DeleteFileManager` in two places:

* `DeleteFileManager::get_delete_vector_for_task` is passed a data file path and will return an ~`Option<Vec<usize>>`~ `Option<RoaringTreemap>` containing the indices of all rows that are positionally deleted in that data file (or `None` if there are none)
* `DeleteFileManager::build_delete_predicate` is invoked with the schema from the file scan task. It will return an `Option<BoundPredicate>` representing the filter predicate derived from all of the applicable equality deletes being transformed into predicates, logically joined into a single predicate and then bound to the schema (or `None` if there are no applicable equality deletes)

This PR integrates the skeleton of the `DeleteFileManager` into `ArrowReader::process_file_scan_task`, extending the `RowFilter` and `RowSelection` logic to take into account any `RowFilter` that results from equality deletes and any `RowSelection` that results from positional deletes.

## Updates:

* refactored `DeleteFileManager` so that `get_positional_delete_indexes_for_data_file` returns a `RoaringTreemap` rather than a `Vec<usize>`. This was based on @liurenjie1024's recommendation in a comment on the v1 PR. It makes a lot of sense from a performance perspective and made it easier to implement `ArrowReader::build_deletes_row_selection` in the follow-up PR to this one, #951
* `DeleteFileManager` is instantiated in the `ArrowReader` constructor rather than per-scan-task, so that delete files that apply to more than one task don't end up getting loaded and parsed twice

## Potential further enhancements:

* Go one step further and move loading of delete files, and parsing of positional delete files, into `ObjectCache` to ensure that loading and parsing of the same files persists across scans
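The commit message above describes combining a `RowSelection` derived from positional deletes with one derived from the scan filter. As a rough illustration of what that combination means, here is a dependency-free sketch that intersects two run-length selections: a row survives only if both selections keep it. The `Run` tuple type and the `intersect` function are hypothetical stand-ins invented for this example, not the parquet crate's types, and the sketch assumes both inputs cover the same number of rows.

```rust
/// Hypothetical stand-in for a row selector: (row_count, skip).
type Run = (usize, bool);

/// Intersect two run-length selections covering the same rows.
/// A row is kept only when neither input skips it.
fn intersect(a: &[Run], b: &[Run]) -> Vec<Run> {
    let mut out: Vec<Run> = Vec::new();
    // Ignore empty runs so every step consumes at least one row.
    let mut a_iter = a.iter().copied().filter(|r| r.0 > 0);
    let mut b_iter = b.iter().copied().filter(|r| r.0 > 0);
    let (mut cur_a, mut cur_b) = (a_iter.next(), b_iter.next());

    while let (Some((na, skip_a)), Some((nb, skip_b))) = (cur_a, cur_b) {
        // Consume the overlap of the two current runs.
        let n = na.min(nb);
        let skip = skip_a || skip_b;
        match out.last_mut() {
            // Merge with the previous output run when it is of the same kind.
            Some((count, s)) if *s == skip => *count += n,
            _ => out.push((n, skip)),
        }
        // Carry over any remainder of a run, otherwise advance to the next one.
        cur_a = if na > n { Some((na - n, skip_a)) } else { a_iter.next() };
        cur_b = if nb > n { Some((nb - n, skip_b)) } else { b_iter.next() };
    }
    out
}

fn main() {
    // Rows 3 and 4 are positionally deleted in an 8-row file.
    let deletes = [(3, false), (2, true), (3, false)];
    // The filter keeps only rows 1 through 5.
    let filter = [(1, true), (5, false), (2, true)];
    println!("{:?}", intersect(&deletes, &filter));
}
```

In the real reader this step would presumably be delegated to the parquet crate's own `RowSelection` type rather than hand-rolled; the sketch only shows the semantics.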
Contributor
cc @sdd Would you help to resolve conflicts first?
sdd commented on Mar 28, 2025:
 regex = "1.10.5"
 reqwest = { version = "0.12.2", default-features = false, features = ["json"] }
-roaring = "0.10"
+roaring = { version = "0.10", git = "https://github.com/RoaringBitmap/roaring-rs.git" }
Contributor
Author
Waiting on this to be released by roaring-rs so that we don't need the git ref here
liurenjie1024 (Contributor) left a comment:
Thanks @sdd for this PR. Just one minor suggestion to improve the tests; everything else looks great!
}

#[test]
fn test_build_deletes_row_selection() {
Contributor
This test alone is not enough. Should we consider using property-based testing to generate random data to verify it?
Contributor
Author
Good suggestion. Will do.
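For readers unfamiliar with the idea, a randomized check of the kind suggested above might look roughly like the following. This is only a sketch under stated assumptions: `deletes_to_runs` is a hypothetical, simplified stand-in for the method under review (it uses a `BTreeSet<u64>` in place of a `RoaringTreemap` and plain `(row_count, skip)` tuples in place of parquet's row selectors), and a hand-rolled xorshift generator replaces a property-testing crate so that the example has no dependencies.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for the method under test: (row_count, skip) runs.
fn deletes_to_runs(deleted: &BTreeSet<u64>, total_rows: u64) -> Vec<(u64, bool)> {
    let mut runs: Vec<(u64, bool)> = Vec::new();
    let mut next_row = 0u64; // first row not yet covered by a run
    for &pos in deleted.range(..total_rows) {
        if pos > next_row {
            runs.push((pos - next_row, false)); // keep rows before this delete
        }
        match runs.last_mut() {
            Some((count, true)) => *count += 1, // extend an adjacent skip run
            _ => runs.push((1, true)),
        }
        next_row = pos + 1;
    }
    if next_row < total_rows {
        runs.push((total_rows - next_row, false)); // keep the tail
    }
    runs
}

/// Tiny xorshift PRNG so the example needs no external crates.
struct XorShift(u64);

impl XorShift {
    fn next(&mut self) -> u64 {
        self.0 ^= self.0 << 13;
        self.0 ^= self.0 >> 7;
        self.0 ^= self.0 << 17;
        self.0
    }
}

/// Run `cases` random trials; return a description of the first failure, if any.
fn check_random_cases(seed: u64, cases: usize) -> Result<(), String> {
    let mut rng = XorShift(seed);
    for _ in 0..cases {
        let total_rows = rng.next() % 200;
        let n_deletes = rng.next() % 50;
        // Positions may exceed total_rows on purpose: such deletes must be ignored.
        let deleted: BTreeSet<u64> = (0..n_deletes).map(|_| rng.next() % 250).collect();
        let runs = deletes_to_runs(&deleted, total_rows);

        // Property 1: expanding the runs reproduces the naive per-row keep mask.
        let mask: Vec<bool> = runs
            .iter()
            .flat_map(|&(n, skip)| std::iter::repeat(!skip).take(n as usize))
            .collect();
        let expected: Vec<bool> = (0..total_rows).map(|i| !deleted.contains(&i)).collect();
        if mask != expected {
            return Err(format!("mask mismatch: rows={total_rows} deleted={deleted:?}"));
        }
        // Property 2: no empty runs, and adjacent runs alternate between keep and skip.
        if runs.iter().any(|r| r.0 == 0) || runs.windows(2).any(|w| w[0].1 == w[1].1) {
            return Err(format!("non-canonical runs: {runs:?}"));
        }
    }
    Ok(())
}

fn main() {
    check_random_cases(0x5EED, 1000).expect("property violated");
    println!("all random cases passed");
}
```

Comparing against a naive per-row oracle is what gives the check its value: the oracle is trivially correct, so any disagreement points at the run-building logic.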
liurenjie1024 approved these changes on Apr 7, 2025
liurenjie1024 (Contributor) left a comment:
Thanks @sdd , LGTM!
Third part of delete file read support. See #630
**Builds on top of #950**

`build_deletes_row_selection` computes a `RowSelection` from a `RoaringTreemap` representing the indexes of rows in a data file that have been marked as deleted by positional delete files that apply to the data file being read (and, in the future, delete vectors).

The resulting `RowSelection` will be merged with a `RowSelection` resulting from the scan's filter predicate (if present) and supplied to the `ParquetRecordBatchStreamBuilder` so that deleted rows are omitted from the `RecordBatchStream` returned by the reader.

NB: I encountered quite a few edge cases in this method and the logic is quite complex. There is a good chance that a keen-eyed reviewer would be able to conceive of an edge case that I haven't covered.
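To make the core idea in the description concrete, here is a minimal, dependency-free sketch of turning a set of deleted row positions into alternating keep/skip runs. It is not the code from this PR: `Run`, `push_run` and `deletes_to_runs` are hypothetical names, a `BTreeSet<u64>` stands in for the `RoaringTreemap`, and the sketch ignores complications such as skipped row groups, which may well be where the edge cases mentioned above arise.

```rust
use std::collections::BTreeSet;

/// Hypothetical stand-in for a row selector: a run of rows to keep or skip.
#[derive(Debug, Clone, PartialEq, Eq)]
struct Run {
    row_count: u64,
    skip: bool,
}

/// Append a run, merging it into the previous run when both are of the same kind.
fn push_run(runs: &mut Vec<Run>, row_count: u64, skip: bool) {
    if row_count == 0 {
        return;
    }
    match runs.last_mut() {
        Some(last) if last.skip == skip => last.row_count += row_count,
        _ => runs.push(Run { row_count, skip }),
    }
}

/// Turn a sorted set of deleted positions into runs covering rows `0..total_rows`.
/// Positions at or beyond `total_rows` are ignored.
fn deletes_to_runs(deleted: &BTreeSet<u64>, total_rows: u64) -> Vec<Run> {
    let mut runs = Vec::new();
    let mut next_row = 0u64; // first row not yet covered by a run
    for &pos in deleted.range(..total_rows) {
        push_run(&mut runs, pos - next_row, false); // keep rows before the delete
        push_run(&mut runs, 1, true); // skip the deleted row itself
        next_row = pos + 1;
    }
    push_run(&mut runs, total_rows - next_row, false); // keep the remaining tail
    runs
}

fn main() {
    // Rows 1, 2 and 5 are deleted; position 40 lies outside the 8-row file.
    let deleted = BTreeSet::from([1u64, 2, 5, 40]);
    for run in deletes_to_runs(&deleted, 8) {
        println!("{run:?}");
    }
}
```

Merging adjacent runs of the same kind in `push_run` keeps the output compact when many consecutive rows are deleted, which matters when the selection is later combined with the one derived from the filter predicate.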