Conversation
There was a problem hiding this comment.
Pull request overview
Implements row-group pruning (via bbox metadata) and refactors the async collection pipeline to improve OMF retrieval performance (Issue #83).
Changes:
- Added task-based async pipeline that loads Parquet metadata up-front and schedules work per row-group chunk.
- Implemented bbox-based row-group pruning using Parquet column statistics.
- Extended row filter configuration with validation for combined filters and bbox extraction to enable pruning.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.
Show a summary per file
| File | Description |
|---|---|
| rust/bambam-omf/src/collection/mod.rs | Registers the new collector_ops module. |
| rust/bambam-omf/src/collection/collector_ops.rs | Adds metadata→task processing, row-group pruning, and stream construction per row-group chunk. |
| rust/bambam-omf/src/collection/collector.rs | Refactors collection to a two-phase async pipeline (metadata/tasks then batch retrieval) and updates the ignored integration test expectations. |
| rust/bambam-omf/src/collection/collector_config.rs | Updates collector construction signature (but currently drops batch_size usage). |
| rust/bambam-omf/src/collection/filter/mod.rs | Exposes new RowGroupFilter trait. |
| rust/bambam-omf/src/collection/filter/row_group_filter.rs | Introduces a trait interface for row-group pruning. |
| rust/bambam-omf/src/collection/filter/row_filter_config.rs | Adds combined-filter uniqueness validation and helper to extract bbox filters for pruning. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| pub fn validate_unique_variant(&self) -> Result<(), OvertureMapsCollectionError> { | ||
| match self { | ||
| Self::Combined { filters } => { | ||
| let mut seen_variants: HashSet<std::mem::Discriminant<RowFilterConfig>> = |
There was a problem hiding this comment.
this is interesting, first i've seen of std::mem::disciriminant in action.
There was a problem hiding this comment.
looks good. i have a few small requests for doc comments, and i've made a follow-on issue for a few small user-focused features to add on top of this implementation.
i'm trying to run this now just to get a sanity check on performance. but it's been hanging for me. can you run this command and tell me what kind of runtimes you get for this very small network boundary?
RUST_LOG=info ./rust/target/release/bambam-omf network -c configuration/bambam-omf/travel-mode-filter.json -b -105.173256,-105.164802,39.817799,39.823534 -o out/test
one other sanity check we may want to add: if i got a bbox min and max mixed up, the requested box would suddenly cover the entire world. we should probably put some sanity check, something like making sure the width of the box is not greater than 180 and the height is not greater than 90.
|
Quickly ran the sanity check to get a good and a bad news: |
robfitzgerald
left a comment
There was a problem hiding this comment.
I'm going to say that both your good news and bad news are in fact good news... For this PR. I'll create a bug issue for the speeds panic. Thanks Yamil, nice work, these are great improvements!
|
Created issue #92 for our bad news |
In this PR, I implement row_group pruning to optimize OMF data retrieval. I also refactor the async pipeline to improve readability and control over the parallelism.
I was finally able to leave only two
awaits, both with real meaning. The first retrieves the row group metadata, and the second collects the batches based on that information.Performance is improved significantly. I would have liked to also include the deseralization
RecordBatch->OvertureRecordinto the same tokio pipeline but the types made it difficult. It was much easier to return the batches from the tokio pipeline and then just use rayon to process everything offline.closes #83