Row group metadata optimization #88

Merged
yamilbknsu merged 13 commits into main from yep/row-group-metadata-optimization
Jan 27, 2026

Conversation

@yamilbknsu
Collaborator

In this PR, I implement row_group pruning to optimize OMF data retrieval. I also refactor the async pipeline to improve readability and control over the parallelism.

I was finally able to leave only two awaits, both with real meaning. The first retrieves the row group metadata, and the second collects the batches based on that information.

Performance improves significantly. I would have liked to also include the deserialization RecordBatch -> OvertureRecord in the same tokio pipeline, but the types made it difficult. It was much easier to return the batches from the tokio pipeline and then use rayon to process everything offline.
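
For readers following along, the split described above (fetch batches first, run the CPU-bound deserialization afterwards) might look roughly like the sketch below. This is not the PR's code: `RawBatch`, `Record`, `deserialize_batch`, and `deserialize_all` are hypothetical stand-ins for `RecordBatch`, `OvertureRecord`, and the real conversion, and scoped threads from the standard library are swapped in for rayon so the snippet has no dependencies.

```rust
use std::thread;

/// Hypothetical stand-in for an Arrow `RecordBatch`: one batch of raw ids.
#[derive(Debug, Clone, PartialEq)]
pub struct RawBatch(pub Vec<String>);

/// Hypothetical stand-in for `OvertureRecord`.
#[derive(Debug, Clone, PartialEq)]
pub struct Record {
    pub id: String,
}

/// Hypothetical per-batch conversion; the real one is type-specific.
fn deserialize_batch(batch: &RawBatch) -> Vec<Record> {
    batch.0.iter().map(|id| Record { id: id.clone() }).collect()
}

/// Deserializes already-collected batches on `workers` scoped threads.
/// Output order matches input order because handles are joined in spawn order.
pub fn deserialize_all(batches: &[RawBatch], workers: usize) -> Vec<Record> {
    let workers = workers.max(1);
    // Ceiling division, clamped to 1 so `chunks` never receives a zero size.
    let chunk_len = ((batches.len() + workers - 1) / workers).max(1);
    thread::scope(|scope| {
        // Spawn every worker before joining any of them.
        let handles: Vec<_> = batches
            .chunks(chunk_len)
            .map(|chunk| {
                scope.spawn(move || {
                    chunk.iter().flat_map(deserialize_batch).collect::<Vec<Record>>()
                })
            })
            .collect();
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("deserialization worker panicked"))
            .collect()
    })
}
```

Keeping this step outside the async runtime means the conversion only needs a plain slice of batches, which sidesteps the type constraints mentioned above.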

closes #83

Contributor

Copilot AI left a comment

Pull request overview

Implements row-group pruning (via bbox metadata) and refactors the async collection pipeline to improve OMF retrieval performance (Issue #83).

Changes:

  • Added task-based async pipeline that loads Parquet metadata up-front and schedules work per row-group chunk.
  • Implemented bbox-based row-group pruning using Parquet column statistics.
  • Extended row filter configuration with validation for combined filters and bbox extraction to enable pruning.
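
As an illustration of the pruning idea summarized above, here is a minimal sketch that assumes each row group exposes min/max statistics for its bbox columns. The types `Bbox` and `RowGroupStats` and the functions `prune_row_groups` and `chunk_row_groups` are hypothetical and are not taken from the PR; the real implementation reads these values from Parquet metadata.

```rust
/// Query bounding box in degrees (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Bbox {
    pub xmin: f64,
    pub xmax: f64,
    pub ymin: f64,
    pub ymax: f64,
}

/// Column statistics for one row group (hypothetical type).
/// Each field is the min or max recorded for a bbox column.
#[derive(Debug, Clone, Copy)]
pub struct RowGroupStats {
    pub index: usize,
    pub xmin_min: f64, // smallest bbox.xmin in the row group
    pub xmax_max: f64, // largest bbox.xmax in the row group
    pub ymin_min: f64, // smallest bbox.ymin in the row group
    pub ymax_max: f64, // largest bbox.ymax in the row group
}

impl RowGroupStats {
    /// True if some row in this group could intersect the query box.
    /// This is conservative: it can keep a group with no matching rows,
    /// but it never drops a group that contains one.
    pub fn may_intersect(&self, query: &Bbox) -> bool {
        self.xmin_min <= query.xmax
            && self.xmax_max >= query.xmin
            && self.ymin_min <= query.ymax
            && self.ymax_max >= query.ymin
    }
}

/// Returns the indices of the row groups that survive pruning.
pub fn prune_row_groups(stats: &[RowGroupStats], query: &Bbox) -> Vec<usize> {
    stats
        .iter()
        .filter(|s| s.may_intersect(query))
        .map(|s| s.index)
        .collect()
}

/// Groups surviving row-group indices into chunks, one chunk per task.
pub fn chunk_row_groups(kept: &[usize], chunk_size: usize) -> Vec<Vec<usize>> {
    kept.chunks(chunk_size.max(1)).map(|c| c.to_vec()).collect()
}
```

Pruning only skips I/O; the row-level bbox filter still has to run on the groups that are kept, since the statistics describe the envelope of a group and not its individual rows.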

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| rust/bambam-omf/src/collection/mod.rs | Registers the new collector_ops module. |
| rust/bambam-omf/src/collection/collector_ops.rs | Adds metadata→task processing, row-group pruning, and stream construction per row-group chunk. |
| rust/bambam-omf/src/collection/collector.rs | Refactors collection to a two-phase async pipeline (metadata/tasks then batch retrieval) and updates the ignored integration test expectations. |
| rust/bambam-omf/src/collection/collector_config.rs | Updates collector construction signature (but currently drops batch_size usage). |
| rust/bambam-omf/src/collection/filter/mod.rs | Exposes new RowGroupFilter trait. |
| rust/bambam-omf/src/collection/filter/row_group_filter.rs | Introduces a trait interface for row-group pruning. |
| rust/bambam-omf/src/collection/filter/row_filter_config.rs | Adds combined-filter uniqueness validation and helper to extract bbox filters for pruning. |

pub fn validate_unique_variant(&self) -> Result<(), OvertureMapsCollectionError> {
    match self {
        Self::Combined { filters } => {
            let mut seen_variants: HashSet<std::mem::Discriminant<RowFilterConfig>> =
Collaborator

this is interesting, first i've seen of std::mem::discriminant in action.
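
For anyone else who has not met it before, here is a small standalone sketch of the technique. `std::mem::discriminant` returns a `Discriminant<T>` that identifies an enum variant while ignoring its payload, and it implements `Eq` and `Hash`, so it can be stored in a `HashSet`. The `FilterSketch` enum and its `String` error are hypothetical and much simpler than the PR's `RowFilterConfig` and `OvertureMapsCollectionError`.

```rust
use std::collections::HashSet;
use std::mem::{discriminant, Discriminant};

/// Hypothetical, simplified filter enum used only for illustration.
#[derive(Debug, Clone, PartialEq)]
pub enum FilterSketch {
    Bbox { xmin: f64, xmax: f64, ymin: f64, ymax: f64 },
    HasClass { classes: Vec<String> },
    Combined { filters: Vec<FilterSketch> },
}

impl FilterSketch {
    /// Rejects a `Combined` filter that lists the same variant twice.
    /// Two `Bbox` filters with different coordinates still collide,
    /// because the discriminant ignores the fields of the variant.
    pub fn validate_unique_variant(&self) -> Result<(), String> {
        match self {
            FilterSketch::Combined { filters } => {
                let mut seen: HashSet<Discriminant<FilterSketch>> = HashSet::new();
                for filter in filters {
                    // `insert` returns false if the variant was already present.
                    if !seen.insert(discriminant(filter)) {
                        return Err(format!(
                            "duplicate filter variant in Combined: {filter:?}"
                        ));
                    }
                }
                Ok(())
            }
            // A single, non-combined filter is trivially unique.
            _ => Ok(()),
        }
    }
}
```

The alternative would be a hand-written match that maps each variant to a tag, which has to be updated every time a variant is added.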

Collaborator

@robfitzgerald robfitzgerald left a comment

looks good. i have a few small requests for doc comments, and i've made a follow-on issue for a few small user-focused features to add on top of this implementation.

i'm trying to run this now just to get a sanity check on performance. but it's been hanging for me. can you run this command and tell me what kind of runtimes you get for this very small network boundary?

RUST_LOG=info ./rust/target/release/bambam-omf network -c configuration/bambam-omf/travel-mode-filter.json -b -105.173256,-105.164802,39.817799,39.823534 -o out/test

one other sanity check we may want to add: if i got a bbox min and max mixed up, the requested box would suddenly cover the entire world. we should probably guard against that, with something like making sure the width of the box is not greater than 180 and the height is not greater than 90.
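
A rough sketch of what such a check could look like, assuming the `-b` argument is parsed as `xmin,xmax,ymin,ymax` (the order shown in the log output). The `ParsedBbox` type and `parse_bbox` function are hypothetical, and the min-below-max ordering check and the finite-value check are additions beyond the two limits suggested above.

```rust
/// Hypothetical parsed form of the `-b xmin,xmax,ymin,ymax` argument.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ParsedBbox {
    pub xmin: f64,
    pub xmax: f64,
    pub ymin: f64,
    pub ymax: f64,
}

/// Parses and sanity-checks a bbox argument.
pub fn parse_bbox(arg: &str) -> Result<ParsedBbox, String> {
    let mut values = Vec::with_capacity(4);
    for part in arg.split(',') {
        let value: f64 = part
            .trim()
            .parse()
            .map_err(|e| format!("invalid coordinate '{part}': {e}"))?;
        // "nan" and "inf" parse successfully, so reject them explicitly.
        if !value.is_finite() {
            return Err(format!("coordinate '{part}' is not finite"));
        }
        values.push(value);
    }
    if values.len() != 4 {
        return Err(format!("expected 4 comma-separated values, got {}", values.len()));
    }
    let bbox = ParsedBbox {
        xmin: values[0],
        xmax: values[1],
        ymin: values[2],
        ymax: values[3],
    };
    // Assumption: min must be strictly below max on both axes.
    if bbox.xmin >= bbox.xmax || bbox.ymin >= bbox.ymax {
        return Err("bbox min must be less than max on both axes".to_string());
    }
    // The limits suggested in the review: width <= 180, height <= 90.
    if bbox.xmax - bbox.xmin > 180.0 {
        return Err("bbox width exceeds 180 degrees".to_string());
    }
    if bbox.ymax - bbox.ymin > 90.0 {
        return Err("bbox height exceeds 90 degrees".to_string());
    }
    Ok(bbox)
}
```

With these limits, passing the coordinates in `xmin,ymin,xmax,ymax` order for the Colorado box above is rejected by the height check, since the mixed-up box would span about 145 degrees on the y axis.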

@yamilbknsu
Collaborator Author

Quickly ran the sanity check and got good news and bad news:

RUST_LOG=info ./rust/target/release/bambam_omf network -c configuration/bambam-omf/travel-mode-filter.json -b -105.173256,-105.164802,39.817799,39.823534 -o out/test
[2026-01-26T22:26:36Z INFO  bambam_omf::app::network] running OMF import with
            object store AmazonS3
            rg_chunk_size 4
            file_concurrency_limit 64
            release latest
            (xmin,xmax,ymin,ymax): -105.173256,-105.1648,39.8178,39.823532
[2026-01-26T22:26:37Z INFO  bambam_omf::collection::collector] Collecting OvertureMaps Connector records from release 2026-01-21.0
[2026-01-26T22:26:38Z INFO  bambam_omf::collection::collector] Started collection
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Collection time 8.999544084s
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Deserialization time 1.1105ms
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Total time 9.000756292s
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Collecting OvertureMaps Segment records from release 2026-01-21.0
[2026-01-26T22:26:48Z INFO  bambam_omf::collection::collector] Started collection
[2026-01-26T22:26:58Z INFO  bambam_omf::collection::collector] Collection time 10.015029208s
[2026-01-26T22:26:58Z INFO  bambam_omf::collection::collector] Deserialization time 452.75µs
[2026-01-26T22:26:58Z INFO  bambam_omf::collection::collector] Total time 10.015506542s

[2026-01-26T22:26:58Z INFO  bambam_omf::graph::serialize_ops] all connectors accounted for

thread 'main' panicked at bambam-omf/src/bin/bambam_omf.rs:7:19:
called `Result::unwrap()` on an `Err` value: InternalError("internal division by zero when computing average speed: [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Collaborator

@robfitzgerald robfitzgerald left a comment

I'm going to say that both your good news and bad news are in fact good news... For this PR. I'll create a bug issue for the speeds panic. Thanks Yamil, nice work, these are great improvements!

@robfitzgerald
Collaborator

Created issue #92 for our bad news

@yamilbknsu yamilbknsu merged commit 79fc620 into main Jan 27, 2026
1 check passed

Development

Successfully merging this pull request may close these issues.

better OvertureMaps download performance

3 participants