Row group metadata optimization by yamilbknsu · Pull Request #88 · NatLabRockies/bambam

yamilbknsu · 2026-01-23T22:13:56Z

In this PR, I implement row_group pruning to optimize OMF data retrieval. I also refactor the async pipeline to improve readability and control over the parallelism.

I was finally able to leave only two awaits, both with real meaning. The first retrieves the row group metadata, and the second collects the batches based on that information.
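The hand-off between the two awaits can be pictured without an async runtime. The following is a minimal synchronous sketch, not the PR's code: `RowGroupTask`, `plan_tasks`, and the file name are hypothetical, and only the name `rg_chunk_size` is taken from the log output later in this thread. It assumes the first phase has already produced the row-group indices worth reading and shows how they might be grouped into units of work for the second phase.

```rust
/// Hypothetical unit of work: one file plus the row groups to read from it.
#[derive(Debug, PartialEq)]
struct RowGroupTask {
    file: String,
    row_groups: Vec<usize>,
}

/// Splits the row-group indices that survived phase 1 into chunks of at most
/// `rg_chunk_size`, so phase 2 can fetch each chunk independently.
fn plan_tasks(file: &str, kept_row_groups: &[usize], rg_chunk_size: usize) -> Vec<RowGroupTask> {
    // `chunks` panics on a chunk size of 0, so clamp to at least 1.
    kept_row_groups
        .chunks(rg_chunk_size.max(1))
        .map(|chunk| RowGroupTask {
            file: file.to_string(),
            row_groups: chunk.to_vec(),
        })
        .collect()
}

fn main() {
    // Illustrative input: five surviving row groups, chunk size 4.
    let tasks = plan_tasks("part-0.parquet", &[0, 3, 4, 7, 9], 4);
    for task in &tasks {
        println!("{} -> {:?}", task.file, task.row_groups);
    }
}
```

In an async version, each task would presumably become one stream of batches, with the chunk size trading request count against per-request payload.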

Performance is improved significantly. I would have liked to also include the deserialization (RecordBatch -> OvertureRecord) in the same tokio pipeline, but the types made it difficult. It was much easier to return the batches from the tokio pipeline and then use rayon to process everything offline.
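The split described here (download first, convert afterwards on CPU threads) can be illustrated with the standard library alone. This sketch deliberately swaps rayon for `std::thread::scope` so that it has no dependencies; `Batch`, `Record`, and `deserialize_all` are hypothetical stand-ins, not the crate's types.

```rust
use std::thread;

/// Stand-in for a downloaded record batch: a plain vector of raw values.
type Batch = Vec<u32>;

/// Stand-in for the typed record produced by deserialization.
#[derive(Debug, PartialEq)]
struct Record {
    id: u32,
}

/// Converts already-downloaded batches into records, one scoped thread per batch.
/// Output order follows the input batch order.
fn deserialize_all(batches: &[Batch]) -> Vec<Record> {
    thread::scope(|scope| {
        // Spawn every worker first so the batches are processed concurrently.
        let handles: Vec<_> = batches
            .iter()
            .map(|batch| {
                scope.spawn(move || batch.iter().map(|&id| Record { id }).collect::<Vec<_>>())
            })
            .collect();
        // Join in spawn order to keep the result deterministic.
        handles
            .into_iter()
            .flat_map(|handle| handle.join().expect("worker thread panicked"))
            .collect()
    })
}

fn main() {
    let batches: Vec<Batch> = vec![vec![1, 2], vec![], vec![3]];
    let records = deserialize_all(&batches);
    let id_sum: u32 = records.iter().map(|r| r.id).sum();
    println!("{} records, id sum {}", records.len(), id_sum);
}
```

A thread per batch is only reasonable for a small number of batches; a work-stealing pool such as rayon's avoids that limit, which is likely why it was chosen.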

closes #83

Copilot

Pull request overview

Implements row-group pruning (via bbox metadata) and refactors the async collection pipeline to improve OMF retrieval performance (Issue #83).

Changes:

Added task-based async pipeline that loads Parquet metadata up-front and schedules work per row-group chunk.
Implemented bbox-based row-group pruning using Parquet column statistics.
Extended row filter configuration with validation for combined filters and bbox extraction to enable pruning.
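The pruning idea in the list above can be sketched as an envelope-overlap test. This is an illustration under stated assumptions rather than the PR's implementation: `Bbox`, `intersects`, and `prune_row_groups` are hypothetical names, and the per-row-group envelope is assumed to have been derived already from the min/max column statistics.

```rust
/// Axis-aligned bounding box in degrees.
#[derive(Debug, Clone, Copy, PartialEq)]
struct Bbox {
    xmin: f64,
    xmax: f64,
    ymin: f64,
    ymax: f64,
}

impl Bbox {
    /// True when the two boxes share at least one point (touching edges count).
    fn intersects(&self, other: &Bbox) -> bool {
        self.xmin <= other.xmax
            && other.xmin <= self.xmax
            && self.ymin <= other.ymax
            && other.ymin <= self.ymax
    }
}

/// Returns the indices of row groups whose statistics envelope overlaps `query`.
/// A `None` entry means statistics were unavailable, so that row group is kept:
/// pruning may only discard groups that provably contain no matching rows.
fn prune_row_groups(stats: &[Option<Bbox>], query: &Bbox) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, envelope)| match envelope {
            Some(env) => env.intersects(query),
            None => true,
        })
        .map(|(index, _)| index)
        .collect()
}

fn main() {
    // Query box taken from the command discussed later in this thread.
    let query = Bbox { xmin: -105.173256, xmax: -105.164802, ymin: 39.817799, ymax: 39.823534 };
    // Made-up envelopes for three row groups.
    let stats = vec![
        Some(Bbox { xmin: -106.0, xmax: -105.0, ymin: 39.0, ymax: 40.0 }),
        Some(Bbox { xmin: 10.0, xmax: 11.0, ymin: 50.0, ymax: 51.0 }),
        None,
    ];
    println!("row groups to read: {:?}", prune_row_groups(&stats, &query));
}
```

Pruning of this kind is conservative: a kept row group can still contain rows outside the query box, so the row-level filter remains necessary.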

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 6 comments.


File	Description
rust/bambam-omf/src/collection/mod.rs	Registers the new `collector_ops` module.
rust/bambam-omf/src/collection/collector_ops.rs	Adds metadata→task processing, row-group pruning, and stream construction per row-group chunk.
rust/bambam-omf/src/collection/collector.rs	Refactors collection to a two-phase async pipeline (metadata/tasks then batch retrieval) and updates the ignored integration test expectations.
rust/bambam-omf/src/collection/collector_config.rs	Updates collector construction signature (but currently drops `batch_size` usage).
rust/bambam-omf/src/collection/filter/mod.rs	Exposes new `RowGroupFilter` trait.
rust/bambam-omf/src/collection/filter/row_group_filter.rs	Introduces a trait interface for row-group pruning.
rust/bambam-omf/src/collection/filter/row_filter_config.rs	Adds combined-filter uniqueness validation and helper to extract bbox filters for pruning.


robfitzgerald · 2026-01-26T20:09:59Z

rust/bambam-omf/src/collection/filter/row_filter_config.rs

+    pub fn validate_unique_variant(&self) -> Result<(), OvertureMapsCollectionError> {
+        match self {
+            Self::Combined { filters } => {
+                let mut seen_variants: HashSet<std::mem::Discriminant<RowFilterConfig>> =


this is interesting, first i've seen of std::mem::discriminant in action.
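For readers who have not met `std::mem::discriminant` either: it returns an opaque value that identifies an enum's variant while ignoring the variant's fields, and that value is hashable. The sketch below shows the uniqueness check on a hypothetical `Filter` enum. The variants and the `String` error type are assumptions for illustration; the PR's `RowFilterConfig` and `OvertureMapsCollectionError` are not reproduced here.

```rust
use std::collections::HashSet;
use std::mem::{discriminant, Discriminant};

/// Hypothetical filter enum; the variants are illustrative only.
#[allow(dead_code)]
#[derive(Debug)]
enum Filter {
    Bbox { xmin: f64, xmax: f64, ymin: f64, ymax: f64 },
    HasClass { classes: Vec<String> },
    Combined { filters: Vec<Filter> },
}

/// Returns an error if a `Combined` filter holds two children of the same variant.
fn validate_unique_variant(filter: &Filter) -> Result<(), String> {
    if let Filter::Combined { filters } = filter {
        let mut seen: HashSet<Discriminant<Filter>> = HashSet::new();
        for child in filters {
            // `insert` returns false when this variant was already recorded.
            if !seen.insert(discriminant(child)) {
                return Err(format!("duplicate filter variant: {:?}", child));
            }
        }
    }
    Ok(())
}

fn main() {
    let duplicated = Filter::Combined {
        filters: vec![
            Filter::Bbox { xmin: 0.0, xmax: 1.0, ymin: 0.0, ymax: 1.0 },
            Filter::Bbox { xmin: 2.0, xmax: 3.0, ymin: 2.0, ymax: 3.0 },
        ],
    };
    println!("{:?}", validate_unique_variant(&duplicated));
}
```

Two `Bbox` values with different coordinates still share one discriminant, which is what makes this a check on variants rather than on values.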

robfitzgerald

looks good. i have a few small requests for doc comments, and i've made a follow-on issue for a few small user-focused features to add on top of this implementation.

i'm trying to run this now just to get a sanity check on performance. but it's been hanging for me. can you run this command and tell me what kind of runtimes you get for this very small network boundary?

RUST_LOG=info ./rust/target/release/bambam-omf network -c configuration/bambam-omf/travel-mode-filter.json -b -105.173256,-105.164802,39.817799,39.823534 -o out/test

one other sanity check we may want to add: if i got a bbox min and max mixed up, the requested box would suddenly cover the entire world. we should probably put some sanity check, something like making sure the width of the box is not greater than 180 and the height is not greater than 90.
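A sanity check along the lines suggested above might look like the following. It is a sketch only: `validate_bbox` is a hypothetical name, the argument order follows the `-b xmin,xmax,ymin,ymax` form used in the command, the 180 and 90 degree limits are the ones proposed in this comment, and the validation that was eventually merged may differ.

```rust
/// Validates a bounding box given in (xmin, xmax, ymin, ymax) order, in degrees.
fn validate_bbox(xmin: f64, xmax: f64, ymin: f64, ymax: f64) -> Result<(), String> {
    // Reject NaN and infinities before comparing anything.
    if ![xmin, xmax, ymin, ymax].iter().all(|v| v.is_finite()) {
        return Err("bbox coordinates must be finite numbers".to_string());
    }
    // Catch swapped min/max values.
    if xmin >= xmax {
        return Err(format!("xmin ({xmin}) must be less than xmax ({xmax})"));
    }
    if ymin >= ymax {
        return Err(format!("ymin ({ymin}) must be less than ymax ({ymax})"));
    }
    // Catch boxes that are implausibly large for a network download.
    if xmax - xmin > 180.0 {
        return Err(format!("bbox width {} exceeds 180 degrees", xmax - xmin));
    }
    if ymax - ymin > 90.0 {
        return Err(format!("bbox height {} exceeds 90 degrees", ymax - ymin));
    }
    Ok(())
}

fn main() {
    // The small boundary from the command above passes.
    println!("{:?}", validate_bbox(-105.173256, -105.164802, 39.817799, 39.823534));
    // Swapping xmin and xmax is rejected.
    println!("{:?}", validate_bbox(-105.164802, -105.173256, 39.817799, 39.823534));
}
```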


yamilbknsu · 2026-01-26T22:28:49Z

Quickly ran the sanity check and got good news and bad news:

 RUST_LOG=info ./rust/target/release/bambam_omf network -c configuration/bambam-omf/travel-mode-filter.json -b -105.173256,-105.164802,39.817799,39.823534 -o out/test                                   
[2026-01-26T22:26:36Z INFO  bambam_omf::app::network] running OMF import with
            object store AmazonS3
            rg_chunk_size 4
            file_concurrency_limit 64
            release latest
            (xmin,xmax,ymin,ymax): -105.173256,-105.1648,39.8178,39.823532
[2026-01-26T22:26:37Z INFO  bambam_omf::collection::collector] Collecting OvertureMaps Connector records from release 2026-01-21.0
[2026-01-26T22:26:38Z INFO  bambam_omf::collection::collector] Started collection
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Collection time 8.999544084s
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Deserialization time 1.1105ms
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Total time 9.000756292s
[2026-01-26T22:26:47Z INFO  bambam_omf::collection::collector] Collecting OvertureMaps Segment records from release 2026-01-21.0
[2026-01-26T22:26:48Z INFO  bambam_omf::collection::collector] Started collection
[2026-01-26T22:26:58Z INFO  bambam_omf::collection::collector] Collection time 10.015029208s
[2026-01-26T22:26:58Z INFO  bambam_omf::collection::collector] Deserialization time 452.75µs
[2026-01-26T22:26:58Z INFO  bambam_omf::collection::collector] Total time 10.015506542s

[2026-01-26T22:26:58Z INFO  bambam_omf::graph::serialize_ops] all connectors accounted for

thread 'main' panicked at bambam-omf/src/bin/bambam_omf.rs:7:19:
called `Result::unwrap()` on an `Err` value: InternalError("internal division by zero when computing average speed: [None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None, None]")
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

robfitzgerald

I'm going to say that both your good news and bad news are in fact good news... For this PR. I'll create a bug issue for the speeds panic. Thanks Yamil, nice work, these are great improvements!

robfitzgerald · 2026-01-26T23:29:38Z

Created issue #92 for our bad news
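The panic message above suggests an average taken over a list in which every speed is `None`. How issue #92 was actually resolved is not shown in this thread; the following is only a sketch of one way to make that case explicit, using a hypothetical `average_speed` helper that returns `None` instead of dividing by zero.

```rust
/// Averages the speeds that are present. Returns `None` when no speed is
/// available, so the caller decides on a fallback instead of dividing by zero.
fn average_speed(speeds: &[Option<f64>]) -> Option<f64> {
    // `flatten` skips the `None` entries and yields references to the values.
    let (sum, count) = speeds
        .iter()
        .flatten()
        .fold((0.0, 0usize), |(sum, count), speed| (sum + speed, count + 1));
    if count == 0 {
        None
    } else {
        Some(sum / count as f64)
    }
}

fn main() {
    println!("{:?}", average_speed(&[None, None, None]));
    println!("{:?}", average_speed(&[Some(10.0), None, Some(20.0)]));
}
```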

yamilbknsu added 3 commits January 22, 2026 15:33

working solution for pruning row groups

cff3143

refactored async collection

a8ecfa7

clippy and configuration

4aec2aa

yamilbknsu requested review from Copilot and robfitzgerald January 23, 2026 22:13

Copilot started reviewing on behalf of yamilbknsu January 23, 2026 22:14

delete unnecessary module

7f2d635

Copilot AI reviewed Jan 23, 2026

yamilbknsu added 5 commits January 23, 2026 17:23

fix config

7e82642

clippy

b739a4b

copilot suggestion

97203e7

another copilot suggestion

835f25f

fmt

458a506

robfitzgerald reviewed Jan 26, 2026

robfitzgerald mentioned this pull request Jan 26, 2026

bambam-omf download features #91

Open

robfitzgerald requested changes Jan 26, 2026

robfitzgerald approved these changes Jan 26, 2026

robfitzgerald mentioned this pull request Jan 26, 2026

Handling edge cases in speed averaging #92

Closed

yamilbknsu added 4 commits January 27, 2026 12:00

docstrings

fa45ecc

Bbox validation

71f4e42

fmt

6d0fe51

clippy

31692be

yamilbknsu merged commit 79fc620 into main Jan 27, 2026
1 check passed

yamilbknsu merged 13 commits into main from yep/row-group-metadata-optimization
