feat: Add parsing-level projection pushdown for BAM/CRAM/VCF by mwiewior · Pull Request #288 · biodatageeks/polars-bio

mwiewior · 2026-02-09T17:35:12Z

Summary

Bump datafusion-bio-formats to rev 84605fcd (upstream PR #64) which adds parsing-level projection pushdown for BAM, CRAM, and VCF formats
Unprojected fields are now skipped entirely during record parsing — no string formatting, sequence decoding, map lookups, or memory allocation for unused columns
COUNT(*) queries use an empty projection path, avoiding parsing any fields while still counting records correctly

Changes

Cargo.toml / Cargo.lock: Update all 8 datafusion-bio-format-* crate revs
tests/test_execution_plan_validation.py: Un-skip test, rewrite to use physical plan DisplayAs format (VcfExec: projection=[chrom, start])
tests/test_projection_pushdown.py: Un-skip test, add 8 new tests — BAM/CRAM projection correctness, COUNT(*) for BAM/CRAM/VCF, EXPLAIN plan verification for BAM/CRAM/VCF
docs/features.md: Mark BAM/CRAM/VCF projection pushdown as supported in the feature matrix, add new "Projection pushdown" documentation section with usage examples
CHANGELOG.md: Document the upstream integration

Test plan

cargo check passes
maturin develop --release builds successfully
All 20 projection/plan tests pass (test_projection_pushdown.py + test_execution_plan_validation.py)
All 79 IO tests pass (test_io_bam.py, test_io_cram.py, test_io_vcf.py)
Per-column DisplayAs verification: every column for BAM (11), CRAM (11), and VCF (8) reports correctly in the physical plan

🤖 Generated with Claude Code

Bump datafusion-bio-formats to rev 84605fcd (upstream PR #64) which adds parsing-level projection pushdown. Unprojected fields are now skipped entirely during record parsing — no string formatting, sequence decoding, or memory allocation for unused columns. - Update all 8 datafusion-bio-format-* crate revs in Cargo.toml - Un-skip and rewrite EXPLAIN plan tests to use physical plan DisplayAs format (e.g. BamExec: projection=[name, chrom]) - Add new tests: BAM/CRAM projection correctness, COUNT(*) for all three formats, EXPLAIN plan verification for BAM/CRAM/VCF - Update features.md table and add projection pushdown documentation - Update CHANGELOG.md Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mwiewior · 2026-02-09T17:37:44Z

@claude pls review

claude · 2026-02-09T17:38:02Z

Claude finished @mwiewior's task in 1m 18s —— View job

Code Review

Summary

This PR successfully integrates parsing-level projection pushdown for BAM/CRAM/VCF formats by bumping datafusion-bio-formats to rev 84605fcd. The implementation looks solid and the test coverage is comprehensive. The feature will provide significant performance improvements for queries that only need a subset of columns.

Strengths

Clean dependency update: All 8 datafusion-bio-format-* crates are consistently updated to the same rev
Comprehensive test coverage: Added 8 new tests covering:
- BAM/CRAM projection correctness
- COUNT(*) queries for all three formats
- Physical plan verification via EXPLAIN
Documentation is thorough:
- Feature matrix updated correctly
- New "Projection pushdown" section with clear examples
- Helpful tip about COUNT(*) optimization
Test cleanup: Un-skipped and simplified test_execution_plan_validation.py by switching from logical plan inspection to physical plan DisplayAs format

Issues & Suggestions

1. Minor: Inconsistent documentation formatting

In docs/features.md line 721, there's a missing space in the tip directive:

!!! tip
    `COUNT(*)` queries also benefit — when no columns are needed...

This is correct, but the indentation should be consistent with other tips in the file (4 spaces). The current version looks correct though.

2. Test organization consideration

tests/test_projection_pushdown.py has grown significantly (from 37 to 177 lines, +140 lines). Consider if some tests could be factored into helper functions or if the file should be split for better maintainability. Not a blocker, but something to keep in mind for future refactoring.

3. Documentation: Missing example for CRAM

The new projection pushdown section (docs/features.md:682-721) shows examples for BAM and VCF, but doesn't explicitly mention CRAM despite it being supported. Consider adding a brief note like:

# Works with CRAM too
df = pb.scan_cram("alignments.cram", projection_pushdown=True).select(["name", "chrom"]).collect()

4. Question: Default behavior

The PR description states the feature requires projection_pushdown=True, but it would be helpful to document in docs/features.md why this isn't the default. Is there a performance trade-off or breaking change concern? A brief note explaining the opt-in design would help users understand when to enable it.

5. Test file naming

test_execution_plan_validation.py now only contains a single test function. Consider renaming the file to better reflect its focused purpose, or adding a docstring at the module level explaining its scope.

Minor Observations

CHANGELOG.md: The entry is well-written and clearly explains the benefits (line 39-42)
Feature matrix update: Correctly changed BAM/CRAM/VCF projection pushdown from "❌" / ":construction:" to ":white_check_mark:"
Test assertions: Good use of descriptive assertion messages with actual vs expected values
Code quality: The regex pattern in extract_projected_columns_from_plan() correctly handles all three formats

Verification

I've reviewed:

✅ Dependency version consistency across all 8 crates
✅ Test correctness and coverage
✅ Documentation accuracy and completeness
✅ CHANGELOG entry quality
✅ No obvious bugs or security issues

Recommendation

LGTM with minor suggestions. The core implementation is solid, tests are comprehensive, and documentation is clear. The suggestions above are non-blocking improvements that could be addressed in a follow-up PR if desired.

The parsing-level projection pushdown is a significant performance optimization, and this integration looks well-executed. Great work! 🎉

Projection pushdown is now enabled by default for all I/O methods (scan_*/read_*) and range operations (overlap, nearest, count_overlaps, coverage, merge). Pass projection_pushdown=False to opt out. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Address review feedback: add explicit CRAM usage example alongside the existing BAM and VCF examples. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mwiewior and others added 2 commits February 9, 2026 18:43

docs: Add CRAM example to projection pushdown section

a8390cb

Address review feedback: add explicit CRAM usage example alongside the existing BAM and VCF examples. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

mwiewior merged commit a87fe8b into master Feb 9, 2026
13 checks passed

mwiewior deleted the feat/parsing-level-projection-pushdown branch February 9, 2026 18:13

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: Add parsing-level projection pushdown for BAM/CRAM/VCF#288

feat: Add parsing-level projection pushdown for BAM/CRAM/VCF#288
mwiewior merged 3 commits intomasterfrom
feat/parsing-level-projection-pushdown

mwiewior commented Feb 9, 2026

Uh oh!

mwiewior commented Feb 9, 2026

Uh oh!

claude bot commented Feb 9, 2026 •

edited

Loading

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

mwiewior commented Feb 9, 2026

Summary

Changes

Test plan

Uh oh!

mwiewior commented Feb 9, 2026

Uh oh!

claude bot commented Feb 9, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Code Review

Summary

Strengths

Issues & Suggestions

1. Minor: Inconsistent documentation formatting

2. Test organization consideration

3. Documentation: Missing example for CRAM

4. Question: Default behavior

5. Test file naming

Minor Observations

Verification

Recommendation

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

claude bot commented Feb 9, 2026 •

edited

Loading