Skip to content

feat: Add parsing-level projection pushdown for BAM/CRAM/VCF#288

Merged
mwiewior merged 3 commits intomasterfrom
feat/parsing-level-projection-pushdown
Feb 9, 2026
Merged

feat: Add parsing-level projection pushdown for BAM/CRAM/VCF#288
mwiewior merged 3 commits intomasterfrom
feat/parsing-level-projection-pushdown

Conversation

@mwiewior
Copy link
Collaborator

@mwiewior mwiewior commented Feb 9, 2026

Summary

  • Bump datafusion-bio-formats to rev 84605fcd (upstream PR #64) which adds parsing-level projection pushdown for BAM, CRAM, and VCF formats
  • Unprojected fields are now skipped entirely during record parsing — no string formatting, sequence decoding, map lookups, or memory allocation for unused columns
  • COUNT(*) queries use an empty projection path, avoiding parsing any fields while still counting records correctly

Changes

  • Cargo.toml / Cargo.lock: Update all 8 datafusion-bio-format-* crate revs
  • tests/test_execution_plan_validation.py: Un-skip test, rewrite to use physical plan DisplayAs format (VcfExec: projection=[chrom, start])
  • tests/test_projection_pushdown.py: Un-skip test, add 8 new tests — BAM/CRAM projection correctness, COUNT(*) for BAM/CRAM/VCF, EXPLAIN plan verification for BAM/CRAM/VCF
  • docs/features.md: Mark BAM/CRAM/VCF projection pushdown as supported in the feature matrix, add new "Projection pushdown" documentation section with usage examples
  • CHANGELOG.md: Document the upstream integration

Test plan

  • cargo check passes
  • maturin develop --release builds successfully
  • All 20 projection/plan tests pass (test_projection_pushdown.py + test_execution_plan_validation.py)
  • All 79 IO tests pass (test_io_bam.py, test_io_cram.py, test_io_vcf.py)
  • Per-column DisplayAs verification: every column for BAM (11), CRAM (11), and VCF (8) reports correctly in the physical plan

🤖 Generated with Claude Code

Bump datafusion-bio-formats to rev 84605fcd (upstream PR #64) which
adds parsing-level projection pushdown. Unprojected fields are now
skipped entirely during record parsing — no string formatting, sequence
decoding, or memory allocation for unused columns.

- Update all 8 datafusion-bio-format-* crate revs in Cargo.toml
- Un-skip and rewrite EXPLAIN plan tests to use physical plan DisplayAs
  format (e.g. BamExec: projection=[name, chrom])
- Add new tests: BAM/CRAM projection correctness, COUNT(*) for all
  three formats, EXPLAIN plan verification for BAM/CRAM/VCF
- Update features.md table and add projection pushdown documentation
- Update CHANGELOG.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior
Copy link
Collaborator Author

mwiewior commented Feb 9, 2026

@claude pls review

@claude
Copy link

claude bot commented Feb 9, 2026

Claude finished @mwiewior's task in 1m 18s —— View job


Code Review

  • Read changed files
  • Review dependency updates
  • Review test changes
  • Review documentation updates
  • Review CHANGELOG
  • Provide feedback

Summary

This PR successfully integrates parsing-level projection pushdown for BAM/CRAM/VCF formats by bumping datafusion-bio-formats to rev 84605fcd. The implementation looks solid and the test coverage is comprehensive. The feature will provide significant performance improvements for queries that only need a subset of columns.

Strengths

  1. Clean dependency update: All 8 datafusion-bio-format-* crates are consistently updated to the same rev
  2. Comprehensive test coverage: Added 8 new tests covering:
    • BAM/CRAM projection correctness
    • COUNT(*) queries for all three formats
    • Physical plan verification via EXPLAIN
  3. Documentation is thorough:
    • Feature matrix updated correctly
    • New "Projection pushdown" section with clear examples
    • Helpful tip about COUNT(*) optimization
  4. Test cleanup: Un-skipped and simplified test_execution_plan_validation.py by switching from logical plan inspection to physical plan DisplayAs format

Issues & Suggestions

1. Minor: Inconsistent documentation formatting

In docs/features.md line 721, there's a missing space in the tip directive:

!!! tip
    `COUNT(*)` queries also benefit — when no columns are needed...

This is correct, but the indentation should be consistent with other tips in the file (4 spaces). The current version looks correct though.

2. Test organization consideration

tests/test_projection_pushdown.py has grown significantly (from 37 to 177 lines, +140 lines). Consider if some tests could be factored into helper functions or if the file should be split for better maintainability. Not a blocker, but something to keep in mind for future refactoring.

3. Documentation: Missing example for CRAM

The new projection pushdown section (docs/features.md:682-721) shows examples for BAM and VCF, but doesn't explicitly mention CRAM despite it being supported. Consider adding a brief note like:

# Works with CRAM too
df = pb.scan_cram("alignments.cram", projection_pushdown=True).select(["name", "chrom"]).collect()

4. Question: Default behavior

The PR description states the feature requires projection_pushdown=True, but it would be helpful to document in docs/features.md why this isn't the default. Is there a performance trade-off or breaking change concern? A brief note explaining the opt-in design would help users understand when to enable it.

5. Test file naming

test_execution_plan_validation.py now only contains a single test function. Consider renaming the file to better reflect its focused purpose, or adding a docstring at the module level explaining its scope.

Minor Observations

  1. CHANGELOG.md: The entry is well-written and clearly explains the benefits (line 39-42)
  2. Feature matrix update: Correctly changed BAM/CRAM/VCF projection pushdown from "❌" / ":construction:" to ":white_check_mark:"
  3. Test assertions: Good use of descriptive assertion messages with actual vs expected values
  4. Code quality: The regex pattern in extract_projected_columns_from_plan() correctly handles all three formats

Verification

I've reviewed:

  • ✅ Dependency version consistency across all 8 crates
  • ✅ Test correctness and coverage
  • ✅ Documentation accuracy and completeness
  • ✅ CHANGELOG entry quality
  • ✅ No obvious bugs or security issues

Recommendation

LGTM with minor suggestions. The core implementation is solid, tests are comprehensive, and documentation is clear. The suggestions above are non-blocking improvements that could be addressed in a follow-up PR if desired.

The parsing-level projection pushdown is a significant performance optimization, and this integration looks well-executed. Great work! 🎉


mwiewior and others added 2 commits February 9, 2026 18:43
Projection pushdown is now enabled by default for all I/O methods
(scan_*/read_*) and range operations (overlap, nearest, count_overlaps,
coverage, merge). Pass projection_pushdown=False to opt out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback: add explicit CRAM usage example alongside
the existing BAM and VCF examples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@mwiewior mwiewior merged commit a87fe8b into master Feb 9, 2026
13 checks passed
@mwiewior mwiewior deleted the feat/parsing-level-projection-pushdown branch February 9, 2026 18:13
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant