feat: Add parsing-level projection pushdown for BAM/CRAM/VCF #288
Bump datafusion-bio-formats to rev 84605fcd (upstream PR #64), which adds parsing-level projection pushdown. Unprojected fields are now skipped entirely during record parsing: no string formatting, sequence decoding, or memory allocation for unused columns.

- Update all 8 datafusion-bio-format-* crate revs in Cargo.toml
- Un-skip and rewrite EXPLAIN plan tests to use physical plan DisplayAs format (e.g. BamExec: projection=[name, chrom])
- Add new tests: BAM/CRAM projection correctness, COUNT(*) for all three formats, EXPLAIN plan verification for BAM/CRAM/VCF
- Update features.md table and add projection pushdown documentation
- Update CHANGELOG.md

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
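To illustrate what "parsing-level" means here, the following is a minimal Python sketch, not the upstream Rust implementation. The tab-separated record layout, the `DECODERS` table, and the function names `parse_record`, `scan`, and `count_records` are all hypothetical. It only shows the idea: decoders run for projected columns alone, and an empty projection still yields one (empty) row per record, so counting works without decoding anything.

```python
from typing import Callable, Dict, Iterable, List, Sequence

# Hypothetical record layout, for illustration only.
COLUMNS = ["name", "chrom", "start", "sequence"]

# Per-column decoders; "sequence" stands in for an expensive decode step.
DECODERS: Dict[str, Callable[[str], object]] = {
    "name": str,
    "chrom": str,
    "start": int,
    "sequence": lambda raw: raw.upper(),
}


def parse_record(line: str, projection: Sequence[str]) -> dict:
    """Decode only the projected columns of one tab-separated record."""
    raw_fields = line.rstrip("\n").split("\t")
    wanted = set(projection)
    # Decoders for columns outside the projection are never called.
    return {
        col: DECODERS[col](raw)
        for col, raw in zip(COLUMNS, raw_fields)
        if col in wanted
    }


def scan(lines: Iterable[str], projection: Sequence[str]) -> List[dict]:
    """Parse every record, pushing the projection down into the parser."""
    return [parse_record(line, projection) for line in lines]


def count_records(lines: Iterable[str]) -> int:
    """COUNT(*)-style path: empty projection, records are still counted."""
    return sum(1 for _ in scan(lines, projection=[]))
```

In this toy version the line is still split into raw fields; the saving is that no per-field decoding or allocation happens for columns outside the projection.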
@claude pls review
Claude finished @mwiewior's task in 1m 18s

Code Review
Summary
This PR successfully integrates parsing-level projection pushdown for BAM/CRAM/VCF formats by bumping

Strengths
Issues & Suggestions

1. Minor: Inconsistent documentation formatting
In
    !!! tip
        `COUNT(*)` queries also benefit — when no columns are needed...
This is correct, but the indentation should be consistent with other tips in the file (4 spaces). The current version looks correct though.

2. Test organization consideration

3. Documentation: Missing example for CRAM
The new projection pushdown section (docs/features.md:682-721) shows examples for BAM and VCF, but doesn't explicitly mention CRAM despite it being supported. Consider adding a brief note like:
    # Works with CRAM too
    df = pb.scan_cram("alignments.cram", projection_pushdown=True).select(["name", "chrom"]).collect()

4. Question: Default behavior
The PR description states the feature requires

5. Test file naming
Minor Observations
Verification
I've reviewed:
Recommendation
LGTM with minor suggestions. The core implementation is solid, tests are comprehensive, and documentation is clear. The suggestions above are non-blocking improvements that could be addressed in a follow-up PR if desired.

The parsing-level projection pushdown is a significant performance optimization, and this integration looks well-executed. Great work! 🎉
Projection pushdown is now enabled by default for all I/O methods (scan_*/read_*) and range operations (overlap, nearest, count_overlaps, coverage, merge). Pass projection_pushdown=False to opt out.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Address review feedback: add explicit CRAM usage example alongside the existing BAM and VCF examples.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Summary
- Bump datafusion-bio-formats to rev 84605fcd (upstream PR #64), which adds parsing-level projection pushdown for BAM, CRAM, and VCF formats
- COUNT(*) queries use an empty projection path, avoiding parsing any fields while still counting records correctly

Changes
- Update datafusion-bio-format-* crate revs
- Rewrite EXPLAIN plan tests to use physical plan DisplayAs format (VcfExec: projection=[chrom, start])

Test plan
- cargo check passes
- maturin develop --release builds successfully
- test_projection_pushdown.py + test_execution_plan_validation.py
- test_io_bam.py, test_io_cram.py, test_io_vcf.py

🤖 Generated with Claude Code
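As a rough illustration of the EXPLAIN plan checks mentioned above, here is a small Python sketch that pulls the projected column list out of a plan line such as `VcfExec: projection=[chrom, start]`. The helper name `projected_columns` and the regex are hypothetical and are not taken from the PR's test files. The only assumption about the plan text is the `<Node>Exec: projection=[...]` shape quoted in this thread; the sketch also assumes an empty projection would render as `projection=[]`, which the thread does not confirm.

```python
import re
from typing import List, Optional

# Matches plan lines such as "BamExec: projection=[name, chrom]".
_PLAN_LINE = re.compile(r"(?P<node>\w+Exec): projection=\[(?P<cols>[^\]]*)\]")


def projected_columns(plan_text: str, node: str) -> Optional[List[str]]:
    """Return the projected columns for the first `node` line, or None if absent."""
    for match in _PLAN_LINE.finditer(plan_text):
        if match.group("node") == node:
            cols = match.group("cols").strip()
            # An empty bracket pair is treated as an empty projection.
            return [c.strip() for c in cols.split(",")] if cols else []
    return None
```

A test could then assert, for example, that `projected_columns(plan, "VcfExec")` equals `["chrom", "start"]` rather than matching the whole plan string, which keeps the assertion stable if unrelated plan nodes change.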