perf(parquet): column parallelism + S3 byte range prefetching for arrow-rs reader #6353
Draft
desmondcheongzx wants to merge 11 commits into main
Conversation
Codecov Report ❌ Patch coverage is
Additional details and impacted files

```
@@            Coverage Diff             @@
##             main    #6353      +/-   ##
==========================================
- Coverage   74.82%   74.44%   -0.39%
==========================================
  Files        1022     1021       -1
  Lines      136386   137603    +1217
==========================================
+ Hits       102051   102435     +384
- Misses      34335    35168     +833
```
Parallelize column decoding within each row group across all four read paths. Opens separate readers per column with ProjectionMask::roots, decodes independently, and hconcats results. Supports two-phase decode when predicates are pushed (serial predicate phase, parallel data phase with refined RowSelection). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Column-parallel decode opened the file independently for each column task, adding 16x open() syscall overhead on a 16-column file. Read the file into a bytes::Bytes buffer once and share it across column tasks via cheap Bytes::clone() (atomic refcount, zero-copy). Each column reader gets its own independent cursor over the shared buffer. This fixes the CodSpeed regression in test_show[1 Small File] where the per-column file opens added ~2.6ms overhead on a small 1024-row file. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
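The shared-buffer pattern this commit describes can be sketched with std-only types: the PR uses `bytes::Bytes`, but `Arc<[u8]>` is the standard-library analogue (cloning bumps an atomic refcount rather than copying bytes), and `read_column` here is a hypothetical stand-in for a column decode task, not the PR's actual code.

```rust
use std::io::{Cursor, Read, Seek, SeekFrom};
use std::sync::Arc;

/// Each column task reads its slice through an independent cursor over
/// a shared, refcounted buffer, so seeks in one task cannot disturb another.
fn read_column(buf: Arc<[u8]>, col: usize) -> String {
    let mut cursor = Cursor::new(&buf[..]);
    cursor.seek(SeekFrom::Start((col * 4) as u64)).unwrap();
    let mut chunk = [0u8; 4];
    cursor.read_exact(&mut chunk).unwrap();
    String::from_utf8(chunk.to_vec()).unwrap()
}

fn main() {
    // Read the file once into a shared immutable buffer. (The PR uses
    // bytes::Bytes; Arc<[u8]> is the std-only analogue.)
    let file_bytes: Arc<[u8]> = Arc::from(&b"col0col1col2col3"[..]);

    let handles: Vec<_> = (0..4)
        .map(|col| {
            let buf = Arc::clone(&file_bytes); // cheap clone, zero-copy
            std::thread::spawn(move || read_column(buf, col))
        })
        .collect();

    let cols: Vec<String> = handles.into_iter().map(|h| h.join().unwrap()).collect();
    assert_eq!(cols, ["col0", "col1", "col2", "col3"]);
}
```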
Add MIN_RG_BYTES_FOR_COL_PARALLELISM threshold (16 MiB uncompressed) to fall back to decode_single_rg for small row groups where per-column reader overhead (metadata clones, buffer setup, hconcat) exceeds the benefit of parallel decode. Applied to both local streaming (Path 2) and local bulk (Path 1) read paths. The CodSpeed benchmark file (1024 rows, 16 cols, ~880KB uncompressed) now takes the single-reader fast path instead of spawning 16 column tasks. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…ndle I/O The previous approach read the entire file into a bytes::Bytes buffer upfront, which added ~380ms of overhead for a 728MB file before any decode work started. To fix this, each column task now opens its own file handle via File::open (~microsecond syscall, independent seek position). The OS page cache serves subsequent reads from memory, so there is no redundant I/O. This eliminated the upfront read bottleneck and brought all_cols 8RG from 1440ms to 990ms (parity with parquet2's 996ms).

Additional threshold tuning:
- MIN_COLS_FOR_COL_PARALLELISM = 3: routes 1-2 column reads to the simpler per-RG fallback path where column splitting overhead isn't justified.
- RG count check (rg_tasks < num_cpus * 2): when row groups already saturate cores (e.g. 64 RGs on 8 cores), per-RG decode is more efficient than column splitting with its per-builder overhead.
- Async paths use MIN_COLS_FOR_COL_PARALLELISM consistently.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
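The routing heuristic described across these two commits can be sketched as a single predicate. The constant values and the inequalities come from the commit messages, but the function name `use_column_parallelism` and its signature are illustrative, not the PR's actual API.

```rust
// Thresholds taken from the commit messages above.
const MIN_RG_BYTES_FOR_COL_PARALLELISM: usize = 16 * 1024 * 1024; // 16 MiB uncompressed
const MIN_COLS_FOR_COL_PARALLELISM: usize = 3;

/// Hypothetical sketch: decide whether a row group takes the
/// column-parallel decode path or the single-reader per-RG fallback.
fn use_column_parallelism(
    rg_uncompressed_bytes: usize,
    num_cols: usize,
    rg_tasks: usize,
    num_cpus: usize,
) -> bool {
    // Small row groups: per-column reader overhead (metadata clones,
    // buffer setup, hconcat) outweighs the benefit of parallel decode.
    if rg_uncompressed_bytes < MIN_RG_BYTES_FOR_COL_PARALLELISM {
        return false;
    }
    // 1-2 columns: splitting isn't justified.
    if num_cols < MIN_COLS_FOR_COL_PARALLELISM {
        return false;
    }
    // Row groups already saturate the cores: per-RG decode avoids the
    // per-builder overhead of column splitting.
    if rg_tasks >= num_cpus * 2 {
        return false;
    }
    true
}

fn main() {
    // 1024-row, 16-col, ~880KB benchmark file -> single-reader fast path.
    assert!(!use_column_parallelism(880 * 1024, 16, 1, 8));
    // 64 RGs on 8 cores already saturate the CPUs -> per-RG decode.
    assert!(!use_column_parallelism(64 << 20, 16, 64, 8));
    // Large RG, wide table, few RGs -> column-parallel decode.
    assert!(use_column_parallelism(64 << 20, 16, 4, 8));
}
```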
Force-pushed 71f67f6 to 193b2ab
The arrow-rs parquet reader was 5-15x slower than parquet2 on S3 because each per-column async reader created its own DaftAsyncFileReader, so the ReadPlanner coalescing only saw one column's byte ranges at a time. To fix this, we compute ALL needed (RG, column) byte ranges upfront and feed them through a single ReadPlanner. The coalesced data is cached in an Arc<RangesContainer>, and each per-column reader gets a PrefetchedAsyncFileReader that serves get_byte_ranges() from cache with zero additional HTTP requests.

Changes:
- Add PrefetchedAsyncFileReader backed by pre-fetched RangesContainer
- Extract build_read_planner_and_collect() as shared helper
- Add prefetch_column_ranges() for bulk byte range pre-fetching
- Add prefetched decode variants for predicate and column phases
- Apply prefetching to all async paths (bulk + stream, fallback + col-parallel)
- Remove now-unused DaftAsyncFileReader-based async decode helpers

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
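The coalescing idea behind this commit can be illustrated with a minimal sketch: gather every (RG, column) byte range upfront, sort, and merge ranges separated by less than some gap into one request. The real ReadPlanner lives in daft-io; the `coalesce` function and the gap threshold here are assumptions for demonstration only.

```rust
use std::ops::Range;

/// Illustrative cross-column byte-range coalescing: merge ranges whose
/// gap is at most `max_gap` bytes into a single fetch, trading a little
/// over-read for far fewer HTTP GETs.
fn coalesce(mut ranges: Vec<Range<u64>>, max_gap: u64) -> Vec<Range<u64>> {
    ranges.sort_by_key(|r| r.start);
    let mut merged: Vec<Range<u64>> = Vec::new();
    for r in ranges {
        match merged.last_mut() {
            // Close enough to the previous request: extend it instead of
            // issuing a new one.
            Some(prev) if r.start <= prev.end + max_gap => {
                prev.end = prev.end.max(r.end);
            }
            _ => merged.push(r),
        }
    }
    merged
}

fn main() {
    // Byte ranges for (RG, column) chunks gathered upfront from metadata.
    let ranges = vec![0..100, 120..200, 10_000..10_500, 205..300];
    let plan = coalesce(ranges, 64);
    // Three nearby column chunks collapse into one request; the distant
    // chunk stays a separate fetch.
    assert_eq!(plan, vec![0..300, 10_000..10_500]);
}
```

When each column plans its own I/O in isolation, none of these merges across columns are visible, which is the root cause the commit fixes.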
The prefetched async decode helpers used DEFAULT_BATCH_SIZE (8192), causing arrow-rs to emit many small batches per column per RG that required an expensive concat_batches step (~15% of total S3 read time). Set batch_size to the RG row count so each column decodes in a single pass, eliminating the intermediate concat. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
…oncat Previously, async column-parallel decode created N_RGs * N_cols tasks, each decoding one (RG, column) pair. The results required two concat layers: hconcat within each RG, then RecordBatch::concat across all RGs. The final concat alone copied ~728MB of data (~21% of total time). Now we create N_cols tasks, each running a single ParquetRecordBatchStream across ALL RGs. This mirrors parquet2's architecture: parallel across columns, sequential across RGs within each column. The result is one array per column spanning all RGs, assembled with a single hconcat (no data copy, just schema + array ref merge). For the predicate path, phase 1 (per-RG predicate decode) is unchanged since per-RG RowSelections are needed. Phase 2 concatenates the per-RG selections into one combined selection and passes it to per-column streams across all RGs. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
try_join_all polls all sub-futures from a single tokio task, so CPU-bound column decode runs pseudo-sequentially on one worker thread. tokio::spawn creates independent tasks that the work-stealing scheduler distributes across all available worker threads. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Move column decode tasks from the I/O runtime (DAFTIO, 8 threads) to the compute runtime (DAFTCPU, NUM_CPUS threads). After prefetching, column decode is pure CPU work - it should run on compute threads just like the local reader does with rayon on DAFTCPU. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Replace per-column streams (16-way parallelism) with per-(RG, col) tasks (N_RGs * N_cols parallelism) on the compute runtime. For 64 RGs and 16 columns, this gives 1024-way task parallelism instead of 16. Results are grouped by column and concat'd per-column in parallel, then hconcat'd once. This avoids the old per-RG hconcat + cross-RG concat pattern and matches the local reader's parallelism strategy. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
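The collect/group/concat step this commit describes might look like the following sketch: each task returns a (rg_idx, col_idx, data) triple, results are grouped by column, restored to row-group order, and concatenated per column. Arrays are modelled as plain `Vec<i64>` purely for illustration; `group_and_concat` is hypothetical, not the actual spawn_column_decode implementation.

```rust
use std::collections::BTreeMap;

/// Group per-(RG, col) task results by column and concatenate each
/// column's chunks in row-group order, yielding one array per column
/// ready for a single final hconcat.
fn group_and_concat(results: Vec<(usize, usize, Vec<i64>)>) -> Vec<Vec<i64>> {
    // BTreeMap keeps columns in schema (index) order.
    let mut by_col: BTreeMap<usize, Vec<(usize, Vec<i64>)>> = BTreeMap::new();
    for (rg, col, data) in results {
        by_col.entry(col).or_default().push((rg, data));
    }
    by_col
        .into_values()
        .map(|mut chunks| {
            // Tasks finish in any order; restore row-group order first.
            chunks.sort_by_key(|(rg, _)| *rg);
            chunks.into_iter().flat_map(|(_, data)| data).collect()
        })
        .collect()
}

fn main() {
    // 2 RGs x 2 cols, arriving out of order as spawned tasks complete.
    let results = vec![
        (1, 0, vec![3, 4]),
        (0, 1, vec![10, 20]),
        (0, 0, vec![1, 2]),
        (1, 1, vec![30, 40]),
    ];
    let cols = group_and_concat(results);
    assert_eq!(cols, vec![vec![1, 2, 3, 4], vec![10, 20, 30, 40]]);
}
```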
…tion Extract the per-(RG, col) spawn/collect/group/concat pattern into spawn_column_decode(), removing ~90 lines of duplicated logic between the predicate and no-predicate paths. Also fix a duplicate comment and merge redundant limit branches. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
The arrow-rs parquet reader was 15-25x slower than parquet2 on S3 reads due to three root causes: no cross-column I/O coalescing, pseudo-sequential async decode, and insufficient task parallelism. This PR closes the gap to parity.
S3 Benchmarks (TPC-H lineitem ~728MB, c7i.4xlarge)
Root Causes and Fixes
1. No cross-column I/O coalescing (15-25x slowdown)
Each per-column `DaftAsyncFileReader` ran its own `ReadPlanner`, so coalescing only saw one column's byte ranges at a time. Parquet2 computed ALL column byte ranges upfront and coalesced them into fewer large HTTP requests.

To fix this, `prefetch_column_ranges()` computes all (RG, column) byte ranges and feeds them through a single `ReadPlanner`. The coalesced data is cached in an `Arc<RangesContainer>`, and each per-column reader gets a `PrefetchedAsyncFileReader` that serves `get_byte_ranges()` from cache with zero HTTP requests.

2. Pseudo-sequential async decode
`try_join_all` polls all sub-futures from a single tokio task, so CPU-bound column decode ran on one worker thread despite having 16 column tasks. Replaced with `get_compute_runtime().spawn()` per task, creating independent tasks that tokio's work-stealing scheduler distributes across all DAFTCPU threads (NUM_CPUS).

3. Insufficient task parallelism for many-RG files
Per-column streams (16-way parallelism) forced each task to process all 64 RGs sequentially. Switched to per-(RG, col) tasks on the compute runtime, giving N_RGs * N_cols parallelism (1024 tasks for 64 RGs and 16 cols). Results are grouped by column, concat'd per-column in parallel, then hconcat'd once via `spawn_column_decode()`.

Changes

src/daft-parquet/src/async_reader.rs
- `PrefetchedAsyncFileReader`: `AsyncFileReader` impl backed by a pre-fetched `RangesContainer`
- `build_read_planner_and_collect()`: shared `ReadPlanner` setup used by both `DaftAsyncFileReader` and the prefetch path

src/daft-parquet/src/arrowrs_reader.rs
- `prefetch_column_ranges()`: bulk byte range pre-fetching with cross-column/RG coalescing
- `root_to_leaf_columns()`: maps root column indices to parquet leaf indices
- `decode_rg_predicate_phase_async_prefetched()` / `decode_rg_column_async_prefetched()`: prefetched decode variants
- `spawn_column_decode()`: shared helper for per-(RG, col) task spawn, collect, group-by-column, parallel concat
- `read_parquet_single_arrowrs` (Path 3) and `stream_parquet_single_arrowrs` (Path 4) now use prefetching + compute runtime dispatch
- Removed the now-unused `DaftAsyncFileReader`-based async decode helpers

Intra-RG Column Parallelism (pre-existing on branch)
Adds column-level parallelism within each row group across all four read paths (sync bulk, sync stream, async bulk, async stream). For wide tables with few row groups, this parallelizes across columns rather than just across RGs.