
PatchedArray: basics and wiring #7002

Open
a10y wants to merge 23 commits into develop from aduffy/patched-array

Conversation

@a10y
Contributor

@a10y a10y commented Mar 17, 2026

Summary

This is the first PR in a series addressing the PatchedArray RFC: vortex-data/rfcs#27

This PR adds a new PatchedArray array variant, which is slated to be used only with FastLanes array types. The design is largely documented in the RFC, but briefly:

  • Wraps another child array
  • Holds a buffer of offsets and patch values, both sorted by chunk/lane, so there is efficient random access as well as bulk access for CUDA

We are able to push down the following at reduce time:

  • Filter pushes through to the child, and we trim the patch indices to the range of chunks covered by the filter
  • Slicing that reduces the chunk range
  • Compare always pushes through to the child
  • Take pushes through to the inner array, then constructs a hashmap of the patches
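The layout described above can be sketched in a few lines. This is a hypothetical, simplified model for illustration only: the real array uses `BufferHandle`s, per-lane offset tables, and vtable wiring, and the field names here are invented. Because the patch indices are kept sorted, random access is a binary search over them:

```rust
// Illustrative sketch only: not the PR's actual PatchedArray definition.
struct PatchedArray {
    inner: Vec<i64>,           // stand-in for the wrapped child array
    patch_indices: Vec<usize>, // sorted ascending, so each chunk's patches
                               // form a contiguous, binary-searchable run
    patch_values: Vec<i64>,
}

impl PatchedArray {
    /// Random access: read the child value, then overwrite it with a patch
    /// if one exists at this index (binary search over the sorted indices).
    fn scalar_at(&self, index: usize) -> i64 {
        match self.patch_indices.binary_search(&index) {
            Ok(pos) => self.patch_values[pos],
            Err(_) => self.inner[index],
        }
    }
}
```

Bulk access gets the same benefit: each chunk's patches are a contiguous run of the sorted buffers, so a kernel can apply them in a single pass.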

There will be follow-ups to add the wiring into CUDA and to update how BitPacked and ALP arrays are written.

Testing

There are unit tests for all of the reducers and kernels.

@codspeed-hq

codspeed-hq bot commented Mar 17, 2026

Merging this PR will degrade performance by 24.47%

❌ 11 regressed benchmarks
✅ 1095 untouched benchmarks
⏩ 1522 skipped benchmarks¹

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Performance Changes

Mode Benchmark BASE HEAD Efficiency
Simulation new_bp_prim_test_between[i32, 16384] 101.3 µs 116.1 µs -12.71%
Simulation new_bp_prim_test_between[i32, 32768] 148.9 µs 178.1 µs -16.41%
Simulation new_alp_prim_test_between[f64, 16384] 130.7 µs 152.8 µs -14.43%
Simulation new_bp_prim_test_between[i64, 32768] 182 µs 241 µs -24.47%
Simulation new_bp_prim_test_between[i64, 16384] 118 µs 147.6 µs -20.06%
Simulation old_alp_prim_test_between[f32, 32768] 398.3 µs 443.2 µs -10.13%
Simulation take_10k_random 209 µs 267.3 µs -21.8%
Simulation take_10k_contiguous 270.4 µs 329.3 µs -17.9%
Simulation old_bp_prim_test_between[i64, 32768] 473.8 µs 534.1 µs -11.3%
Simulation bitwise_not_vortex_buffer_mut[1024] 477.2 ns 535.6 ns -10.89%
Simulation bitwise_not_vortex_buffer_mut[128] 317.8 ns 376.1 ns -15.51%

Comparing aduffy/patched-array (a9ce27f) with develop (60847b0)

Open in CodSpeed

Footnotes

  1. 1522 benchmarks were skipped, so the baseline results were used instead. If they were deleted from the codebase, click here and archive them to remove them from the performance reports.

@a10y a10y added the changelog/feature A new feature label Mar 18, 2026
@a10y a10y marked this pull request as ready for review March 18, 2026 21:21
@a10y a10y requested a review from joseph-isaacs March 18, 2026 21:21
Comment on lines +51 to +53
pub(super) indices: BufferHandle,
/// patch values corresponding to the indices. The ptype is specified by `values_ptype`.
pub(super) values: BufferHandle,
Contributor

I really think we should reconsider this.

This array (+ constant) should be used to replace the sparse array so I think we really want to store things other than PTypes?

Contributor Author

I'm not sure if this array is well-suited for non-primitive types. How many lanes does a List<Struct{u32,utf8}> have? I was tempted to actually put this into the vortex-fastlanes crate tbh

Contributor Author

I've updated to use a child array for values, which should give us some flexibility to change things in the future

@a10y a10y force-pushed the aduffy/patched-array branch from c3d7434 to 6c1d7aa on March 19, 2026 13:24
@a10y a10y force-pushed the aduffy/patched-array branch from 860cc64 to 042394b on March 24, 2026 19:17
@a10y a10y force-pushed the aduffy/patched-array branch 4 times, most recently from df8dbf5 to 862d814 on March 26, 2026 19:54
@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude review please

@claude

This comment was marked as resolved.

@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude i have addressed your main comments. please give another review.

@claude

This comment was marked as resolved.

@a10y a10y force-pushed the aduffy/patched-array branch from daf258f to 578d83d on March 27, 2026 14:35
@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude please review

@claude

This comment was marked as resolved.

@a10y
Contributor Author

a10y commented Mar 27, 2026

@claude PR is ready for another review

@claude
Contributor

claude bot commented Mar 27, 2026

Claude finished @a10y's task in 8m 28s


PR Review: PatchedArray basics and wiring

Tasks:

  • Read PR context and understand the changes
  • Read core array implementation files
  • Read compute kernel implementations
  • Read vtable implementation
  • Read test/compat fixtures
  • Provide comprehensive review feedback

Overall this is solid work — the design is clean, the test coverage is good, and the key operations (slice, filter, take, compare, scalar_at) are all well-structured. I found three correctness bugs of varying severity. Details below.


🔴 Bug 1: build() computes n_chunks incorrectly for sliced arrays

File: vortex-array/src/arrays/patched/vtable/mod.rs, line 246

let n_chunks = len.div_ceil(1024);  // WRONG for offset > 0

When a PatchedArray is sliced at a non-chunk boundary, array.offset > 0. For example, a slice from element 5 to element 1025 produces len = 1020, offset = 5, but the array spans 2 chunks. The lane_offsets buffer has n_chunks_actual * n_lanes + 1 = 2 * n_lanes + 1 entries, but build() reconstructs with n_chunks = 1020.div_ceil(1024) = 1, causing all downstream chunk/lane indexing to be wrong after a round-trip through serialization.

Fix: let n_chunks = (len + metadata.offset as usize).div_ceil(1024);

This matches how slice_chunks counts chunks and how SliceReduce selects chunk boundaries. All the invariants hold because n_chunks is always derivable from (len + offset).

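The chunk-count arithmetic can be checked in isolation. This is a standalone sketch, not the PR's actual code; the function names are invented, and the 1024-element chunk size is taken from the snippets above:

```rust
// Sketch: a slice starting mid-chunk still spans that partial first chunk,
// so the chunk count must be derived from (len + offset), not len alone.
fn n_chunks_buggy(len: usize, _offset: usize) -> usize {
    len.div_ceil(1024)
}

fn n_chunks_fixed(len: usize, offset: usize) -> usize {
    (len + offset).div_ceil(1024)
}
```

For the example in the bug report, a slice covering elements 5..1025 has len = 1020 and offset = 5: the buggy form yields 1 chunk while the fixed form yields 2, matching the lane_offsets buffer.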


🔴 Bug 2: Missing upper-bounds check in compare.rs's ApplyPatches::apply

File: vortex-array/src/arrays/patched/compute/compare.rs, lines 139–149

let bit_index = chunk * 1024 + patch_index as usize;
if bit_index < self.offset {
    continue;  // lower-bound guard ✓
}
let bit_index = bit_index - self.offset;
// No upper-bound check — can write past end of bits buffer!
if cmp(patch_value, self.constant) {
    self.bits.set(bit_index)
} else {
    self.bits.unset(bit_index)
}

The bits buffer has lhs.len bits. A patch in the final chunk at a position beyond offset + len (which can easily happen when len is not a multiple of 1024 — e.g., 2048-element array with a patch at index 2047 sliced to [0, 2040)) causes bit_index = 2047 - 0 = 2047 >= 2040, an out-of-bounds write.

Fix: Add if bit_index >= self.bits.len() { continue; } (or use self.offset + lhs_len as an explicit upper bound) after computing bit_index - self.offset. The equivalent guard already exists in apply_patches_primitive (line 369 of vtable/mod.rs).

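A standalone sketch of the guarded form, using a plain `Vec<bool>` in place of the real bit buffer and a fixed `>` comparator in place of `cmp`; names and types are illustrative, not the PR's API:

```rust
/// Apply comparison patches into a result mask of `bits.len()` bits,
/// skipping patches that fall outside the window [offset, offset + len).
fn apply_patches(
    bits: &mut [bool],
    offset: usize,
    patches: &[(usize, i64)], // (absolute index, patch value)
    constant: i64,
) {
    for &(abs_index, value) in patches {
        if abs_index < offset {
            continue; // lower-bound guard (already present in the PR)
        }
        let bit_index = abs_index - offset;
        if bit_index >= bits.len() {
            continue; // upper-bound guard: patch lies past the slice end
        }
        bits[bit_index] = value > constant;
    }
}
```

With the 2040-bit example from the report, a patch at absolute index 2047 now falls through the upper-bound guard instead of writing out of bounds.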


🔴 Bug 3: Double-offset in slice_chunks when chunks.start == 0 and offset > 0

File: vortex-array/src/arrays/patched/array.rs, lines 138–168

let begin = (chunks.start * 1024).max(self.offset);  // = self.offset when start=0
let inner = self.inner.slice(begin..end)?;            // inner[0] = original[self.offset]
let offset = begin % 1024;                            // = self.offset again

When chunks.start == 0 and self.offset > 0 (i.e. filtering a non-chunk-boundary-sliced array that doesn't prune front chunks), the inner gets sliced by self.offset elements AND the new offset field is still self.offset. During execute(), buffer.slice_typed(offset..offset+len) then skips another self.offset elements — effectively double-counting the offset.

This path is reachable via filter.rs: when chunk_start == 0 && chunk_stop < n_chunks, slice_chunks(0..chunk_stop) is called. The existing test_filter_sliced test avoids this because it slices at chunk boundary 1024 (offset=0).

Fix: Distinguish the two cases:

  • When chunks.start > 0: begin = chunks.start * 1024, new offset = 0 (chunk-aligned start)
  • When chunks.start == 0: begin = 0, new offset = self.offset, new len = min(chunks.end * 1024, self.offset + self.len) - self.offset

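The two cases can be sketched as a pure function over the chunk-range start and the old offset; the function name is illustrative and this is not the PR's actual signature:

```rust
/// Returns (inner_slice_begin, new_offset) for slicing a patched array down
/// to a chunk range, per the two cases above (1024-element chunks assumed).
fn slice_chunk_params(chunk_start: usize, old_offset: usize) -> (usize, usize) {
    if chunk_start > 0 {
        // Chunk-aligned start: the inner slice begins exactly on a chunk
        // boundary, so no residual sub-chunk offset remains.
        (chunk_start * 1024, 0)
    } else {
        // Keeping the front chunk: don't slice the inner by `old_offset`
        // here, or the offset would be applied twice at execute() time.
        (0, old_offset)
    }
}
```

The key invariant is that exactly one of the two mechanisms (slicing the inner, or carrying a nonzero offset field) accounts for the sub-chunk start, never both.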


🟡 Memory: slice_chunks retains all patch data

File: array.rs, lines 145–146

let indices = self.indices.clone();
let values = self.values.clone();

After slice_chunks, the indices and values arrays still hold patches for the entire original array, not just the retained chunk range. The lane_offsets correctly limits access, so this is not a correctness issue, but after a selective filter on a large array, all patches are still in memory. Worth a // TODO: trim indices and values to the patch range of lane_offsets[0]..lane_offsets[last] comment at least.


🟡 take_map builds the full patch hashmap even for selective takes

File: compute/take.rs, lines 92–107

for chunk in 0..n_chunks {
    for lane in 0..n_lanes {
        // ... inserts all patches into hashmap
    }
}

This scans all patches regardless of which elements are being taken. For a sparse take on a large array with many patches, it might be worth looping over the take indices first to identify which chunks are relevant, then only loading patches for those chunks. Not a blocker, but worth a // TODO for future optimization.
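One possible shape for that optimization, as a sketch with an illustrative function name and the 1024-element chunk size assumed: first reduce the take indices to the set of chunks they touch, then load patches only for those chunks.

```rust
use std::collections::HashSet;

/// Identify which chunks are actually referenced by a take, so only their
/// patches need to be inserted into the hashmap.
fn chunks_touched(take_indices: &[usize]) -> HashSet<usize> {
    take_indices.iter().map(|&i| i / 1024).collect()
}
```

For a sparse take this turns the O(total patches) hashmap build into one proportional to the patches of the touched chunks only.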


✅ Things that look good

  • The transpose / scalar_at lane computation is consistent (both use absolute_index % n_lanes, where absolute_index is relative to the pre-offset chunk start, and n_lanes always divides 1024, so chunk boundaries don't break lane alignment).
  • apply_patches_primitive correctly guards against both lower and upper out-of-bounds patches.
  • The SliceReduce implementation is elegant — slicing always to chunk boundaries and tracking the sub-chunk offset is exactly right.
  • The compat fixture is good, covering multi-chunk with patches in different chunks.
  • Test coverage is thorough for the happy paths: rstest parameterization in slice.rs is especially nice.

@a10y

This comment was marked as resolved.

a10y added 5 commits March 27, 2026 15:08
Signed-off-by: Andrew Duffy <andrew@a10y.dev>
a10y added 18 commits March 27, 2026 15:08
We call append_to_builder on the inner first, then just do a single pass
and overwrite what it just wrote.
@a10y a10y force-pushed the aduffy/patched-array branch from 8648244 to a9ce27f on March 27, 2026 19:33

Labels

changelog/feature A new feature
