Conversation

@robert3005
Contributor

The minimal version of the RLE kernel is the same as take; however, as we add
handling for masked kernels they will diverge.

Signed-off-by: Robert Kruszewski [email protected]

@robert3005 robert3005 added the feature Release label indicating a new feature or request label Oct 7, 2025

codecov bot commented Oct 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.46%. Comparing base (487e8ce) to head (2c81614).
⚠️ Report is 2 commits behind head on develop.


@0ax1 0ax1 self-requested a review October 7, 2025 10:18

@0ax1 0ax1 left a comment


Couple of questions and remarks.


[[package]]
name = "arrow-arith"
version = "55.2.0"
Contributor


This seems unrelated.

@robert3005
Contributor Author

RLE is not take... it's a very specific take where every 1024 indices are local to their value range, so we need to slice the values with offsets; that's why the result is incorrect.
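To illustrate the point above, a CPU reference of this block-local take might look like the following Rust sketch. The function name and concrete types are hypothetical, not the PR's code; the actual kernel runs on the GPU.

```rust
/// CPU reference sketch of the block-local RLE take described above
/// (hypothetical helper). Every 1024-element chunk of `indices` addresses
/// `values` relative to that chunk's entry in `offsets`.
fn rle_decompress_ref(indices: &[u32], values: &[i64], offsets: &[usize]) -> Vec<i64> {
    const BLOCK: usize = 1024;
    indices
        .chunks(BLOCK)
        .zip(offsets)
        .flat_map(|(chunk, &off)| {
            // Rebase each chunk-local index by the chunk's value offset.
            chunk.iter().map(move |&idx| values[off + idx as usize])
        })
        .collect()
}
```

A plain take would index `values` globally; the per-chunk rebase through `offsets` is exactly what distinguishes the RLE kernel from ordinary take.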

@robert3005 robert3005 force-pushed the rk/rlekernel branch 3 times, most recently from 325ad89 to 96e9248 on October 8, 2025 00:34
Contributor


any reason to avoid floats in the ValueT?

Contributor Author


There's not; I was working on tests to improve the coverage.

@robert3005 robert3005 enabled auto-merge (squash) October 9, 2025 14:20

codspeed-hq bot commented Oct 9, 2025

CodSpeed Performance Report

Merging #4864 will degrade performance by 10.68%

Comparing rk/rlekernel (2c81614) with develop (943c4c3)¹

Summary

❌ 4 regressions
✅ 1168 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                           BASE     HEAD     Change
take_indices[(1000, 16, 0.005)]     20.6 µs  22.9 µs  -10.22%
take_indices[(1000, 256, 0.005)]    19.8 µs  22.2 µs  -10.68%
take_indices[(1000, 256, 0.01)]     20 µs    22.3 µs  -10.48%
take_indices[(1000, 256, 0.03)]     20.5 µs  22.9 µs  -10.24%

Footnotes

  1. No successful run was found on develop (487e8ce) during the generation of this report, so 943c4c3 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Signed-off-by: Robert Kruszewski <[email protected]>
auto-merge was automatically disabled October 9, 2025 15:32

Pull Request is not mergeable

Comment on lines +9 to +28
__device__ void rle_decompress(
    const IndicesT *__restrict indices_array,
    const ValueT *__restrict values_array,
    const OffsetsT *__restrict offsets,
    ValueT *__restrict values_out
) {
    auto i = threadIdx.x;
    // Each thread block decodes one 1024-element chunk of indices.
    auto block_offset = blockIdx.x * 1024;

    auto indices = indices_array + block_offset;
    auto out = values_out + block_offset;
    // Indices are local to their chunk, so rebase values by the chunk's offset.
    auto values = values_array + offsets[blockIdx.x];

    // 32 threads x 32 elements per thread = 1024 elements per block.
    const int thread_ops = 32;

    for (auto j = 0; j < thread_ops; j++) {
        auto idx = i * thread_ops + j;
        out[idx] = values[indices[idx]];
    }
}
Contributor


We will need a fl-transposed iterator order version?

Contributor


Would you mind adding a single fused bp-rle kernel, where we fuse bit-unpacking the indices and RLE decoding?

It can be in a follow-up, but while you understand how the decode works we should merge something showing that.
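For reference, a CPU sketch of what such a fused bp-rle decode could do, under stated assumptions: the function name is hypothetical, and it assumes little-endian packing of `bit_width`-bit indices into `u32` words and a single chunk with one value offset. The point is that each index is unpacked on the fly and immediately used to gather its value, with no materialized index buffer in between.

```rust
/// Hypothetical CPU sketch of a fused bit-unpack + RLE-decode step.
/// `packed` holds `len` indices, each `bit_width` bits wide, packed
/// little-endian into u32 words; `offset` rebases the chunk's values.
fn fused_bp_rle(packed: &[u32], bit_width: u32, values: &[i64], offset: usize, len: usize) -> Vec<i64> {
    let mask = (1u64 << bit_width) - 1;
    (0..len)
        .map(|i| {
            let bit_pos = i as u64 * bit_width as u64;
            let word = (bit_pos / 32) as usize;
            let shift = (bit_pos % 32) as u32;
            // Read up to 64 bits across a word boundary, then mask out the index.
            let lo = packed[word] as u64;
            let hi = *packed.get(word + 1).unwrap_or(&0) as u64;
            let idx = (((hi << 32) | lo) >> shift) & mask;
            // Fused step: gather the RLE value directly from the unpacked index.
            values[offset + idx as usize]
        })
        .collect()
}
```

A GPU version would additionally have to respect whatever iteration order the bit-packed layout uses (e.g. the fl-transposed order discussed below), which this linear sketch ignores.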

}
}

#[allow(clippy::cognitive_complexity)]
Contributor


I think we should follow this hint. Can we pull out the inner match below?

#[case::i64((-2000i64..2000).collect::<Buffer<i64>>())]
#[case::f32((-2000..2000).map(|i| i as f32).collect::<Buffer<f32>>())]
#[case::f64((-2000..2000).map(|i| i as f64).collect::<Buffer<f64>>())]
fn test_cuda_rle_decompress<T: NativePType>(#[case] values: Buffer<T>) {
Contributor


Do we need to exercise the different offset types?


@joseph-isaacs joseph-isaacs left a comment


I think it's good to go, but it would be great to implement a fused bp-rle kernel so we can see how that might be done.

@joseph-isaacs
Contributor

I think it's as simple as changing the indices iteration order to fl-transposed, but it might not be.

@robert3005 robert3005 merged commit 1ec85ef into develop Oct 9, 2025
47 of 48 checks passed
@robert3005 robert3005 deleted the rk/rlekernel branch October 9, 2025 22:18