Conversation

@robert3005
Contributor

The minimal version of the RLE kernel is the same as take; however, as we add
handling for masked kernels they will diverge.

Signed-off-by: Robert Kruszewski [email protected]

@robert3005 robert3005 added the feature Release label indicating a new feature or request label Oct 7, 2025

codecov bot commented Oct 7, 2025

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 87.46%. Comparing base (487e8ce) to head (2c81614).
⚠️ Report is 2 commits behind head on develop.


@0ax1 0ax1 self-requested a review October 7, 2025 10:18

@0ax1 0ax1 left a comment


Couple of questions and remarks.


[[package]]
name = "arrow-arith"
version = "55.2.0"
Contributor


This seems unrelated.

@robert3005
Contributor Author

RLE is not take... it's a very specific take where every 1024 indices are local to their value range, so we need to slice the values with offsets; that's why the result is incorrect.
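To illustrate the point above, a CPU reference of this block-local take might look like the following Rust sketch. The function name and concrete types are hypothetical, not the PR's code; the actual kernel runs on the GPU.

```rust
/// CPU reference sketch of the block-local RLE take described above
/// (hypothetical helper). Every 1024-element chunk of `indices` addresses
/// `values` relative to that chunk's entry in `offsets`.
fn rle_decompress_ref(indices: &[u32], values: &[i64], offsets: &[usize]) -> Vec<i64> {
    const BLOCK: usize = 1024;
    indices
        .chunks(BLOCK)
        .zip(offsets)
        .flat_map(|(chunk, &off)| {
            // Rebase each chunk-local index by the chunk's value offset.
            chunk.iter().map(move |&idx| values[off + idx as usize])
        })
        .collect()
}
```

A plain take would index `values` globally; the per-chunk rebase through `offsets` is exactly what distinguishes the RLE kernel from ordinary take.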

@robert3005 robert3005 force-pushed the rk/rlekernel branch 3 times, most recently from 325ad89 to 96e9248 on October 8, 2025 00:34
Contributor


any reason to avoid floats in the ValueT?

Contributor Author


There's not; I was working on tests to improve the coverage.

@robert3005 robert3005 enabled auto-merge (squash) October 9, 2025 14:20

codspeed-hq bot commented Oct 9, 2025

CodSpeed Performance Report

Merging #4864 will degrade performance by 10.68%

Comparing rk/rlekernel (2c81614) with develop (943c4c3)¹

Summary

❌ 4 regressions
✅ 1168 untouched

⚠️ Please fix the performance issues or acknowledge them on CodSpeed.

Benchmarks breakdown

Benchmark                           BASE     HEAD     Change
take_indices[(1000, 16, 0.005)]     20.6 µs  22.9 µs  -10.22%
take_indices[(1000, 256, 0.005)]    19.8 µs  22.2 µs  -10.68%
take_indices[(1000, 256, 0.01)]     20 µs    22.3 µs  -10.48%
take_indices[(1000, 256, 0.03)]     20.5 µs  22.9 µs  -10.24%

Footnotes

  1. No successful run was found on develop (487e8ce) during the generation of this report, so 943c4c3 was used instead as the comparison base. There might be some changes unrelated to this pull request in this report.

Signed-off-by: Robert Kruszewski <[email protected]>
auto-merge was automatically disabled October 9, 2025 15:32

Pull Request is not mergeable

Comment on lines +9 to +28
__device__ void rle_decompress(
    const IndicesT *__restrict indices_array,
    const ValueT *__restrict values_array,
    const OffsetsT *__restrict offsets,
    ValueT *__restrict values_out
) {
    auto i = threadIdx.x;
    // Each thread block decodes one 1024-element chunk of indices.
    auto block_offset = blockIdx.x * 1024;

    auto indices = indices_array + block_offset;
    auto out = values_out + block_offset;
    // Indices are local to their chunk, so rebase values by the chunk's offset.
    auto values = values_array + offsets[blockIdx.x];

    // 32 threads x 32 elements per thread = 1024 elements per block.
    const int thread_ops = 32;

    for (auto j = 0; j < thread_ops; j++) {
        auto idx = i * thread_ops + j;
        out[idx] = values[indices[idx]];
    }
}
Contributor


We will need a fl-transposed iterator order version?

Contributor


Would you mind adding a single fused bp-rle kernel, where we fuse bit-unpacking the indices and RLE decoding?

It can be in a follow-up, but while you understand how the decode works we should merge something showing that.
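For reference, a CPU sketch of what such a fused bp-rle decode could do, under stated assumptions: the function name is hypothetical, and it assumes little-endian packing of `bit_width`-bit indices into `u32` words and a single chunk with one value offset. The point is that each index is unpacked on the fly and immediately used to gather its value, with no materialized index buffer in between.

```rust
/// Hypothetical CPU sketch of a fused bit-unpack + RLE-decode step.
/// `packed` holds `len` indices, each `bit_width` bits wide, packed
/// little-endian into u32 words; `offset` rebases the chunk's values.
fn fused_bp_rle(packed: &[u32], bit_width: u32, values: &[i64], offset: usize, len: usize) -> Vec<i64> {
    let mask = (1u64 << bit_width) - 1;
    (0..len)
        .map(|i| {
            let bit_pos = i as u64 * bit_width as u64;
            let word = (bit_pos / 32) as usize;
            let shift = (bit_pos % 32) as u32;
            // Read up to 64 bits across a word boundary, then mask out the index.
            let lo = packed[word] as u64;
            let hi = *packed.get(word + 1).unwrap_or(&0) as u64;
            let idx = (((hi << 32) | lo) >> shift) & mask;
            // Fused step: gather the RLE value directly from the unpacked index.
            values[offset + idx as usize]
        })
        .collect()
}
```

A GPU version would additionally have to respect whatever iteration order the bit-packed layout uses (e.g. the fl-transposed order discussed below), which this linear sketch ignores.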

}
}

#[allow(clippy::cognitive_complexity)]
Contributor


I think we should follow this hint. Can we pull out the inner match below?

#[case::i64((-2000i64..2000).collect::<Buffer<i64>>())]
#[case::f32((-2000..2000).map(|i| i as f32).collect::<Buffer<f32>>())]
#[case::f64((-2000..2000).map(|i| i as f64).collect::<Buffer<f64>>())]
fn test_cuda_rle_decompress<T: NativePType>(#[case] values: Buffer<T>) {
Contributor


Do we need to exercise the different offset types?


@joseph-isaacs joseph-isaacs left a comment


I think it's good to go, but it would be great to implement a fused bp-rle kernel so we can see how that might be done.

@joseph-isaacs
Contributor

I think it's as simple as changing the indices iteration order to fl-transposed, but it might not be.

@robert3005 robert3005 merged commit 1ec85ef into develop Oct 9, 2025
47 of 48 checks passed
@robert3005 robert3005 deleted the rk/rlekernel branch October 9, 2025 22:18