RLE gpu decompress kernel #4864
Codecov Report: ✅ All modified and coverable lines are covered by tests.
0ax1
left a comment
Couple of questions and remarks.
[[package]]
name = "arrow-arith"
version = "55.2.0"
This seems unrelated.
RLE is not take. It's a very specific take in which every block of 1024 indices is local to its own value range, so we need to slice values with offsets; that's why the result is incorrect.
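To make the difference concrete, here is a minimal CPU sketch in Rust, not the PR's implementation. The function names `take` and `rle_decode` and the concrete `i32`/`u16` types are assumptions chosen for illustration; only the chunk size of 1024 and the idea of slicing values with per-chunk offsets come from the comment above.

```rust
/// Number of indices per chunk, as described in the comment above.
const CHUNK: usize = 1024;

/// Plain take: every index addresses the whole `values` slice.
fn take(values: &[i32], indices: &[u16]) -> Vec<i32> {
    indices.iter().map(|&i| values[i as usize]).collect()
}

/// Chunk-local decode (hypothetical helper): the indices of chunk `c`
/// are relative to `values[offsets[c]..]`, not to the start of `values`.
fn rle_decode(values: &[i32], indices: &[u16], offsets: &[usize]) -> Vec<i32> {
    let mut out = Vec::with_capacity(indices.len());
    for (chunk, &offset) in indices.chunks(CHUNK).zip(offsets) {
        // Slice the values so that index 0 means "first value of this chunk".
        let local = &values[offset..];
        out.extend(chunk.iter().map(|&i| local[i as usize]));
    }
    out
}
```

With two chunks whose values start at offsets 0 and 2, a plain `take` reads the second chunk's indices against the first chunk's values, which is the kind of incorrect result described above.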
325ad89 to 96e9248
any reason to avoid floats in the ValueT?
there's not, I was working on tests to improve the coverage
96e9248 to 123b09e
CodSpeed Performance Report: Merging #4864 will degrade performance by 10.68%.
Signed-off-by: Robert Kruszewski <[email protected]>
1850b32 to 2c81614
Pull Request is not mergeable
__device__ void rle_decompress(
    const IndicesT *__restrict indices_array,
    const ValueT *__restrict values_array,
    const OffsetsT *__restrict offsets,
    ValueT *__restrict values_out
) {
    auto i = threadIdx.x;
    // Each block decodes one chunk of 1024 elements.
    auto block_offset = blockIdx.x * 1024;

    auto indices = indices_array + block_offset;
    auto out = values_out + block_offset;
    // Indices in this chunk are relative to the chunk's first value.
    auto values = values_array + offsets[blockIdx.x];

    // Each thread decodes 32 consecutive elements of the chunk.
    const int thread_ops = 32;

    for (auto j = 0; j < thread_ops; j++) {
        auto idx = i * thread_ops + j;
        out[idx] = values[indices[idx]];
    }
}
Will we need a version with fl-transposed iteration order?
Would you mind adding a single fused bp-rle kernel, where we fuse bit-unpacking the indices with RLE decoding?
It can be in a follow-up, but while you understand how the decode works we should merge something showing that.
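As a rough illustration of what "fused" could mean here, the following Rust sketch unpacks each index and uses it for the value lookup in the same pass, without materializing an intermediate index array. It is a CPU sketch under stated assumptions, not the requested GPU kernel: the packed layout is assumed to be plain LSB-first contiguous bit-packing into `u32` words (not the fl-transposed layout), and `pack`, `unpack_at` and `bp_rle_decode` are hypothetical names.

```rust
/// Number of indices per chunk.
const CHUNK: usize = 1024;

/// Pack `width`-bit indices LSB-first into u32 words (assumed layout,
/// used here only to build test input).
fn pack(indices: &[usize], width: usize) -> Vec<u32> {
    let mut packed = vec![0u32; (indices.len() * width + 31) / 32];
    for (k, &idx) in indices.iter().enumerate() {
        debug_assert!(idx < (1usize << width));
        let bit = k * width;
        let (word, shift) = (bit / 32, bit % 32);
        let v = (idx as u64) << shift;
        packed[word] |= v as u32;
        if shift + width > 32 {
            // The index straddles a word boundary; spill the high bits.
            packed[word + 1] |= (v >> 32) as u32;
        }
    }
    packed
}

/// Read the `k`-th `width`-bit index from the packed words.
fn unpack_at(packed: &[u32], width: usize, k: usize) -> usize {
    debug_assert!((1..=16).contains(&width));
    let bit = k * width;
    let (word, shift) = (bit / 32, bit % 32);
    let mut v = (packed[word] >> shift) as u64;
    if shift + width > 32 {
        v |= (packed[word + 1] as u64) << (32 - shift);
    }
    (v & ((1u64 << width) - 1)) as usize
}

/// Fused decode: unpack one index and immediately look up its value in
/// the chunk-local slice `values[offsets[chunk]..]`.
fn bp_rle_decode(
    values: &[i32],
    packed: &[u32],
    width: usize,
    offsets: &[usize],
    len: usize,
) -> Vec<i32> {
    (0..len)
        .map(|k| {
            let local = &values[offsets[k / CHUNK]..];
            local[unpack_at(packed, width, k)]
        })
        .collect()
}
```

On a GPU the same fusion would presumably keep the unpacked index in a register instead of writing it to global memory, but the memory layout and thread mapping would need to follow whatever the bit-unpacking kernel uses.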
    }
}

#[allow(clippy::cognitive_complexity)]
I think we should follow this hint. Can we pull out the inner match below?
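One common way to act on this kind of suggestion is to move the inner `match` into its own small function, so each function holds a single level of dispatch. The sketch below only illustrates that refactoring pattern; the `PType` enum, the kernel naming scheme and both function names are hypothetical and are not taken from the PR.

```rust
/// Hypothetical stand-in for a primitive type tag.
#[derive(Clone, Copy, Debug, PartialEq)]
enum PType {
    U8,
    U16,
    I32,
    F32,
}

/// The former inner match, pulled out into a helper: maps the index
/// type to a name suffix, or `None` if the type is unsupported.
fn index_suffix(indices: PType) -> Option<&'static str> {
    match indices {
        PType::U8 => Some("u8"),
        PType::U16 => Some("u16"),
        _ => None,
    }
}

/// The outer function now contains only one match; the nested dispatch
/// is a plain function call. The name format is made up for this sketch.
fn kernel_name(values: PType, indices: PType) -> Option<String> {
    let value_suffix = match values {
        PType::I32 => "i32",
        PType::F32 => "f32",
        _ => return None,
    };
    Some(format!(
        "rle_decompress_{}_{}",
        value_suffix,
        index_suffix(indices)?
    ))
}
```

Splitting the dispatch this way usually lowers the measured complexity of each function, which may make the `#[allow(clippy::cognitive_complexity)]` attribute unnecessary.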
#[case::i64((-2000i64..2000).collect::<Buffer<i64>>())]
#[case::f32((-2000..2000).map(|i| i as f32).collect::<Buffer<f32>>())]
#[case::f64((-2000..2000).map(|i| i as f64).collect::<Buffer<f64>>())]
fn test_cuda_rle_decompress<T: NativePType>(#[case] values: Buffer<T>) {
do we need to exercise the different offset types?
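If the answer is yes, one possible shape for such coverage is a decode that is generic over the offset type plus a helper instantiated once per type. This is a CPU-only sketch using the standard library; `rle_decode` and `decode_with_offset_type` are hypothetical names, and the actual test in the PR uses `rstest` cases and the CUDA kernel instead.

```rust
/// Number of indices per chunk.
const CHUNK: usize = 1024;

/// CPU reference decode, generic over value, index and offset types.
fn rle_decode<V, I, O>(values: &[V], indices: &[I], offsets: &[O]) -> Vec<V>
where
    V: Copy,
    I: Copy + Into<u64>,
    O: Copy + Into<u64>,
{
    let mut out = Vec::with_capacity(indices.len());
    for (chunk, &offset) in indices.chunks(CHUNK).zip(offsets) {
        let start: u64 = offset.into();
        // Indices in this chunk are relative to `values[start..]`.
        let local = &values[start as usize..];
        out.extend(chunk.iter().map(|&i| {
            let i: u64 = i.into();
            local[i as usize]
        }));
    }
    out
}

/// Decode the same two-chunk input with offsets stored as type `O`.
fn decode_with_offset_type<O>() -> Vec<i32>
where
    O: Copy + Into<u64> + TryFrom<usize>,
{
    let values = [10, 20, 30, 40];
    let offsets: Vec<O> = [0usize, 2]
        .iter()
        .map(|&o| O::try_from(o).ok().expect("offset fits in O"))
        .collect();
    // Every index is 1, so chunk 0 yields 20 and chunk 1 yields 40.
    let indices = vec![1u16; 2 * CHUNK];
    rle_decode(&values, &indices, &offsets)
}
```

Calling `decode_with_offset_type::<u8>()`, `::<u16>()`, `::<u32>()` and `::<u64>()` and comparing the results against one expected vector would exercise each offset width with the same data.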
joseph-isaacs
left a comment
I think it's good to go, but it would be great to implement a fused bp-rle kernel so we can see how that might be done.
I think it's as simple as changing the indices iteration order to fl-transposed, but it might not be.
The minimal version of the RLE kernel is the same as take; however, as we add
handling for masked kernels, the two will diverge.
Signed-off-by: Robert Kruszewski [email protected]