GH-47895: [C++][Parquet] Add prolog and epilog in unpack #47896
pitrou merged 12 commits into apache:main
Conversation
@pitrou this is ready for review (waiting for CI to finish here). With this we could also investigate removing the

There's a sanitizer failure that needs fixing here: (I suppose it happens when length == 0...)

Perhaps that can even be done in this PR? It doesn't sound very complicated...
@ursabot please benchmark lang=C++ |
Benchmark runs are scheduled for commit 600696c. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit 600696c. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details.
@ursabot please benchmark lang=C++ |
Benchmark runs are scheduled for commit 287f136. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 287f136. There were 7 benchmark results indicating a performance regression:
The full Conbench report has more details.
pitrou left a comment
Some comments on the implementation. I haven't looked at the bpacking tests.
    const int spread = byte_end - byte_start + 1;
    max = spread > max ? spread : max;
    start += width;
  } while (start % 8 != bit_offset);
Note that this will be an infinite loop if bit_offset >= 8 (hence the DCHECK suggestion below)

Indeed, though that function is never used at runtime, only at compile time.
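To make the termination condition concrete, here is a hypothetical, self-contained reconstruction of the constexpr helper under review (names like `MaxByteSpread` are illustrative, not from the PR): it computes the maximum number of bytes any single packed value can span. Since `start % 8` is always in [0, 8), the loop can never exit when `bit_offset >= 8`, which is exactly what the suggested DCHECK would guard against.

```cpp
#include <cassert>

// Hypothetical reconstruction; the real function lives in the bpacking code.
constexpr int MaxByteSpread(int width, int bit_offset) {
  // Stand-in for the suggested ARROW_DCHECK_LT(bit_offset, 8):
  // the loop condition below compares start % 8 (always < 8) to bit_offset.
  assert(bit_offset >= 0 && bit_offset < 8);
  int max = 0;
  int start = bit_offset;
  do {
    const int byte_start = start / 8;
    const int byte_end = (start + width - 1) / 8;
    const int spread = byte_end - byte_start + 1;
    max = spread > max ? spread : max;
    start += width;
  } while (start % 8 != bit_offset);
  return max;
}
```

For example, 3-bit values starting at a byte boundary periodically straddle two bytes, while byte-aligned 8-bit values always fit in one.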
@pitrou removing the For reference: that

Ah, sorry! Let's just restore it then :)

I never pushed it, this was on a local benchmark.
    // Easy case to handle, simply setting memory to zero.
    return unpack_null(in, out, batch_size);
  } else {
    // In case of misalignment, we need to run the prolog until aligned.
As a TODO, if batch_size is large enough, we can perhaps rewind to the last byte-aligned pack and SIMD-unpack kValuesUnpacked into a local buffer, instead of going through unpack_exact.
(this seems lower-priority than SIMD shuffling, though)

Here my suspicion is that we rarely pass unaligned inputs in practice.

That's a reasonable assumption indeed. We could later trace through the test suite (add logs, for example?) and see whether that actually happens.
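As a side note on "run the prolog until aligned": the number of scalar prolog values depends only on the bit width and the starting offset. The helper below is purely illustrative (`PrologLength` is not a function in the PR); it counts how many values must be unpacked one at a time before the read cursor lands on a byte boundary, assuming the offset is reachable for that width (i.e. a multiple of gcd(width, 8)).

```cpp
#include <cassert>
#include <numeric>  // std::gcd

// Illustrative helper, not part of the PR: number of values to unpack
// scalar-wise before the bit cursor is byte-aligned again.
constexpr int PrologLength(int width, int bit_offset) {
  // Offsets produced by packing `width`-bit values are multiples of
  // gcd(width, 8); other offsets would never realign.
  assert(width > 0 && bit_offset % std::gcd(width, 8) == 0);
  int count = 0;
  int pos = bit_offset % 8;
  while (pos != 0) {
    pos = (pos + width) % 8;
    ++count;
  }
  return count;
}
```

For instance, starting 3 bits into a byte with 3-bit values, seven values are consumed before realignment (3 + 7 * 3 = 24 bits, a whole number of bytes).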
    ARROW_DCHECK_GE(batch_size, 0);
    ARROW_COMPILER_ASSUME(batch_size < kValuesUnpacked);
    ARROW_COMPILER_ASSUME(batch_size >= 0);
    unpack_exact<kPackedBitWidth, false>(in, out, batch_size, /* bit_offset= */ 0);
Similarly, if there's enough padding at the end of the input, we could SIMD-unpack a full kValuesUnpacked into a local buffer.

This one I thought about, but it would require passing another parameter to know where the input buffer ends (regardless of the batch_size). We could also pass where the output buffer ends and unpack there directly, to skip the local buffer entirely.
I think we should investigate this and only do it if there is a benefit. Though that must come after the (hoped-for) changes to the SIMD shuffles, because they change the size of the iterations.

Well, if we want to use SIMD shuffles maximally, we probably want to pass the input buffer end indeed.
But that can be done later.
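A minimal sketch of the idea being deferred here, under stated assumptions: all names (`unpack_epilog_via_scratch`, `unpack_full_block`) are hypothetical, and the "SIMD kernel" is replaced by a trivial scalar widening of 8-bit values so the sketch is self-contained. The point is only the control flow: if the caller supplies the input end, the epilog can unpack a full block into a scratch buffer and copy out just `batch_size` values instead of falling back to the scalar path.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kValuesUnpacked = 32;  // illustrative block size

// Stand-in for the SIMD kernel: widens 8-bit packed values (num_bits == 8).
void unpack_full_block(const uint8_t* in, uint32_t* out) {
  for (int i = 0; i < kValuesUnpacked; ++i) out[i] = in[i];
}

// Hypothetical epilog: only usable when a full block of input remains
// readable before `in_end`; returns false when the caller must fall back.
bool unpack_epilog_via_scratch(const uint8_t* in, const uint8_t* in_end,
                               uint32_t* out, int batch_size) {
  // With 8-bit values, a full block consumes kValuesUnpacked input bytes.
  if (in_end - in < kValuesUnpacked) return false;
  uint32_t scratch[kValuesUnpacked];
  unpack_full_block(in, scratch);
  std::memcpy(out, scratch, batch_size * sizeof(uint32_t));
  return true;
}
```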
    switch (num_bits) {
      case 0:
-       return unpack_null(in, out, batch_size);
+       return unpack_width<0, Unpacker>(in, out, batch_size, bit_offset);
Ok, macros are not pretty, but we could have a macro here to minimize diffs when changing these function signatures :-)
Such as:
#define CASE_UNPACK_WIDTH(_width) \
  return unpack_width<_width, Unpacker>(in, out, batch_size, bit_offset)

if constexpr (std::is_same_v<UnpackedUint, bool>) {
  switch (num_bits) {
    case 0:
      CASE_UNPACK_WIDTH(0);
    // etc.

#undef CASE_UNPACK_WIDTH

As you prefer, though.
I don't mind if someone wants to change it, but I'm terrible at writing macros.
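For what it's worth, the suggested pattern compiles into a perfectly ordinary switch. The sketch below is self-contained, with `unpack_width_stub` and `dispatch` as toy stand-ins (not the PR's functions) so the macro hygiene (`#define` before use, `#undef` after) is visible end to end.

```cpp
#include <cassert>

// Toy stand-in for unpack_width<_width, Unpacker>(...): the real function
// unpacks batch_size values of the given bit width.
template <int kWidth>
int unpack_width_stub() { return kWidth; }

// One line per case, so a future signature change touches a single macro.
#define CASE_UNPACK_WIDTH(_width) \
  case _width:                    \
    return unpack_width_stub<_width>()

int dispatch(int num_bits) {
  switch (num_bits) {
    CASE_UNPACK_WIDTH(0);
    CASE_UNPACK_WIDTH(1);
    CASE_UNPACK_WIDTH(2);
    default:
      return -1;  // unsupported width
  }
}
#undef CASE_UNPACK_WIDTH
```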
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
0e0d650 to 53453c8
@github-actions crossbow submit -g cpp

Revision: 53453c8 Submitted crossbow builds: ursacomputing/crossbow @ actions-add562e242
pitrou left a comment
+1 from me. Really neat work @AntoinePrv !
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ff1f71d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 68 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change

unpack

What changes are included in this PR?

- unpack extracts exactly the required number of values -> change return type to void.
- unpack can handle non-aligned data -> include bit_offset in the input parameters.
- unpack tests.

Decoder benchmarks should remain the same (tested on Linux x86-64).
I have not benchmarked the unpack functions themselves, but I don't believe it's relevant since they now do more work.

Are these changes tested?
Yes
Are there any user-facing changes?
No
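To illustrate the contract the summary describes (void return, exact batch_size extraction, explicit bit_offset), here is a hedged scalar reference model; it is not the Arrow kernel, just an LSB-first bit reader matching Parquet's bit-packed layout, and the signature is an approximation of the one discussed above.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference model of the post-PR contract: extract exactly
// batch_size values of num_bits each, starting bit_offset bits into `in`,
// reading bits LSB-first within each byte (Parquet bit-packing order).
void unpack(const uint8_t* in, uint32_t* out, int batch_size, int num_bits,
            int bit_offset) {
  int64_t pos = bit_offset;
  for (int i = 0; i < batch_size; ++i) {
    uint32_t value = 0;
    for (int b = 0; b < num_bits; ++b, ++pos) {
      value |= static_cast<uint32_t>((in[pos / 8] >> (pos % 8)) & 1) << b;
    }
    out[i] = value;
  }
}
```

For example, the 3-bit values {5, 3} pack into the single byte 0b00011101 (29); unpacking with bit_offset 3 yields just the second value.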