GH-47895: [C++][Parquet] Add prolog and epilog in unpack #47896
pitrou merged 12 commits into apache:main
Conversation
@pitrou this is ready for review (waiting for CI to finish here). With this we could also investigate removing the

There's a sanitizer failure that needs fixing here: (I suppose it happens when length == 0...)

Perhaps that can even be done in this PR? It doesn't sound very complicated...
@ursabot please benchmark lang=C++ |
Benchmark runs are scheduled for commit 600696c. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit 600696c. There weren't enough matching historic benchmark results to make a call on whether there were regressions. The full Conbench report has more details.
@ursabot please benchmark lang=C++ |
Benchmark runs are scheduled for commit 287f136. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 287f136. There were 7 benchmark results indicating a performance regression:
The full Conbench report has more details.
pitrou left a comment
Some comments on the implementation. I haven't looked at the bpacking tests.
    const int spread = byte_end - byte_start + 1;
    max = spread > max ? spread : max;
    start += width;
  } while (start % 8 != bit_offset);
Note that this will be an infinite loop if bit_offset >= 8 (hence the DCHECK suggestion below)

Indeed, though that function is never used at runtime, only at compile time.
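To make the termination condition concrete, here is a hypothetical, self-contained reconstruction of the constexpr helper under review (names like `MaxByteSpread` are illustrative, not from the PR): it computes the maximum number of bytes any single packed value can span. Since `start % 8` is always in [0, 8), the loop can never exit when `bit_offset >= 8`, which is exactly what the suggested DCHECK would guard against.

```cpp
#include <cassert>

// Hypothetical reconstruction; the real function lives in the bpacking code.
constexpr int MaxByteSpread(int width, int bit_offset) {
  // Stand-in for the suggested ARROW_DCHECK_LT(bit_offset, 8):
  // the loop condition below compares start % 8 (always < 8) to bit_offset.
  assert(bit_offset >= 0 && bit_offset < 8);
  int max = 0;
  int start = bit_offset;
  do {
    const int byte_start = start / 8;
    const int byte_end = (start + width - 1) / 8;
    const int spread = byte_end - byte_start + 1;
    max = spread > max ? spread : max;
    start += width;
  } while (start % 8 != bit_offset);
  return max;
}
```

For example, 3-bit values starting at a byte boundary periodically straddle two bytes, while byte-aligned 8-bit values always fit in one.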
@pitrou removing the For reference: that

Ah, sorry! Let's just restore it then :)

I never pushed it, this was on a local benchmark.
    // Easy case to handle, simply setting memory to zero.
    return unpack_null(in, out, batch_size);
  } else {
    // In case of misalignment, we need to run the prolog until aligned.
As a TODO, if batch_size is large enough, we can perhaps rewind to the last byte-aligned pack and SIMD-unpack kValuesUnpacked into a local buffer, instead of going through unpack_exact.
(this seems lower-priority than SIMD shuffling, though)

Here my suspicion is that we rarely pass unaligned inputs in practice.

That's a reasonable assumption indeed. We could later trace through the test suite (add logs, for example?) and see whether that actually happens.
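As a side note on "run the prolog until aligned": the number of scalar prolog values depends only on the bit width and the starting offset. The helper below is purely illustrative (`PrologLength` is not a function in the PR); it counts how many values must be unpacked one at a time before the read cursor lands on a byte boundary, assuming the offset is reachable for that width (i.e. a multiple of gcd(width, 8)).

```cpp
#include <cassert>
#include <numeric>  // std::gcd

// Illustrative helper, not part of the PR: number of values to unpack
// scalar-wise before the bit cursor is byte-aligned again.
constexpr int PrologLength(int width, int bit_offset) {
  // Offsets produced by packing `width`-bit values are multiples of
  // gcd(width, 8); other offsets would never realign.
  assert(width > 0 && bit_offset % std::gcd(width, 8) == 0);
  int count = 0;
  int pos = bit_offset % 8;
  while (pos != 0) {
    pos = (pos + width) % 8;
    ++count;
  }
  return count;
}
```

For instance, starting 3 bits into a byte with 3-bit values, seven values are consumed before realignment (3 + 7 * 3 = 24 bits, a whole number of bytes).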
    ARROW_DCHECK_GE(batch_size, 0);
    ARROW_COMPILER_ASSUME(batch_size < kValuesUnpacked);
    ARROW_COMPILER_ASSUME(batch_size >= 0);
    unpack_exact<kPackedBitWidth, false>(in, out, batch_size, /* bit_offset= */ 0);
Similarly, if there's enough padding at the end of the input, we could SIMD-unpack a full kValuesUnpacked into a local buffer.

This one I thought about, but it would require passing another parameter to know where the input buffer ends (regardless of the batch_size). We could also pass where the output buffer ends and unpack there directly, to skip the local buffer entirely.
I think we should investigate this and only do it if there is a benefit. Though that must come after the (hoped-for) changes to the SIMD shuffles, because they change the size of the iterations.

Well, if we want to use SIMD shuffles maximally, we probably want to pass the input buffer end indeed.
But that can be done later.
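A minimal sketch of the idea being deferred here, under stated assumptions: all names (`unpack_epilog_via_scratch`, `unpack_full_block`) are hypothetical, and the "SIMD kernel" is replaced by a trivial scalar widening of 8-bit values so the sketch is self-contained. The point is only the control flow: if the caller supplies the input end, the epilog can unpack a full block into a scratch buffer and copy out just `batch_size` values instead of falling back to the scalar path.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>

constexpr int kValuesUnpacked = 32;  // illustrative block size

// Stand-in for the SIMD kernel: widens 8-bit packed values (num_bits == 8).
void unpack_full_block(const uint8_t* in, uint32_t* out) {
  for (int i = 0; i < kValuesUnpacked; ++i) out[i] = in[i];
}

// Hypothetical epilog: only usable when a full block of input remains
// readable before `in_end`; returns false when the caller must fall back.
bool unpack_epilog_via_scratch(const uint8_t* in, const uint8_t* in_end,
                               uint32_t* out, int batch_size) {
  // With 8-bit values, a full block consumes kValuesUnpacked input bytes.
  if (in_end - in < kValuesUnpacked) return false;
  uint32_t scratch[kValuesUnpacked];
  unpack_full_block(in, scratch);
  std::memcpy(out, scratch, batch_size * sizeof(uint32_t));
  return true;
}
```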
    switch (num_bits) {
      case 0:
-       return unpack_null(in, out, batch_size);
+       return unpack_width<0, Unpacker>(in, out, batch_size, bit_offset);
Ok, macros are not pretty, but we could have a macro here to minimize diffs when changing these function signatures :-)
Such as:
#define CASE_UNPACK_WIDTH(_width) \
  return unpack_width<_width, Unpacker>(in, out, batch_size, bit_offset)

if constexpr (std::is_same_v<UnpackedUint, bool>) {
  switch (num_bits) {
    case 0:
      CASE_UNPACK_WIDTH(0);
    // etc.

#undef CASE_UNPACK_WIDTH

As you prefer, though.
I don't mind if someone wants to change it, but I'm terrible at writing macros.
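For what it's worth, the suggested pattern compiles into a perfectly ordinary switch. The sketch below is self-contained, with `unpack_width_stub` and `dispatch` as toy stand-ins (not the PR's functions) so the macro hygiene (`#define` before use, `#undef` after) is visible end to end.

```cpp
#include <cassert>

// Toy stand-in for unpack_width<_width, Unpacker>(...): the real function
// unpacks batch_size values of the given bit width.
template <int kWidth>
int unpack_width_stub() { return kWidth; }

// One line per case, so a future signature change touches a single macro.
#define CASE_UNPACK_WIDTH(_width) \
  case _width:                    \
    return unpack_width_stub<_width>()

int dispatch(int num_bits) {
  switch (num_bits) {
    CASE_UNPACK_WIDTH(0);
    CASE_UNPACK_WIDTH(1);
    CASE_UNPACK_WIDTH(2);
    default:
      return -1;  // unsupported width
  }
}
#undef CASE_UNPACK_WIDTH
```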
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
0e0d650 to 53453c8
@github-actions crossbow submit -g cpp

Revision: 53453c8 Submitted crossbow builds: ursacomputing/crossbow @ actions-add562e242
pitrou left a comment
+1 from me. Really neat work @AntoinePrv !
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ff1f71d. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 68 possible false positives for unstable benchmarks that are known to sometimes produce them.
Rationale for this change

unpack

What changes are included in this PR?

- unpack extracts exactly the required number of values -> change return type to void.
- unpack can handle non-aligned data -> include bit_offset in the input parameters.
- unpack tests.

Decoder benchmarks should remain the same (tested on Linux x86-64).
I have not benchmarked the unpack functions themselves, but I don't believe it's relevant since they now do more work.

Are these changes tested?
Yes
Are there any user-facing changes?
No
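To illustrate the contract the summary describes (void return, exact batch_size extraction, explicit bit_offset), here is a hedged scalar reference model; it is not the Arrow kernel, just an LSB-first bit reader matching Parquet's bit-packed layout, and the signature is an approximation of the one discussed above.

```cpp
#include <cassert>
#include <cstdint>

// Scalar reference model of the post-PR contract: extract exactly
// batch_size values of num_bits each, starting bit_offset bits into `in`,
// reading bits LSB-first within each byte (Parquet bit-packing order).
void unpack(const uint8_t* in, uint32_t* out, int batch_size, int num_bits,
            int bit_offset) {
  int64_t pos = bit_offset;
  for (int i = 0; i < batch_size; ++i) {
    uint32_t value = 0;
    for (int b = 0; b < num_bits; ++b, ++pos) {
      value |= static_cast<uint32_t>((in[pos / 8] >> (pos % 8)) & 1) << b;
    }
    out[i] = value;
  }
}
```

For example, the 3-bit values {5, 3} pack into the single byte 0b00011101 (29); unpacking with bit_offset 3 yields just the second value.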