
GH-47895: [C++][Parquet] Add prolog and epilog in unpack#47896

Merged
pitrou merged 12 commits into apache:main from AntoinePrv:bpacking-epilogue
Oct 29, 2025

Conversation

@AntoinePrv
Contributor

@AntoinePrv AntoinePrv commented Oct 21, 2025

Rationale for this change

  • Simplify the use of unpack
  • Reduce code spread for unpacking integers

What changes are included in this PR?

  • epilog: unpack extracts exactly the required number of values -> the return type changes to void.
  • prolog: unpack can handle non-aligned data -> bit_offset is added to the input parameters.
  • Include prolog/epilog cases in unpack tests.
  • Simplify a roundtrip test from packed -> unpacked -> packed to unpacked -> packed -> unpacked

Decoder benchmarks should remain the same (tested on Linux x86-64).
I have not benchmarked the unpack functions themselves, but I don't believe that's relevant since they now do more work.
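To make the prolog/epilog contract concrete, here is a minimal scalar sketch; the name `UnpackScalar` and its exact signature are illustrative assumptions, not the actual Arrow internals. It accepts a `bit_offset` so unaligned input works (prolog), writes exactly `batch_size` values (epilog), and returns void.

```cpp
#include <cassert>
#include <cstdint>

// Illustrative sketch (not the Arrow API): read exactly batch_size values of
// `width` bits each, starting `bit_offset` bits into the first input byte.
void UnpackScalar(const uint8_t* in, uint32_t* out, int batch_size,
                  int bit_offset, int width) {
  int64_t bit_pos = bit_offset;
  for (int i = 0; i < batch_size; ++i) {
    uint64_t value = 0;
    // Assemble each value bit by bit, LSB first, crossing byte boundaries
    // as needed.
    for (int b = 0; b < width; ++b, ++bit_pos) {
      const uint64_t bit = (in[bit_pos / 8] >> (bit_pos % 8)) & 1;
      value |= bit << b;
    }
    out[i] = static_cast<uint32_t>(value);
  }
}
```

A fast implementation would unpack whole blocks with SIMD and fall back to a loop like this only for the unaligned head and the partial tail.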

Are these changes tested?

Yes

Are there any user-facing changes?

No

@github-actions

⚠️ GitHub issue #47895 has been automatically assigned in GitHub to PR creator.

@AntoinePrv AntoinePrv changed the title GH-47895: [C++][Parquet] Add prolog and eiplog in unpack GH-47895: [C++][Parquet] Add prolog and epilog in unpack Oct 21, 2025
@AntoinePrv
Contributor Author

@pitrou this is ready for review (waiting for CI to finish here).

With this we could also investigate removing the BitReader from the BitPackedRunDecoder, reducing the general complexity seen by the compilers (number of member variables, pointers and offsets bookkeeping...).

@pitrou
Member

pitrou commented Oct 21, 2025

There's a sanitizer failure that needs fixing here:
https://github.com/apache/arrow/actions/runs/18688262707/job/53286800716?pr=47896#step:7:8798

(I suppose it happens when length == 0...)

@pitrou
Member

pitrou commented Oct 21, 2025

With this we could also investigate removing the BitReader from the BitPackedRunDecoder, reducing the general complexity seen by the compilers (number of member variables, pointers and offsets bookkeeping...).

Perhaps that can even be done in this PR? It doesn't sound very complicated...

@pitrou
Member

pitrou commented Oct 21, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit 600696c. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 2 benchmarking runs that have been run so far on PR commit 600696c.

There weren't enough matching historic benchmark results to make a call on whether there were regressions.

The full Conbench report has more details.

@pitrou
Member

pitrou commented Oct 22, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit 287f136. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 287f136.

There were 7 benchmark results indicating a performance regression.

The full Conbench report has more details.


@pitrou pitrou left a comment


Some comments on the implementation. I haven't looked at the bpacking tests.

const int spread = byte_end - byte_start + 1;
max = spread > max ? spread : max;
start += width;
} while (start % 8 != bit_offset);
Member


Note that this will be an infinite loop if bit_offset >= 8 (hence the DCHECK suggestion below)

Contributor Author


Indeed, though that function is never used at runtime, only at compile time.

@github-actions github-actions bot added awaiting committer review Awaiting committer review and removed awaiting review Awaiting review labels Oct 22, 2025
@AntoinePrv
Contributor Author

@pitrou removing the MaxSpread constexpr logic did not perform well: up to -20% on decoding benchmarks.

For reference: that MaxSpread metric is central to the shuffle SIMD algorithm I'm working on:

  • If small: we can "pack" multiple values in the shuffle and reuse it with multiple rshifts
  • If very large: we have to do something radically different for packed values that spread over >8 bytes
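For readers following along, the metric can be sketched as a constexpr function; this is an illustrative reconstruction mirroring the quoted loop above, not the actual Arrow code. It computes the largest number of bytes a single `width`-bit packed value can span, over all bit offsets modulo 8.

```cpp
#include <cassert>

// Illustrative reconstruction (not the Arrow implementation): maximum byte
// spread of a width-bit value over all starting bit positions 0..7.
constexpr int MaxSpread(int width) {
  int max = 1;
  for (int start = 0; start < 8; ++start) {
    const int byte_start = start / 8;              // first byte touched
    const int byte_end = (start + width - 1) / 8;  // last byte touched
    const int spread = byte_end - byte_start + 1;
    max = spread > max ? spread : max;
  }
  return max;
}

// Usable entirely at compile time, consistent with the point that the
// function is never called at runtime:
static_assert(MaxSpread(1) == 1, "1-bit values never straddle bytes");
static_assert(MaxSpread(3) == 2, "a 3-bit value can straddle two bytes");
static_assert(MaxSpread(32) == 5, "an unaligned 32-bit value can touch five bytes");
```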

@pitrou
Member

pitrou commented Oct 23, 2025

Ah, sorry! Let's just restore it then :)

@AntoinePrv
Contributor Author

I never pushed it; this was from a local benchmark.

// Easy case to handle, simply setting memory to zero.
return unpack_null(in, out, batch_size);
} else {
// In case of misalignment, we need to run the prolog until aligned.
Member


As a TODO, if batch_size is large enough, we can perhaps rewind to the last byte-aligned packed value and SIMD-unpack kValuesUnpacked into a local buffer, instead of going through unpack_exact.

(this seems lower-priority than SIMD shuffling, though)

Contributor Author


My suspicion here is that we rarely pass unaligned inputs in practice.

Member


That's a reasonable assumption indeed. We could later trace through the test suite (add logs for example?) and see whether that actually happens.

ARROW_DCHECK_GE(batch_size, 0);
ARROW_COMPILER_ASSUME(batch_size < kValuesUnpacked);
ARROW_COMPILER_ASSUME(batch_size >= 0);
unpack_exact<kPackedBitWidth, false>(in, out, batch_size, /* bit_offset= */ 0);
Member


Similarly, if there's enough padding at the end of the input, we could SIMD-unpack a full kValuesUnpacked into a local buffer.

Contributor Author


This one I thought about, but it would require passing another parameter to indicate where the input buffer ends (regardless of batch_size). We could also pass where the output buffer ends and unpack directly into it, skipping the local buffer entirely.

I think we should investigate this and only do it if there is a measurable benefit. That said, it would have to come after the (hopeful) changes to the SIMD shuffles, since they change the size of the iterations.

Member


Well, if we want to use SIMD shuffles maximally, we probably want to pass the input buffer end indeed.

But that can be done later.
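The idea being deferred here can be sketched as follows; the names `kValuesUnpacked`, `UnpackTailViaScratch`, and the `unpack_block` callback are assumptions for illustration, not the Arrow API. When enough readable (padded) input remains, run the fast full-block unpacker into a scratch buffer, then copy out only the requested tail.

```cpp
#include <cstdint>
#include <cstring>

constexpr int kValuesUnpacked = 32;  // assumed full-block size

// Sketch: avoid a scalar epilog loop by unpacking a full block into scratch
// storage with the fast path, then copying only batch_size values out.
// Only safe when at least a full block of input bytes is readable.
void UnpackTailViaScratch(const uint8_t* in, uint32_t* out, int batch_size,
                          void (*unpack_block)(const uint8_t*, uint32_t*)) {
  uint32_t scratch[kValuesUnpacked];
  unpack_block(in, scratch);  // e.g. a SIMD kernel producing a full block
  std::memcpy(out, scratch, static_cast<size_t>(batch_size) * sizeof(uint32_t));
}
```

The alternative mentioned above, passing the output buffer end as well, would let the full-block unpacker write past `batch_size` directly into `out` and skip the scratch copy.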

switch (num_bits) {
case 0:
return unpack_null(in, out, batch_size);
return unpack_width<0, Unpacker>(in, out, batch_size, bit_offset);
Member


Ok, macros are not pretty, but we could have a macro here to minimize diffs when changing these function signatures :-)

Such as:

#define CASE_UNPACK_WIDTH(_width) \
        return unpack_width<_width, Unpacker>(in, out, batch_size, bit_offset)

  if constexpr (std::is_same_v<UnpackedUint, bool>) {
    switch (num_bits) {
      case 0:
        CASE_UNPACK_WIDTH(0);
  // etc.

#undef CASE_UNPACK_WIDTH

As you prefer, though.

Contributor Author


I don't mind if someone wants to change it, but I'm terrible at writing macros.

AntoinePrv and others added 3 commits October 29, 2025 10:26
@AntoinePrv AntoinePrv requested a review from pitrou October 29, 2025 09:28
@pitrou pitrou added the CI: Extra: C++ Run extra C++ CI label Oct 29, 2025
@pitrou
Member

pitrou commented Oct 29, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: 53453c8

Submitted crossbow builds: ursacomputing/crossbow @ actions-add562e242

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions


@pitrou pitrou left a comment


+1 from me. Really neat work @AntoinePrv !

@pitrou pitrou merged commit ff1f71d into apache:main Oct 29, 2025
54 of 58 checks passed
@pitrou pitrou removed the awaiting committer review Awaiting committer review label Oct 29, 2025
@AntoinePrv AntoinePrv deleted the bpacking-epilogue branch October 29, 2025 11:41
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit ff1f71d.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 68 possible false positives for unstable benchmarks that are known to sometimes produce them.
