GH-47981: [C++][Parquet] Add compatibility with non-compliant RLE stream #47992

Merged
pitrou merged 4 commits into apache:main from pitrou:gh47981-pq-impala-file
Nov 13, 2025

Conversation

@pitrou
Member

@pitrou pitrou commented Oct 29, 2025

Rationale for this change

The Parquet spec requires bit-packed runs in RLE/bit-packed streams to be padded to a multiple of 8 values, but some non-compliant encoders (such as Polars versions before pola-rs/polars#13883) might generate a truncated last bit-packed run that nevertheless contains enough logical values.
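
To make the padding rule concrete: a bit-packed run header stores the number of 8-value groups, so a compliant run of `groups` groups at bit width `w` occupies exactly `groups * w` bytes. The sketch below is not the Arrow implementation. `BitPackedRunInfo`, `InspectBitPackedRun` and `CanServe` are hypothetical names, used only to illustrate how a reader might count the values actually present in a truncated final run and accept it when that count covers what is still needed.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical summary of one bit-packed run (illustration only, not Arrow API).
struct BitPackedRunInfo {
  int64_t declared_values;   // groups * 8, as promised by the run header
  int64_t declared_bytes;    // groups * bit_width, the compliant payload size
  int64_t available_values;  // whole values actually present in the buffer
  bool truncated;            // true if the payload is shorter than declared
};

// Compare what the header declares with what the remaining bytes can hold.
inline BitPackedRunInfo InspectBitPackedRun(int64_t groups, int bit_width,
                                            int64_t bytes_left) {
  assert(groups > 0 && bit_width > 0 && bytes_left >= 0);
  const int64_t declared_values = groups * 8;
  // 8 values of bit_width bits each take exactly bit_width bytes.
  const int64_t declared_bytes = groups * bit_width;
  if (bytes_left >= declared_bytes) {
    return {declared_values, declared_bytes, declared_values, false};
  }
  // Truncated run: only count values that fit entirely in the remaining bytes.
  const int64_t available_values = (bytes_left * 8) / bit_width;
  return {declared_values, declared_bytes, available_values, true};
}

// A truncated run is still usable if it holds at least the values required.
inline bool CanServe(const BitPackedRunInfo& run, int64_t values_needed) {
  return values_needed <= run.available_values;
}
```

For example, with `groups = 2` and `bit_width = 3`, a compliant run carries 16 values in 6 bytes. If only 5 bytes remain, 13 whole values are present, so the run is usable as long as at most 13 values are still needed.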

What changes are included in this PR?

  1. Compatibility code for non-compliant RLE streams as described above
  2. Guard against zero-size dictionaries to avoid hitting an assertion in DictionaryConverter

Are these changes tested?

Yes, by additional unit tests.

Are there any user-facing changes?

No, except a bugfix.

@pitrou
Member Author

pitrou commented Oct 29, 2025

@AntoinePrv FYI

@pitrou pitrou force-pushed the gh47981-pq-impala-file branch from b3eadff to b362c3a Compare November 10, 2025 08:13
@pitrou pitrou marked this pull request as ready for review November 10, 2025 09:37
@pitrou pitrou requested a review from wgtmac November 10, 2025 09:38
@pitrou
Member Author

pitrou commented Nov 10, 2025

@AntoinePrv Are you available to review this?

@pitrou pitrou force-pushed the gh47981-pq-impala-file branch from fcb3c55 to f613124 Compare November 10, 2025 10:06
@pitrou
Member Author

pitrou commented Nov 10, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit f613124. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou pitrou force-pushed the gh47981-pq-impala-file branch from f613124 to ba8cd20 Compare November 10, 2025 13:16
@pitrou
Member Author

pitrou commented Nov 10, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit ba8cd20. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit f613124.

There were 9 benchmark results indicating a performance regression:

The full Conbench report has more details.

@github-actions github-actions bot added the awaiting committer review label and removed the awaiting review label Nov 10, 2025
@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit ba8cd20.

There were 6 benchmark results indicating a performance regression:

The full Conbench report has more details.

@pitrou pitrou force-pushed the gh47981-pq-impala-file branch from ba8cd20 to 959327f Compare November 12, 2025 09:01
@pitrou
Member Author

pitrou commented Nov 12, 2025

@ursabot please benchmark lang=C++

@voltrondatabot

Benchmark runs are scheduled for commit 959327f. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

@pitrou
Member Author

pitrou commented Nov 12, 2025

@github-actions crossbow submit -g cpp

@pitrou pitrou requested a review from adamreeve November 12, 2025 09:37
@github-actions

Revision: 959327f

Submitted crossbow builds: ursacomputing/crossbow @ actions-f55022a5e6

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@conbench-apache-arrow

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 959327f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

Member

@wgtmac wgtmac left a comment

LGTM. I just have a minor question.

  // Bit-packed run
  constexpr auto kMaxCount = bit_util::CeilDiv(internal::max_size_for_v<rle_size_t>, 8);
- if (ARROW_PREDICT_FALSE(count == 0 || count > kMaxCount)) {
+ if (ARROW_PREDICT_FALSE(count == 0 || count >= kMaxCount)) {
Member

Why do we need this change?

Member Author

Because kMaxCount uses CeilDiv and is therefore one larger than the actual max.

That said, this is a bit confusing so I could try to change to kMaxCount = internal::max_size_for_v<rle_size_t> / 8.
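
The off-by-one can be checked with a little arithmetic. The sketch below assumes a 32-bit signed size type as a stand-in for `rle_size_t` (an assumption, not confirmed by this thread) and uses a local `CeilDiv` rather than Arrow's `bit_util::CeilDiv`. All names in it are illustrative. It shows that a ceiling-based bound with `>=` and a floor-based bound with `>` reject the same counts.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Assumption: stand-in for rle_size_t; the real type may differ.
using size_type = int32_t;

constexpr int64_t kMaxSize = std::numeric_limits<size_type>::max();

// Local helper for illustration; not Arrow's bit_util::CeilDiv.
constexpr int64_t CeilDiv(int64_t a, int64_t b) { return (a + b - 1) / b; }

constexpr int64_t kCeilCount = CeilDiv(kMaxSize, 8);  // one past the real max
constexpr int64_t kFloorCount = kMaxSize / 8;         // the real max

static_assert(kCeilCount == kFloorCount + 1, "CeilDiv overshoots by one");
static_assert(kCeilCount * 8 > kMaxSize, "ceil bound would overflow size_type");
static_assert(kFloorCount * 8 <= kMaxSize, "floor bound always fits");

// The check as merged in the diff above: ceiling bound, rejected with >=.
constexpr bool RejectWithCeil(int64_t count) {
  return count == 0 || count >= kCeilCount;
}

// The alternative floated in the reply: floor bound, rejected with >.
constexpr bool RejectWithFloor(int64_t count) {
  return count == 0 || count > kFloorCount;
}

// Number of values in an accepted run; safe to narrow once the check passed.
inline size_type ValuesInRun(int64_t count) {
  assert(!RejectWithFloor(count));
  return static_cast<size_type>(count * 8);
}
```

Under that 32-bit assumption, the maximum size is 2147483647, so the ceiling bound is 268435456 groups (2147483648 values, one more than fits) while the floor bound is 268435455 groups (2147483640 values).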

@pitrou pitrou force-pushed the gh47981-pq-impala-file branch from 959327f to 5fe9995 Compare November 13, 2025 08:47
@pitrou
Member Author

pitrou commented Nov 13, 2025

@github-actions crossbow submit -g cpp

@github-actions

Revision: 5fe9995

Submitted crossbow builds: ursacomputing/crossbow @ actions-926cb57385

Task Status
example-cpp-minimal-build-static GitHub Actions
example-cpp-minimal-build-static-system-dependency GitHub Actions
example-cpp-tutorial GitHub Actions
test-build-cpp-fuzz GitHub Actions
test-conda-cpp GitHub Actions
test-conda-cpp-valgrind GitHub Actions
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-cpp-amd64 GitHub Actions
test-debian-12-cpp-i386 GitHub Actions
test-fedora-42-cpp GitHub Actions
test-ubuntu-22.04-cpp GitHub Actions
test-ubuntu-22.04-cpp-20 GitHub Actions
test-ubuntu-22.04-cpp-bundled GitHub Actions
test-ubuntu-22.04-cpp-emscripten GitHub Actions
test-ubuntu-22.04-cpp-no-threading GitHub Actions
test-ubuntu-24.04-cpp GitHub Actions
test-ubuntu-24.04-cpp-bundled-offline GitHub Actions
test-ubuntu-24.04-cpp-gcc-13-bundled GitHub Actions
test-ubuntu-24.04-cpp-gcc-14 GitHub Actions
test-ubuntu-24.04-cpp-minimal-with-formats GitHub Actions
test-ubuntu-24.04-cpp-thread-sanitizer GitHub Actions

@pitrou pitrou merged commit 0c20874 into apache:main Nov 13, 2025
44 of 47 checks passed
@pitrou pitrou removed the awaiting committer review label Nov 13, 2025
@pitrou pitrou deleted the gh47981-pq-impala-file branch November 13, 2025 10:24
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0c20874.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.
