GH-47981: [C++][Parquet] Add compatibility with non-compliant RLE stream by pitrou · Pull Request #47992 · apache/arrow

pitrou · 2025-10-29T09:31:30Z

Rationale for this change

The Parquet spec requires bit-packed runs in RLE/bit-packed hybrid streams to be padded to a multiple of 8 values. Some non-compliant encoders (such as Polars versions before pola-rs/polars#13883) may nevertheless emit a truncated final bit-packed run that still contains enough logical values.
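To make the failure mode concrete, here is a minimal sketch of how a decoder might size a bit-packed run and tolerate a truncated final one. In the hybrid encoding, a bit-packed run header declares a number of 8-value groups, so a compliant run occupies `num_groups * bit_width` bytes. `PlanBitPackedRun` and `BitPackedRun` are hypothetical names used only for illustration, not Arrow's actual API, and the tolerance rule shown (accept the shorter run only if it still holds every value the caller needs) is an assumption about the approach, not a copy of the patch.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical result type: what the decoder may safely read from one run.
struct BitPackedRun {
  std::int64_t num_values;  // values that can be decoded from this run
  std::int64_t num_bytes;   // bytes this run occupies in the buffer
};

// Hypothetical helper (not part of Arrow). Inputs are assumed small enough
// that the multiplications below cannot overflow int64_t.
std::optional<BitPackedRun> PlanBitPackedRun(std::int64_t num_groups, int bit_width,
                                             std::int64_t bytes_available,
                                             std::int64_t values_needed) {
  if (num_groups <= 0 || bit_width <= 0 || bit_width > 64 || bytes_available < 0) {
    return std::nullopt;
  }
  // Each group holds 8 values of bit_width bits, i.e. exactly bit_width bytes.
  const std::int64_t declared_values = num_groups * 8;
  const std::int64_t declared_bytes = num_groups * bit_width;
  if (declared_bytes <= bytes_available) {
    // Compliant run: the padded groups are fully present.
    return BitPackedRun{declared_values, declared_bytes};
  }
  // Truncated run: count only the values that are entirely present.
  const std::int64_t values_present = bytes_available * 8 / bit_width;
  if (values_present >= values_needed) {
    return BitPackedRun{values_present, bytes_available};
  }
  return std::nullopt;  // genuinely short stream: report an error upstream
}
```

With a bit width of 3, a run declaring 2 groups should occupy 6 bytes. If only 4 bytes remain, 10 whole values are still present, so a caller needing 10 values can proceed while a caller needing 11 cannot.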

What changes are included in this PR?

- Compatibility code for non-compliant RLE streams as described above
- Guard against zero-size dictionaries to avoid hitting an assertion in DictionaryConverter

Are these changes tested?

Yes, by additional unit tests.

Are there any user-facing changes?

No, except a bugfix.

GitHub Issue: [C++][Python][Parquet] pyarrow.lib.ArrowInvalid: Invalid number of indices: 0 when reading a parquet file #47981

pitrou · 2025-10-29T09:39:42Z

@AntoinePrv FYI

pitrou · 2025-11-10T09:38:18Z

@AntoinePrv Are you available to review this?

pitrou · 2025-11-10T10:11:52Z

@ursabot please benchmark lang=C++

voltrondatabot · 2025-11-10T10:12:00Z

Benchmark runs are scheduled for commit f613124. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2025-11-10T13:36:11Z

@ursabot please benchmark lang=C++

voltrondatabot · 2025-11-10T13:36:18Z

Benchmark runs are scheduled for commit ba8cd20. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

conbench-apache-arrow · 2025-11-10T15:52:11Z

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit f613124.

There were 9 benchmark results indicating a performance regression:

Pull Request Run on amd64-c6a-4xlarge-linux at 2025-11-10 11:21:29Z
- GrouperWithMultiTypes (C++) with params="{fixed_size_binary(32)}"/4096/1, source=cpp-micro, suite=arrow-compute-grouper-benchmark
- GrouperWithMultiTypes (C++) with params="{fixed_size_binary(32)}"/1024/1, source=cpp-micro, suite=arrow-compute-grouper-benchmark
and 7 more (see the report linked below)

The full Conbench report has more details.

conbench-apache-arrow · 2025-11-10T19:27:22Z

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit ba8cd20.

There were 6 benchmark results indicating a performance regression:

Pull Request Run on amd64-c6a-4xlarge-linux at 2025-11-10 14:44:52Z
- GrouperWithMultiTypes (C++) with params="{fixed_size_binary(32)}"/4096/1, source=cpp-micro, suite=arrow-compute-grouper-benchmark
- GrouperWithMultiTypes (C++) with params="{fixed_size_binary(32)}"/1024/1, source=cpp-micro, suite=arrow-compute-grouper-benchmark
and 4 more (see the report linked below)

The full Conbench report has more details.

pitrou · 2025-11-12T09:15:38Z

@ursabot please benchmark lang=C++

voltrondatabot · 2025-11-12T09:15:44Z

Benchmark runs are scheduled for commit 959327f. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete.

pitrou · 2025-11-12T09:36:55Z

@github-actions crossbow submit -g cpp

github-actions · 2025-11-12T09:39:38Z

Revision: 959327f

Submitted crossbow builds: ursacomputing/crossbow @ actions-f55022a5e6

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-42-cpp
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

conbench-apache-arrow · 2025-11-12T11:08:46Z

Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 959327f.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details.

wgtmac

LGTM. I just have a minor question.

wgtmac · 2025-11-13T03:30:50Z

cpp/src/arrow/util/rle_encoding_internal.h

    // Bit-packed run
    constexpr auto kMaxCount = bit_util::CeilDiv(internal::max_size_for_v<rle_size_t>, 8);
-    if (ARROW_PREDICT_FALSE(count == 0 || count > kMaxCount)) {
+    if (ARROW_PREDICT_FALSE(count == 0 || count >= kMaxCount)) {


Why do we need this change?

pitrou · 2025-11-13

Because kMaxCount uses CeilDiv and is therefore one larger than the actual max.

That said, this is a bit confusing, so I could try changing it to kMaxCount = internal::max_size_for_v<rle_size_t> / 8.
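The off-by-one can be checked with a few compile-time assertions. This is a sketch under stated assumptions: `rle_size_t` is taken to be a 32-bit signed integer, and `CeilDiv`, `kMaxSize`, and the three predicate functions below are local stand-ins written for this example, not the Arrow definitions.

```cpp
#include <cstdint>
#include <limits>

using rle_size_t = std::int32_t;  // assumption: 32-bit signed size type
constexpr std::int64_t kMaxSize = std::numeric_limits<rle_size_t>::max();

// Local stand-in for a ceiling division helper.
constexpr std::int64_t CeilDiv(std::int64_t value, std::int64_t divisor) {
  return (value + divisor - 1) / divisor;
}

constexpr std::int64_t kCeilMaxCount = CeilDiv(kMaxSize, 8);  // 2^28
constexpr std::int64_t kFloorMaxCount = kMaxSize / 8;         // 2^28 - 1

// The ceiling is one past the largest group count whose 8x still fits.
static_assert(kCeilMaxCount == kFloorMaxCount + 1);
static_assert(kFloorMaxCount * 8 <= kMaxSize);
static_assert(kCeilMaxCount * 8 > kMaxSize);

// Original check: `>` against the ceiling lets the overflowing count through.
constexpr bool OldCheckRejects(std::int64_t count) {
  return count == 0 || count > kCeilMaxCount;
}
// Check from the diff: `>=` against the ceiling.
constexpr bool CeilCheckRejects(std::int64_t count) {
  return count == 0 || count >= kCeilMaxCount;
}
// Proposed simplification: `>` against the floor, same behavior.
constexpr bool FloorCheckRejects(std::int64_t count) {
  return count == 0 || count > kFloorMaxCount;
}

static_assert(!OldCheckRejects(kCeilMaxCount));
static_assert(CeilCheckRejects(kCeilMaxCount) && FloorCheckRejects(kCeilMaxCount));
static_assert(!CeilCheckRejects(kFloorMaxCount) && !FloorCheckRejects(kFloorMaxCount));
```

Under these assumptions, `count >= CeilDiv(max, 8)` and `count > max / 8` reject the same counts, so the floor-division form expresses the limit more directly.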

pitrou · 2025-11-13T08:47:29Z

@github-actions crossbow submit -g cpp

github-actions · 2025-11-13T08:50:09Z

Revision: 5fe9995

Submitted crossbow builds: ursacomputing/crossbow @ actions-926cb57385

Task	Status
example-cpp-minimal-build-static
example-cpp-minimal-build-static-system-dependency
example-cpp-tutorial
test-build-cpp-fuzz
test-conda-cpp
test-conda-cpp-valgrind
test-cuda-cpp-ubuntu-22.04-cuda-11.7.1
test-debian-12-cpp-amd64
test-debian-12-cpp-i386
test-fedora-42-cpp
test-ubuntu-22.04-cpp
test-ubuntu-22.04-cpp-20
test-ubuntu-22.04-cpp-bundled
test-ubuntu-22.04-cpp-emscripten
test-ubuntu-22.04-cpp-no-threading
test-ubuntu-24.04-cpp
test-ubuntu-24.04-cpp-bundled-offline
test-ubuntu-24.04-cpp-gcc-13-bundled
test-ubuntu-24.04-cpp-gcc-14
test-ubuntu-24.04-cpp-minimal-with-formats
test-ubuntu-24.04-cpp-thread-sanitizer

conbench-apache-arrow · 2025-11-13T17:27:22Z

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0c20874.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them.

github-actions bot added the Component: C++ and awaiting review labels Oct 29, 2025

pitrou force-pushed the gh47981-pq-impala-file branch from b3eadff to b362c3a on November 10, 2025 08:13

pitrou marked this pull request as ready for review November 10, 2025 09:37

pitrou requested a review from wgtmac November 10, 2025 09:38

pitrou force-pushed the gh47981-pq-impala-file branch from fcb3c55 to f613124 on November 10, 2025 10:06

pitrou force-pushed the gh47981-pq-impala-file branch from f613124 to ba8cd20 on November 10, 2025 13:16

pitrou added the CI: Extra: C++ and backport-candidate labels Nov 10, 2025

AntoinePrv reviewed Nov 10, 2025

cpp/src/arrow/util/rle_encoding_internal.h (resolved)

github-actions bot added the awaiting committer review label and removed the awaiting review label Nov 10, 2025

pitrou force-pushed the gh47981-pq-impala-file branch from ba8cd20 to 959327f on November 12, 2025 09:01

pitrou requested a review from adamreeve November 12, 2025 09:37

adamreeve approved these changes Nov 12, 2025

wgtmac approved these changes Nov 13, 2025

GH-47981: [C++][Parquet] Add compatibility with non-compliant RLE stream

cf30d28

pitrou added 3 commits November 13, 2025 09:38

Add unit tests

5d3abb0

Add test + re-add explicit error check

d03c7eb

Simplify kMaxCount

5fe9995

pitrou force-pushed the gh47981-pq-impala-file branch from 959327f to 5fe9995 on November 13, 2025 08:47

pitrou merged commit 0c20874 into apache:main Nov 13, 2025
44 of 47 checks passed

pitrou removed the awaiting committer review label Nov 13, 2025

pitrou mentioned this pull request Nov 13, 2025

[C++][Python][Parquet] pyarrow.lib.ArrowInvalid: Invalid number of indices: 0 when reading a parquet file #47981

Closed

pitrou deleted the gh47981-pq-impala-file branch November 13, 2025 10:24

thisisnic mentioned this pull request Nov 13, 2025

[R][Parquet] Error: Invalid: Invalid number of indices: 0 with read_parquet #48066

Closed

LucaMarconato mentioned this pull request Dec 17, 2025

PyArrow OSError: Unexpected end of stream scverse/spatialdata-io#334

Open

raulcd removed the backport-candidate label Jan 29, 2026

6 participants
