GH-47981: [C++][Parquet] Add compatibility with non-compliant RLE stream#47992
GH-47981: [C++][Parquet] Add compatibility with non-compliant RLE stream#47992pitrou merged 4 commits intoapache:mainfrom
Conversation
|
@AntoinePrv FYI |
b3eadff to
b362c3a
Compare
|
@AntoinePrv Are you available to review this? |
fcb3c55 to
f613124
Compare
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit f613124. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
f613124 to
ba8cd20
Compare
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit ba8cd20. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
|
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit f613124. There were 9 benchmark results indicating a performance regression:
The full Conbench report has more details. |
|
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit ba8cd20. There were 6 benchmark results indicating a performance regression:
The full Conbench report has more details. |
ba8cd20 to
959327f
Compare
|
@ursabot please benchmark lang=C++ |
|
Benchmark runs are scheduled for commit 959327f. Watch https://buildkite.com/apache-arrow and https://conbench.ursa.dev for updates. A comment will be posted here when the runs are complete. |
|
@github-actions crossbow submit -g cpp |
|
Revision: 959327f Submitted crossbow builds: ursacomputing/crossbow @ actions-f55022a5e6 |
|
Thanks for your patience. Conbench analyzed the 4 benchmarking runs that have been run so far on PR commit 959327f. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. |
wgtmac
left a comment
There was a problem hiding this comment.
LGTM. I just have a minor question.
| // Bit-packed run | ||
| constexpr auto kMaxCount = bit_util::CeilDiv(internal::max_size_for_v<rle_size_t>, 8); | ||
| if (ARROW_PREDICT_FALSE(count == 0 || count > kMaxCount)) { | ||
| if (ARROW_PREDICT_FALSE(count == 0 || count >= kMaxCount)) { |
There was a problem hiding this comment.
Because kMaxCount uses CeilDiv and is therefore one large than the actual max.
That said, this is a bit confusing so I could try to change to kMaxCount = internal::max_size_for_v<rle_size_t> / 8.
959327f to
5fe9995
Compare
|
@github-actions crossbow submit -g cpp |
|
Revision: 5fe9995 Submitted crossbow builds: ursacomputing/crossbow @ actions-926cb57385 |
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0c20874. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 7 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
RLE-bit-packed streams are required by the Parquet spec to have 8-padded bit-packed runs, but some non-compliant encoders (such as Polars versions before pola-rs/polars#13883) might generate a truncated last bit-packed run, which nevertheless contains enough logical values.
What changes are included in this PR?
DictionaryConverterAre these changes tested?
Yes, by additional unit tests.
Are there any user-facing changes?
No, except a bugfix.
pyarrow.lib.ArrowInvalid: Invalid number of indices: 0when reading a parquet file #47981