GH-47103: [Statistics][C++] Implement Statistics specification attribute ARROW:null_count:approximate #47969
Conversation
|
|
|
@kou |
|
Yes. I'll do it. |
There was a problem hiding this comment.
Pull Request Overview
This PR adds support for approximate null counts in Arrow array statistics by extending the null_count field to support both exact (int64_t) and approximate (double) values using std::variant. This aligns with the existing pattern used for distinct_count.
Key changes:
- Changed
null_countfromstd::optional<int64_t>tostd::optional<CountType>(variant of int64_t and double) - Added
is_null_count_exactproperty to distinguish between exact and approximate null counts - Updated all related tests and comparison logic to handle the variant type
Reviewed Changes
Copilot reviewed 10 out of 10 changed files in this pull request and generated 1 comment.
Show a summary per file
| File | Description |
|---|---|
| cpp/src/arrow/array/statistics.h | Changed null_count type from int64_t to CountType variant |
| cpp/src/arrow/record_batch.cc | Added logic to handle both exact and approximate null counts when creating statistics arrays |
| cpp/src/arrow/compare.cc | Updated equality comparison to use ArrayStatisticsOptionalValueEquals for null_count |
| cpp/src/arrow/record_batch_test.cc | Renamed test and added new test for approximate null count |
| cpp/src/arrow/array/statistics_test.cc | Added test for approximate null count and updated existing tests |
| cpp/src/arrow/array/array_test.cc | Updated variable type and test assertions to handle variant null_count |
| cpp/src/parquet/arrow/arrow_statistics_test.cc | Updated assertions to extract int64_t from variant null_count |
| python/pyarrow/includes/libarrow.pxd | Changed null_count type from int64_t to CArrayStatisticsCountType |
| python/pyarrow/array.pxi | Added is_null_count_exact property and updated null_count to handle variant |
| python/pyarrow/tests/parquet/test_parquet_file.py | Added assertion to verify is_null_count_exact is True |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@kou, can you check this out? |
3ea2e5b to
20d376f
Compare
remover logger header correct is_null_count_exact comment
|
Oh, I forgot to add Ruby related changes. I'll add them. |
20d376f to
3983dcb
Compare
|
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit c10847c. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 14 possible false positives for unstable benchmarks that are known to sometimes produce them. |
Rationale for this change
Enable
ARROW:null_count:approximatesupport forarrow::ArrayStatistics, along with the corresponding GLib, Ruby and Python bindings.What changes are included in this PR?
Enable
ARROW:null_count:approximatein C++ and bind it toArrayStatisticsin GLib, Ruby and Python.Are these changes tested?
Yes, I ran the relevant unit tests.
Are there any user-facing changes?
Yes.
The type of
arrow::ArrayStatistics::null_counthas been changed fromstd::optional<int64_t>tostd::optional<CountType>New
garrow_array_statistics_is_null_count_exact()/garrow_array_statistics_get_null_count_{exact,approximate}()functions in GLib.Add support for approximate value in
Arrow::ArrayStatistics#null_countin Ruby.A new field
is_null_count_exacthas been added toArrayStatisticsin Python.GitHub Issue: [Statistics][C++] Implement Statistics specification attribute ARROW:null_count:approximate #47103