GH-45457: [Python] Add pyarrow.ArrayStatistics #45550

Merged
kou merged 10 commits into apache:main from kou:python-array-statistics
Feb 25, 2025

Conversation

@kou
Member

@kou kou commented Feb 17, 2025

Rationale for this change

Apache Arrow C++ can attach statistics read from Apache Parquet data to an arrow::Array. With Python bindings for this feature, Python users can use the attached statistics too.

What changes are included in this PR?

  • Add pyarrow.ArrayStatistics
  • Add pyarrow.Array.statistics().
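
As a rough illustration of how the two additions might be used together, here is a minimal sketch. It assumes a pyarrow build that includes this PR. The helper names (`get_statistics`, `summarize_statistics`, `read_column_statistics`) are hypothetical and not part of pyarrow. The `min`, `is_min_exact`, `max`, and `is_max_exact` attributes appear in this PR's tests; `null_count` and `distinct_count` are assumed to mirror the C++ `arrow::ArrayStatistics` fields and are therefore read defensively. Because the description writes `statistics()`, the sketch accepts both a method and a property.

```python
def get_statistics(array):
    """Return the statistics attached to an array, or None if there are none.

    Assumption: the accessor is named `statistics`. It is handled here
    whether it is exposed as a method or as a property.
    """
    stats = getattr(array, "statistics", None)
    if callable(stats):
        stats = stats()
    return stats


def summarize_statistics(stats):
    """Copy the statistics fields into a plain dict (hypothetical helper)."""
    if stats is None:
        return None
    return {
        # These four attributes are exercised by the tests in this PR.
        "min": stats.min,
        "is_min_exact": stats.is_min_exact,
        "max": stats.max,
        "is_max_exact": stats.is_max_exact,
        # Assumed to mirror arrow::ArrayStatistics; read defensively.
        "null_count": getattr(stats, "null_count", None),
        "distinct_count": getattr(stats, "distinct_count", None),
    }


def read_column_statistics(path, column):
    """Read a Parquet file and summarize the first chunk's statistics."""
    # Imported lazily so the helpers above work without pyarrow installed.
    import pyarrow.parquet as pq

    table = pq.read_table(path)
    # Statistics are attached per array, so look at one chunk of the column.
    chunk = table[column].chunk(0)
    return summarize_statistics(get_statistics(chunk))
```

Calling `read_column_statistics(path, "value")` on a file written with `pyarrow.parquet.write_table` would return a dict of the attached statistics, or `None` when the reader attached none.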

Are these changes tested?

Yes.

Are there any user-facing changes?

Yes.

@kou
Member Author

kou commented Feb 17, 2025

@github-actions crossbow submit -g python

@github-actions

⚠️ GitHub issue #45457 has been automatically assigned in GitHub to PR creator.

@kou kou requested review from jorisvandenbossche and pitrou and removed request for pitrou February 18, 2025 00:45
@kou
Member Author

kou commented Feb 20, 2025

@pitrou @jorisvandenbossche Could you take a look at this?

@kou
Member Author

kou commented Feb 24, 2025

I'll merge this in a few days if nobody objects.

Member

@pitrou pitrou left a comment

Thanks a lot @kou ! Some minor comments below, but LGTM in general.

Member

For the record, I've opened a Cython feature request to make this more automatic.

Member Author

Thanks! I've added a comment that refers to the issue.

Member

uint64_t isn't handled below. Should the docstring or the code be fixed?

Member Author

Oh... The code was wrong... I've added the uint64_t case.

Member

except * means it could raise Python exceptions, but it doesn't here, so perhaps you can remove that annotation (though it's not really a problem either).

Member Author

Thanks! I didn't know much about except in Cython...

kou and others added 4 commits February 25, 2025 14:15
These are the bindings for `arrow::ArrayStatistics`. You can get them via
`pyarrow.Array.statistics()`.
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
Co-authored-by: Antoine Pitrou <pitrou@free.fr>
@kou
Member Author

kou commented Feb 25, 2025

@github-actions crossbow submit -g python

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels Feb 25, 2025
@kou
Member Author

kou commented Feb 25, 2025

@github-actions crossbow submit -g python

@github-actions github-actions bot added awaiting change review and removed awaiting changes labels Feb 25, 2025
assert statistics.min == -1
assert statistics.is_min_exact
assert statistics.max == 3
assert statistics.is_max_exact
Member

Can we have a test for repr(statistics) to make sure that the string representation works?

Member Author

It's a good idea. I've added it.

@kou
Member Author

kou commented Feb 25, 2025

@github-actions crossbow submit -g python

@github-actions github-actions bot added awaiting changes and removed awaiting change review labels Feb 25, 2025
@github-actions

Revision: e3a20b5

Submitted crossbow builds: ursacomputing/crossbow @ actions-747dbaddf2

Task Status
example-python-minimal-build-fedora-conda GitHub Actions
example-python-minimal-build-ubuntu-venv GitHub Actions
test-conda-python-3.10 GitHub Actions
test-conda-python-3.10-hdfs-2.9.2 GitHub Actions
test-conda-python-3.10-hdfs-3.2.1 GitHub Actions
test-conda-python-3.10-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11 GitHub Actions
test-conda-python-3.11-dask-latest GitHub Actions
test-conda-python-3.11-dask-upstream_devel GitHub Actions
test-conda-python-3.11-hypothesis GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-1.26 GitHub Actions
test-conda-python-3.11-pandas-latest-numpy-latest GitHub Actions
test-conda-python-3.11-pandas-nightly-numpy-nightly GitHub Actions
test-conda-python-3.11-pandas-upstream_devel-numpy-nightly GitHub Actions
test-conda-python-3.11-spark-master GitHub Actions
test-conda-python-3.12 GitHub Actions
test-conda-python-3.12-cpython-debug GitHub Actions
test-conda-python-3.13 GitHub Actions
test-conda-python-3.9 GitHub Actions
test-conda-python-3.9-pandas-1.1.3-numpy-1.19.5 GitHub Actions
test-conda-python-emscripten GitHub Actions
test-cuda-python-ubuntu-22.04-cuda-11.7.1 GitHub Actions
test-debian-12-python-3-amd64 GitHub Actions
test-debian-12-python-3-i386 GitHub Actions
test-fedora-39-python-3 GitHub Actions
test-ubuntu-22.04-python-3 GitHub Actions
test-ubuntu-22.04-python-313-freethreading GitHub Actions
test-ubuntu-24.04-python-3 GitHub Actions

@kou kou merged commit 631fa0a into apache:main Feb 25, 2025
15 checks passed
@kou kou removed the awaiting changes label Feb 25, 2025
@kou kou deleted the python-array-statistics branch February 25, 2025 13:25
@conbench-apache-arrow

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 631fa0a.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 12 possible false positives for unstable benchmarks that are known to sometimes produce them.
