Commit f80b919
[SPARK-54849][PYTHON] Upgrade the minimum version of pyarrow to 18.0.0
### What changes were proposed in this pull request?

Upgrade the minimum version of `pyarrow` to 18.0.0.

### Why are the changes needed?

1. pyarrow 18.0.0 was released on Oct 28, 2024.
2. There is a security issue, [PYSEC-2024-161](https://osv.dev/vulnerability/PYSEC-2024-161), in pyarrow; the affected versions are 4.0.0 ~ 16.1.0, and upgrading to 17+ is recommended.
3. Since 18.0.0, pyarrow no longer depends on numpy, which makes the dependencies simpler to resolve.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

PR builder with

```
default: '{"PYSPARK_IMAGE_TO_TEST": "python-minimum", "PYTHON_TO_TEST": "python3.10"}'
```

https://github.com/zhengruifeng/spark/runs/59127283398

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #53619 from zhengruifeng/upgrade_arrow_18.

Authored-by: Ruifeng Zheng <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
1 parent 4b77986 commit f80b919
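The diffs below raise the floor in the requirements file, the Docker test images, and the docs, and bump the `LooseVersion` doctest gates. As a minimal sketch of how such a floor can be enforced at runtime, using the same `LooseVersion` helper the `_test()` hunks below import (the constant name and error message here are illustrative, not PySpark's actual code):

```python
import pyarrow as pa
from pyspark.loose_version import LooseVersion

MIN_PYARROW = "18.0.0"  # the new minimum introduced by this PR

# Fail fast if the installed pyarrow predates the supported floor.
if LooseVersion(pa.__version__) < LooseVersion(MIN_PYARROW):
    raise ImportError(
        f"pyarrow>={MIN_PYARROW} is required, but {pa.__version__} is installed"
    )
```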

File tree

6 files changed: +11 −11 lines changed

dev/requirements.txt

Lines changed: 1 addition & 1 deletion

@@ -3,7 +3,7 @@ py4j>=0.10.9.9
 
 # PySpark dependencies (optional)
 numpy>=1.22
-pyarrow>=15.0.0
+pyarrow>=18.0.0
 six==1.16.0
 pandas>=2.2.0
 scipy
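To confirm that an environment already satisfies the bumped constraint, one option is the `packaging` library that pip itself builds on (a standalone sketch, not part of this PR):

```python
from importlib.metadata import version
from packaging.specifiers import SpecifierSet

spec = SpecifierSet(">=18.0.0")  # the new constraint from requirements.txt
installed = version("pyarrow")   # raises PackageNotFoundError if absent
print(installed, "satisfies" if installed in spec else "violates", spec)
```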

dev/spark-test-image/python-minimum/Dockerfile

Lines changed: 2 additions & 2 deletions

@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For PySpark wi
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20250703
+ENV FULL_REFRESH_DATE=20251225
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -62,7 +62,7 @@ RUN apt-get update && apt-get install -y \
     wget \
     zlib1g-dev
 
-ARG BASIC_PIP_PKGS="numpy==1.22.4 pyarrow==15.0.0 pandas==2.2.0 six==1.16.0 scipy scikit-learn coverage unittest-xml-reporting"
+ARG BASIC_PIP_PKGS="numpy==1.22.4 pyarrow==18.0.0 pandas==2.2.0 six==1.16.0 scipy scikit-learn coverage unittest-xml-reporting"
 # Python deps for Spark Connect
 ARG CONNECT_PIP_PKGS="grpcio==1.76.0 grpcio-status==1.76.0 googleapis-common-protos==1.71.0 zstandard==0.25.0 graphviz==0.20 protobuf"

dev/spark-test-image/python-ps-minimum/Dockerfile

Lines changed: 2 additions & 2 deletions

@@ -24,7 +24,7 @@ LABEL org.opencontainers.image.ref.name="Apache Spark Infra Image For Pandas API
 # Overwrite this label to avoid exposing the underlying Ubuntu OS version label
 LABEL org.opencontainers.image.version=""
 
-ENV FULL_REFRESH_DATE=20250708
+ENV FULL_REFRESH_DATE=20251225
 
 ENV DEBIAN_FRONTEND=noninteractive
 ENV DEBCONF_NONINTERACTIVE_SEEN=true
@@ -63,7 +63,7 @@ RUN apt-get update && apt-get install -y \
     zlib1g-dev
 
 
-ARG BASIC_PIP_PKGS="pyarrow==15.0.0 pandas==2.2.0 six==1.16.0 numpy scipy coverage unittest-xml-reporting"
+ARG BASIC_PIP_PKGS="pyarrow==18.0.0 pandas==2.2.0 six==1.16.0 numpy scipy coverage unittest-xml-reporting"
 # Python deps for Spark Connect
 ARG CONNECT_PIP_PKGS="grpcio==1.76.0 grpcio-status==1.76.0 googleapis-common-protos==1.71.0 zstandard==0.25.0 graphviz==0.20 protobuf"

python/docs/source/getting_started/install.rst

Lines changed: 4 additions & 4 deletions

@@ -226,7 +226,7 @@ Installable with ``pip install "pyspark[connect]"``.
 Package                    Supported version Note
 ========================== ================= ==========================
 `pandas`                   >=2.2.0           Required for Spark Connect
-`pyarrow`                  >=15.0.0          Required for Spark Connect
+`pyarrow`                  >=18.0.0          Required for Spark Connect
 `grpcio`                   >=1.76.0          Required for Spark Connect
 `grpcio-status`            >=1.76.0          Required for Spark Connect
 `googleapis-common-protos` >=1.71.0          Required for Spark Connect
@@ -243,7 +243,7 @@ Installable with ``pip install "pyspark[sql]"``.
 Package   Supported version Note
 ========= ================= ======================
 `pandas`  >=2.2.0           Required for Spark SQL
-`pyarrow` >=15.0.0          Required for Spark SQL
+`pyarrow` >=18.0.0          Required for Spark SQL
 ========= ================= ======================
 
 Additional libraries that enhance functionality but are not included in the installation packages:
@@ -260,7 +260,7 @@ Installable with ``pip install "pyspark[pandas_on_spark]"``.
 Package   Supported version Note
 ========= ================= ================================
 `pandas`  >=2.2.0           Required for Pandas API on Spark
-`pyarrow` >=15.0.0          Required for Pandas API on Spark
+`pyarrow` >=18.0.0          Required for Pandas API on Spark
 ========= ================= ================================
 
 Additional libraries that enhance functionality but are not included in the installation packages:
@@ -310,7 +310,7 @@ Installable with ``pip install "pyspark[pipelines]"``. Includes all dependencies
 Package                    Supported version Note
 ========================== ================= ===================================================
 `pandas`                   >=2.2.0           Required for Spark Connect and Spark SQL
-`pyarrow`                  >=15.0.0          Required for Spark Connect and Spark SQL
+`pyarrow`                  >=18.0.0          Required for Spark Connect and Spark SQL
 `grpcio`                   >=1.76.0          Required for Spark Connect
 `grpcio-status`            >=1.76.0          Required for Spark Connect
 `googleapis-common-protos` >=1.71.0          Required for Spark Connect

python/pyspark/sql/classic/dataframe.py

Lines changed: 1 addition & 1 deletion

@@ -2011,7 +2011,7 @@ def _test() -> None:
     import pyarrow as pa
     from pyspark.loose_version import LooseVersion
 
-    if LooseVersion(pa.__version__) < LooseVersion("17.0.0"):
+    if LooseVersion(pa.__version__) < LooseVersion("21.0.0"):
         del pyspark.sql.dataframe.DataFrame.mapInArrow.__doc__
 
     spark = (

python/pyspark/sql/connect/dataframe.py

Lines changed: 1 addition & 1 deletion

@@ -2376,7 +2376,7 @@ def _test() -> None:
     import pyarrow as pa
     from pyspark.loose_version import LooseVersion
 
-    if LooseVersion(pa.__version__) < LooseVersion("17.0.0"):
+    if LooseVersion(pa.__version__) < LooseVersion("21.0.0"):
         del pyspark.sql.dataframe.DataFrame.mapInArrow.__doc__
 
     globs["spark"] = (
