Commit 2ab68d1

asl3 authored and zhengruifeng committed
[SPARK-54859][PYTHON] Arrow by default PySpark UDF API reference doc
### What changes were proposed in this pull request?

Add documentation about the Arrow-by-default enablement in Spark 4.2 for this page: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html

Also add an example showing how to opt out of Arrow optimization, at both the per-UDF and per-session level.

### Why are the changes needed?

In Spark 4.2.0, Arrow optimization will be enabled for Python UD(T)Fs by default (see [SPARK-54555](https://issues.apache.org/jira/browse/SPARK-54555)). Docs should be updated to note the change and include more code examples.

### Does this PR introduce _any_ user-facing change?

No, this is a documentation-only update.

### How was this patch tested?

Docs build tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53632 from asl3/pyspark-apiref-arrowudfdoc.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
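Beyond the per-UDF (`useArrow=False`) and per-session (`spark.conf.set`) opt-outs the message describes, the same config key can be set cluster-wide. A hedged sketch of the equivalent `spark-defaults.conf` entry, assuming the config key shown in the patch:

```properties
# spark-defaults.conf — illustrative fragment, not part of this commit.
# Disables Arrow-based (de)serialization for regular Python UDFs for all
# sessions started against this configuration.
spark.sql.execution.pythonUDF.arrow.enabled  false
```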
1 parent 2fb3359 commit 2ab68d1

File tree

1 file changed (+25, -0)

python/pyspark/sql/functions/builtin.py

Lines changed: 25 additions & 0 deletions
```diff
@@ -28927,6 +28927,9 @@ def udf(
     .. versionchanged:: 4.0.0
         Supports keyword-arguments.
 
+    .. versionchanged:: 4.2.0
+        Uses Arrow by default for (de)serialization.
+
     Parameters
     ----------
     f : function, optional
@@ -29029,6 +29032,28 @@ def udf(
     | 101|
     +--------------------------------+
 
+    Arrow-optimized Python UDFs (default since Spark 4.2):
+
+    Since Spark 4.2, Arrow is used by default for (de)serialization between the JVM
+    and Python for regular Python UDFs.
+
+    Unlike the vectorized Arrow UDFs above that receive and return ``pyarrow.Array`` objects,
+    Arrow-optimized Python UDFs still process data row-by-row with regular Python types,
+    but use Arrow for more efficient data transfer in the (de)serialization process.
+
+    >>> # Arrow optimization is enabled by default since Spark 4.2
+    >>> @udf(returnType=IntegerType())
+    ... def my_udf(x):
+    ...     return x + 1
+    ...
+    >>> # To explicitly disable Arrow optimization and use pickle-based serialization:
+    >>> @udf(returnType=IntegerType(), useArrow=False)
+    ... def legacy_udf(x):
+    ...     return x + 1
+    ...
+    >>> # To disable Arrow optimization for the entire SparkSession:
+    >>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "false")  # doctest: +SKIP
+
     See Also
     --------
     :meth:`pyspark.sql.functions.pandas_udf`
```
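The new docstring text stresses that an Arrow-optimized UDF still runs its function body once per row with plain Python values; Arrow only changes the transport layer. A minimal, Spark-free sketch of that semantic point (the function `add_one` and the sample rows are illustrative, not part of the patch):

```python
# Sketch (no Spark required): the body of an Arrow-optimized Python UDF is
# ordinary Python, invoked once per row with regular Python types. Arrow vs.
# pickle only changes how rows are (de)serialized between the JVM and Python,
# not what the function sees.

def add_one(x):
    # Same body you would decorate with @udf(returnType=IntegerType()).
    return x + 1

# Row-at-a-time application, mirroring how the executor calls the UDF
# regardless of whether Arrow serialization is enabled.
rows = [1, 2, 3]
results = [add_one(x) for x in rows]
print(results)  # [2, 3, 4]
```

Because the results are identical either way, toggling `useArrow` is purely a performance choice, not a behavioral one.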
