Commit 2ab68d1

asl3 authored and zhengruifeng committed
[SPARK-54859][PYTHON] Arrow by default PySpark UDF API reference doc
### What changes were proposed in this pull request?

Add documentation about the Arrow-by-default enablement in Spark 4.2 for this page: https://spark.apache.org/docs/latest/api/python/reference/pyspark.sql/api/pyspark.sql.functions.udf.html

Also add an example showing how to opt out of Arrow optimization, at both the per-UDF and per-session level.

### Why are the changes needed?

In Spark 4.2.0, Arrow optimization will be enabled for Python UD(T)Fs by default (see [SPARK-54555](https://issues.apache.org/jira/browse/SPARK-54555)). Docs should be updated to note the change and include more code examples.

### Does this PR introduce _any_ user-facing change?

No, this is a documentation-only update.

### How was this patch tested?

Docs build tests.

### Was this patch authored or co-authored using generative AI tooling?

No

Closes #53632 from asl3/pyspark-apiref-arrowudfdoc.

Authored-by: Amanda Liu <[email protected]>
Signed-off-by: Ruifeng Zheng <[email protected]>
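Beyond the per-UDF (`useArrow=False`) and per-session (`spark.conf.set`) opt-outs the message describes, the same config key can be set cluster-wide. A hedged sketch of the equivalent `spark-defaults.conf` entry, assuming the config key shown in the patch:

```properties
# spark-defaults.conf — illustrative fragment, not part of this commit.
# Disables Arrow-based (de)serialization for regular Python UDFs for all
# sessions started against this configuration.
spark.sql.execution.pythonUDF.arrow.enabled  false
```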
1 parent 2fb3359 commit 2ab68d1

File tree

1 file changed (+25, -0)

python/pyspark/sql/functions/builtin.py

Lines changed: 25 additions & 0 deletions
```diff
@@ -28927,6 +28927,9 @@ def udf(
     .. versionchanged:: 4.0.0
         Supports keyword-arguments.
 
+    .. versionchanged:: 4.2.0
+        Uses Arrow by default for (de)serialization.
+
     Parameters
     ----------
     f : function, optional
@@ -29029,6 +29032,28 @@ def udf(
     | 101|
     +--------------------------------+
 
+    Arrow-optimized Python UDFs (default since Spark 4.2):
+
+    Since Spark 4.2, Arrow is used by default for (de)serialization between the JVM
+    and Python for regular Python UDFs.
+
+    Unlike the vectorized Arrow UDFs above that receive and return ``pyarrow.Array`` objects,
+    Arrow-optimized Python UDFs still process data row-by-row with regular Python types,
+    but use Arrow for more efficient data transfer in the (de)serialization process.
+
+    >>> # Arrow optimization is enabled by default since Spark 4.2
+    >>> @udf(returnType=IntegerType())
+    ... def my_udf(x):
+    ...     return x + 1
+    ...
+    >>> # To explicitly disable Arrow optimization and use pickle-based serialization:
+    >>> @udf(returnType=IntegerType(), useArrow=False)
+    ... def legacy_udf(x):
+    ...     return x + 1
+    ...
+    >>> # To disable Arrow optimization for the entire SparkSession:
+    >>> spark.conf.set("spark.sql.execution.pythonUDF.arrow.enabled", "false")  # doctest: +SKIP
+
     See Also
     --------
     :meth:`pyspark.sql.functions.pandas_udf`
```
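The new docstring text stresses that an Arrow-optimized UDF still runs its function body once per row with plain Python values; Arrow only changes the transport layer. A minimal, Spark-free sketch of that semantic point (the function `add_one` and the sample rows are illustrative, not part of the patch):

```python
# Sketch (no Spark required): the body of an Arrow-optimized Python UDF is
# ordinary Python, invoked once per row with regular Python types. Arrow vs.
# pickle only changes how rows are (de)serialized between the JVM and Python,
# not what the function sees.

def add_one(x):
    # Same body you would decorate with @udf(returnType=IntegerType()).
    return x + 1

# Row-at-a-time application, mirroring how the executor calls the UDF
# regardless of whether Arrow serialization is enabled.
rows = [1, 2, 3]
results = [add_one(x) for x in rows]
print(results)  # [2, 3, 4]
```

Because the results are identical either way, toggling `useArrow` is purely a performance choice, not a behavioral one.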
