[SPARK-55349][PYTHON] Consolidate pandas-to-Arrow conversion utilities in serializers #54125
Conversation
JIRA Issue Information: Umbrella SPARK-55159. (This comment was automatically generated by GitHub Actions.)
timezone=self._timezone,
prefers_large_types=self._prefers_large_types,
)
if spark_type is not None
In what case will spark_type be None?

There is no case now! All inputs are valid Spark types. I initially did not want to make this change in this PR (just wanted to move the method out). I have now updated it.
coerced_array = self._create_array(original_array, field.type)
coerced_arrays.append(coerced_array)
coerced_arrays = [
    ArrowBatchTransformer.cast_array(
Should cast_array be in ArrowBatchTransformer? Is it for batch?

Good catch! It is NOT for batch. I had TODO comments. In the POC, this method will be replaced by the enforce_schema transformer. In this PR I wanted to move this method outside. Next I will make it a transformer.

I moved this out of the transformer.
int_to_decimal_coercion_enabled: bool = False,
prefers_large_types: bool = False,
ignore_unexpected_complex_type_values: bool = False,
error_class: Optional[str] = None,
Where is error_class from?

It is needed for UDTF, which requires a different error message. I added comments here. This can be cleaned up when we move the transform/convert logic out of the serializers; then UDTF will be able to set the error message locally.

I changed this to use an is_udtf flag. The logic for UDTF cast error handling is quite different from other UDFs. See the comments for details. We can align the logic in the future.
Also cc @gaogaotiantian, please help review.

I want to include more changes; turning it into a draft for now.
python/pyspark/sql/conversion.py (outdated)
errorClass=error_class,
messageParameters={
    "expected": str(target_type),
    "actual": str(arr.type),
There's a strong assumption here that the template has "expected" and "actual", which feels a bit weird to me.

That is unfortunately the current behavior on master. I would prefer to keep it the same for this PR; we can definitely improve it later.
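For context, a minimal sketch of how such a cast failure could be surfaced under that assumption. The error class name is a stand-in (it has to exist in PySpark's error registry), and PySparkValueError is borrowed from a later snippet in this thread rather than being the exact class used on master:

```python
import pyarrow as pa
from pyspark.errors import PySparkValueError

def _cast_or_raise(arr: pa.Array, target_type: pa.DataType, error_class: str) -> pa.Array:
    # Try the Arrow-level cast; on failure, surface the mismatch via the
    # "expected"/"actual" parameters assumed by the message template.
    try:
        return arr.cast(target_type)
    except pa.ArrowInvalid as e:
        raise PySparkValueError(
            errorClass=error_class,  # placeholder; must map to a real error entry
            messageParameters={
                "expected": str(target_type),
                "actual": str(arr.type),
            },
        ) from e
```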
    series_tuples: List[Tuple["pd.Series", DataType]] = [packed]
else:
    # multiple UDF results: already iterable of tuples
    series_tuples = list(packed)
We have a lot of random conversions to list - why is that preferred? I think tuple should be used when possible (or keep it as it is if conversion is unnecessary). Immutable objects are always better - including for the input data - we should at least accept either.

The call sites of this method (i.e., the eval type wrappers in worker.py) currently return a list. I agree we can refactor this, but the change would be too large to include in this PR. We can gradually change it when we move this logic out of the serializer.
(Force-pushed 7d30905 to 8549ab6, then 8549ab6 to a10f02f.)
def create_batch(
    packed: Union[
        Tuple["pd.Series", DataType],
        Tuple[Tuple["pd.Series", DataType], ...],
Can we clearly define the input type and avoid the usage of Union here?

The mixture of input types is also a source of confusion.

Addressed: extracted a _normalize_packed helper to normalize the input upfront, as sketched below. Now create_batch always receives a uniform tuple-of-tuples form. Changing all the call sites (the eval type wrappers in worker.py) to always produce uniform output is a larger change that would be out of scope for this PR. We'll extract that logic out to each eval type in the future.
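A rough sketch of what that normalization could look like; the helper name comes from this thread, but the signature and the type check are assumptions:

```python
from typing import Tuple, Union

import pandas as pd
from pyspark.sql.types import DataType

def _normalize_packed(
    packed: Union[
        Tuple["pd.Series", DataType],
        Tuple[Tuple["pd.Series", DataType], ...],
    ]
) -> Tuple[Tuple["pd.Series", DataType], ...]:
    # A single UDF result arrives as a bare (series, data_type) pair;
    # wrap it so create_batch always sees a tuple of pairs.
    if len(packed) == 2 and isinstance(packed[1], DataType):
        return (packed,)
    # Multiple UDF results: already an iterable of (series, data_type) pairs.
    return tuple(packed)
```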
Parameters
----------
data : pd.DataFrame or list of pd.Series/pd.DataFrame
In what case is the input a list of DataFrames?

A list of DataFrames is used in stateful processing (e.g., applyInPandasWithState), where the batch contains multiple DataFrames representing different parts of the output (count, data, and state), each wrapped as a StructArray column. Updated the docstring to clarify this.
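For illustration, a hedged sketch of packing a whole DataFrame into one Arrow struct column; the field names are made up and the real state schema lives elsewhere:

```python
import pandas as pd
import pyarrow as pa

def dataframe_to_struct_array(pdf: pd.DataFrame) -> pa.StructArray:
    # Convert each pandas column to an Arrow array, then pack them into a
    # single StructArray so the whole frame occupies one column of the batch.
    arrays = [pa.Array.from_pandas(pdf[name]) for name in pdf.columns]
    return pa.StructArray.from_arrays(arrays, names=[str(name) for name in pdf.columns])

# Hypothetical "state" part of an applyInPandasWithState output:
state_pdf = pd.DataFrame({"groupKey": ["a"], "timeoutTimestamp": [0]})
struct_col = dataframe_to_struct_array(state_pdf)  # one column holding the whole frame
```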
    Whether to enable int to decimal coercion (default False)
ignore_unexpected_complex_type_values : bool
    Whether to ignore unexpected complex type values in converter (default False)
is_udtf : bool
Please add a TODO with a JIRA to unify it in the future.
python/pyspark/sql/conversion.py (outdated)
# Handle empty schema (0 columns)
# Use dummy column + select([]) to preserve row count (PyArrow limitation workaround)
if not schema.fields:
What does not schema.fields mean? Is fields None?

schema.fields is a list, so not schema.fields checks for an empty list (0 columns). Changed it to len(schema.fields) == 0 for clarity.
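A small sketch of the workaround mentioned in that comment, assuming a PyArrow version where RecordBatch.select is available:

```python
import pyarrow as pa

def empty_schema_batch(num_rows: int) -> pa.RecordBatch:
    # PyArrow cannot build a zero-column RecordBatch with a nonzero row count
    # directly, so create a dummy column of the desired length, then drop every
    # column with select([]) while keeping the row count.
    dummy = pa.nulls(num_rows, type=pa.int64())
    batch = pa.RecordBatch.from_arrays([dummy], ["_dummy"])
    return batch.select([])  # 0 columns, num_rows preserved
```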
def create_batch(
    packed: Union[
        Tuple["pa.Array", "pa.DataType"],
        List[Tuple["pa.Array", "pa.DataType"]],
Can we unify the input type to always match the multiple-UDF case?

Same as above: added a local normalize helper so create_batch always receives a list of (arr, type) tuples. Unifying the call sites to always produce the multi-UDF form will be done when we extract the logic to each eval type.
# Basic DataFrame conversion
df = pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]})
schema = StructType([StructField("a", IntegerType()), StructField("b", DoubleType())])
result = PandasToArrowConversion.convert(df, schema)
I see you added a batch of tests here; is the convert method the final version?

The convert API is stable. Future PRs will focus on internal improvements (e.g., SPARK-55502 to eliminate is_udtf, and elevating coerce_arrow_array to batch level). The tests cover the core paths and will remain valid through those changes.
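Building on the quoted test, a hedged usage sketch of the new entry point; the exact return type is not pinned down here, and the sequence-of-Series call follows the input shapes described later in this thread:

```python
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType
from pyspark.sql.conversion import PandasToArrowConversion

schema = StructType([StructField("a", IntegerType()), StructField("b", DoubleType())])

# Single DataFrame, as in the quoted test; the result is an Arrow batch
# covering both columns of the schema.
batch = PandasToArrowConversion.convert(
    pd.DataFrame({"a": [1, 2, 3], "b": [1.0, 2.0, 3.0]}), schema
)

# Sequence of Series, one per field of the schema (an input shape the author
# says is supported; see the discussion below).
batch_from_series = PandasToArrowConversion.convert(
    [pd.Series([1, 2, 3]), pd.Series([1.0, 2.0, 3.0])], schema
)
```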
# TODO: elevate to ArrowBatchTransformer and operate on full RecordBatch schema
# instead of per-column coercion.
def coerce_arrow_array(
We have an ArrowArrayConversion; should this function be in it?

There's already a TODO to elevate this to ArrowBatchTransformer so it operates on the full RecordBatch schema instead of doing per-column coercion. Moving it to ArrowArrayConversion first would just add an extra migration step. I'll address this when we do the batch-level coercion refactor.
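To make the TODO a little more concrete, a hedged sketch of the difference between per-column coercion and the batch-level version it points towards; this assumes a PyArrow version with RecordBatch.cast and leaves out the error handling the real code needs:

```python
import pyarrow as pa

# Per-column coercion, roughly how it works today: cast each array to the
# corresponding field type of the target schema, one column at a time.
def coerce_batch_per_column(batch: pa.RecordBatch, target_schema: pa.Schema) -> pa.RecordBatch:
    coerced = [
        col if col.type == field.type else col.cast(field.type)
        for col, field in zip(batch.columns, target_schema)
    ]
    return pa.RecordBatch.from_arrays(coerced, schema=target_schema)

# What the TODO points towards: let Arrow cast the whole batch against the
# target schema in one call instead of looping per column.
def coerce_batch_whole(batch: pa.RecordBatch, target_schema: pa.Schema) -> pa.RecordBatch:
    return batch.cast(target_schema)
```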
mask = None if hasattr(series.array, "__arrow_array__") else series.isnull()

if is_udtf:
Shouldn't the is_udtf handling be inside coerce_arrow_array?

The is_udtf special handling is in the from_pandas stage (catching the broader ArrowException instead of just ArrowInvalid), not in the .cast() stage that coerce_arrow_array handles, so it fits better in series_to_array. This flag will be eliminated via SPARK-55502 when we unify the UDTF and regular UDF conversion paths.
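A hedged sketch of those two stages with simplified names and error handling (the real code raises proper PySpark errors rather than a bare ValueError):

```python
import pandas as pd
import pyarrow as pa

def series_to_array_sketch(series: pd.Series, target_type: pa.DataType,
                           is_udtf: bool = False) -> pa.Array:
    # Stage 1: pandas -> Arrow. Extension arrays convert themselves, so no mask there.
    mask = None if hasattr(series.array, "__arrow_array__") else series.isnull()
    try:
        arr = pa.Array.from_pandas(series, mask=mask)
    except pa.ArrowException as e:
        if is_udtf:
            # The UDTF path catches the broader ArrowException and rewords the error.
            raise ValueError(f"UDTF output could not be converted to Arrow: {e}") from e
        raise
    # Stage 2: cast to the target type; this is the part coerce_arrow_array covers,
    # where only ArrowInvalid needs handling.
    return arr.cast(target_type)
```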
python/pyspark/sql/conversion.py (outdated)
)
raise PySparkValueError(error_msg) from e

def convert_column(
Can we consolidate convert_column and series_to_array?
python/pyspark/sql/conversion.py (outdated)
    ignore_unexpected_complex_type_values=ignore_unexpected_complex_type_values,
    is_udtf=is_udtf,
)
return ArrowBatchTransformer.wrap_struct(nested_batch).column(0)
This line really takes me a few seconds to figure out what it does.

Understood. Added a comment to clarify; a sketch of what it boils down to is below. We can revisit the wrap_struct transformer in the future.
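Roughly, what that line boils down to (a sketch, not the actual transformer):

```python
import pyarrow as pa

def wrap_struct_sketch(nested_batch: pa.RecordBatch) -> pa.StructArray:
    # Take a record batch whose columns are the struct's fields and pack them
    # into a single StructArray, i.e. one column holding the whole nested row.
    return pa.StructArray.from_arrays(
        nested_batch.columns, names=nested_batch.schema.names
    )

# ArrowBatchTransformer.wrap_struct(nested_batch).column(0) then just returns
# that single struct column.
```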
…tor/consolidate-pandas-to-arrow
# Conflicts:
#   python/pyspark/sql/pandas/serializers.py
gaogaotiantian left a comment:

I have two very minor comments but overall it's good to me.
@classmethod
def convert(
    cls,
    data: Union["pd.DataFrame", Sequence[Union["pd.Series", "pd.DataFrame"]]],
nit: Should the type be Sequence[Union["pd.Series", "pd.DataFrame"]] or Union[Sequence["pd.Series"], Sequence["pd.DataFrame"]]?
It actually can support:
- a single DataFrame;
- a sequence of Series;
- a sequence of DataFrames (each DataFrame will be wrapped into a single column).
We could further optimize this method in the future, but currently it needs to support all use cases.
Then type-hint-wise it's the latter one. Sequence[Union["pd.Series", "pd.DataFrame"]] means a sequence whose elements can each be a Series or a DataFrame, i.e. something like [series, dataframe]. The latter means either a sequence of Series or a sequence of DataFrames, but not a sequence of elements that can be either.
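Spelled out as plain type hints, the distinction looks like this:

```python
from typing import Sequence, Union

import pandas as pd

# A sequence whose elements may each be a Series or a DataFrame (mixing allowed):
MixedElements = Sequence[Union[pd.Series, pd.DataFrame]]
# e.g. [pd.Series([1]), pd.DataFrame({"a": [1]})] type-checks against this.

# Either a sequence of Series or a sequence of DataFrames, but never a mix:
HomogeneousSequences = Union[Sequence[pd.Series], Sequence[pd.DataFrame]]
# e.g. [pd.Series([1]), pd.Series([2])] or [pd.DataFrame(...), pd.DataFrame(...)].
```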
# then extract columns as a list for uniform iteration.
columns: List[Union["pd.Series", "pd.DataFrame"]]
if isinstance(data, pd.DataFrame):
    if assign_cols_by_name and any(isinstance(c, str) for c in data.columns):
If assign_cols_by_name is True but the columns do not have names, what happens? Is the fallback behavior of ignoring assign_cols_by_name expected?

It will fall back to position (index) based column references. This is intended, the same as on master.
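A small sketch of that fallback, mirroring the condition in the quoted diff; the helper is illustrative, not the actual code:

```python
import pandas as pd
from pyspark.sql.types import StructType, StructField, IntegerType, DoubleType

def select_columns(pdf: pd.DataFrame, schema: StructType, assign_cols_by_name: bool):
    # Use names only when the flag is set and the DataFrame actually has string
    # column labels; otherwise select columns by position.
    if assign_cols_by_name and any(isinstance(c, str) for c in pdf.columns):
        return [pdf[field.name] for field in schema.fields]      # by name
    return [pdf.iloc[:, i] for i in range(len(schema.fields))]   # by position

schema = StructType([StructField("a", IntegerType()), StructField("b", DoubleType())])
unnamed = pd.DataFrame([[1, 1.0], [2, 2.0]])  # integer column labels 0, 1
cols = select_columns(unnamed, schema, assign_cols_by_name=True)  # falls back to position
```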
Merged to master.
What changes were proposed in this pull request?
Introduce PandasToArrowConversion.convert() in conversion.py to centralize the pandas-to-Arrow conversion logic previously duplicated across multiple serializers. Also extract cast_arrow_array() as a standalone utility for Arrow array type casting.

Serializers (ArrowStreamPandasSerializer, ArrowStreamPandasUDFSerializer, ArrowStreamPandasUDTFSerializer, etc.) now delegate to these shared utilities instead of maintaining their own _create_array, _create_batch, and _create_struct_array methods.

Why are the changes needed?
Part of SPARK-55159. The same conversion logic was duplicated across 5+ serializer classes, making it hard to maintain. This reduces ~450 lines of duplication.
Does this PR introduce any user-facing change?
No.
How was this patch tested?
New unit tests in test_conversion.py for PandasToArrowConversion, plus existing UDF/UDTF tests.

Was this patch authored or co-authored using generative AI tooling?
No.