[SPARK-53743][SS] Remove the usage of fetchWithArrow in ListState.put/appendList
### What changes were proposed in this pull request?
This PR removes the use of fetchWithArrow in ListState.put/appendList.
(fetchWithArrow itself and its proto definition are kept: dropping them would not remove noticeable complexity, and removing something from the proto could have unexpected side effects on compatibility.)
### Why are the changes needed?
We have observed a case where the Arrow path for sending the list fails, while the normal path does not.
The case is a `None` value in an IntegerType() field of a list state element. The column is declared with nullable=True, so `None` should be allowed, but an error is raised during the conversion:
```
File "/databricks/spark/python/pyspark/sql/streaming/stateful_processor.py", line 147, in put
self._listStateClient.put(self._stateName, newState)
File "/databricks/spark/python/pyspark/sql/streaming/list_state_client.py", line 195, in put
self._stateful_processor_api_client._send_arrow_state(self.schema, values)
File "/spark/python/pyspark/sql/streaming/stateful_processor_api_client.py", line 604, in _send_arrow_state
pandas_df = convert_pandas_using_numpy_type(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/spark/python/pyspark/sql/pandas/types.py", line 1599, in convert_pandas_using_numpy_type
df[field.name] = df[field.name].astype(np_type)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/python/lib/python3.12/site-packages/pandas/core/generic.py", line 6643, in astype
new_data = self._mgr.astype(dtype=dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/python/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 430, in astype
return self.apply(
^^^^^^^^^^^
File "/python/lib/python3.12/site-packages/pandas/core/internals/managers.py", line 363, in apply
applied = getattr(b, f)(**kwargs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/python/lib/python3.12/site-packages/pandas/core/internals/blocks.py", line 758, in astype
new_values = astype_array_safe(values, dtype, copy=copy, errors=errors)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 237, in astype_array_safe
new_values = astype_array(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 182, in astype_array
values = _astype_nansafe(values, dtype, copy=copy)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/python/lib/python3.12/site-packages/pandas/core/dtypes/astype.py", line 133, in _astype_nansafe
return arr.astype(dtype, copy=True)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^
TypeError: int() argument must be a string, a bytes-like object or a real number, not 'NoneType'
```
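The last frames of the traceback can be reproduced outside Spark. The following is a minimal sketch, not the Spark code itself: it assumes the column reaches `astype` as an object-dtype pandas Series holding a Python `None`, which is what the error message suggests. The helper name `try_cast` is hypothetical and exists only for this illustration.

```python
import numpy as np
import pandas as pd


def try_cast(values, np_type):
    """Cast an object-dtype Series to a NumPy dtype.

    Returns the exception class name if the cast fails, or None on success.
    (Hypothetical helper for illustration; not part of PySpark.)
    """
    series = pd.Series(values, dtype=object)
    try:
        series.astype(np_type)
    except (TypeError, ValueError) as exc:
        return type(exc).__name__
    return None


# Plain integers cast to a NumPy integer dtype without trouble.
no_nulls = try_cast([1, 2, 3], np.int32)

# A None element cannot be represented in a NumPy integer dtype,
# so the cast fails even though the Spark column is nullable.
with_null = try_cast([1, None, 3], np.int32)

# For comparison, pandas' nullable extension dtype does accept a missing value.
nullable = pd.Series([1, None, 3], dtype="Int32")

print(no_nulls, with_null, nullable.isna().tolist())
```

NumPy integer dtypes have no representation for a missing value, so a nullable Spark column does not map cleanly onto them. The nullable `Int32` dtype is shown only as a contrast; this PR does not adopt it and instead stops using the Arrow path.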
Since we don't know how useful the Arrow-based path for sending the list is, it is better to remove it for now than to try to fix the issue in the Arrow code path.
### Does this PR introduce _any_ user-facing change?
No.
### How was this patch tested?
Updated the existing test to cover the observed case.
### Was this patch authored or co-authored using generative AI tooling?
No.
Closes #52479 from HeartSaVioR/SPARK-53743.
Authored-by: Jungtaek Lim <[email protected]>
Signed-off-by: Jungtaek Lim <[email protected]>
sql/core/src/main/scala/org/apache/spark/sql/execution/python/streaming/TransformWithStateInPySparkStateServer.scala (8 additions, 0 deletions)
@@ -490,6 +490,10 @@ class TransformWithStateInPySparkStateServer(