Commit d0fbc4d

icexelloss authored and BryanCutler committed
[SPARK-28003][PYTHON] Allow NaT values when creating Spark dataframe from pandas with Arrow
## What changes were proposed in this pull request?

This patch removes `fillna(0)` when creating an ArrowBatch from a pandas Series. With `fillna(0)`, the original code turned a timestamp Series into object dtype, which pyarrow later complains about:

```
>>> s = pd.Series([pd.NaT, pd.Timestamp('2015-01-01')])
>>> s.dtypes
dtype('<M8[ns]')
>>> s.fillna(0)
0                      0
1    2015-01-01 00:00:00
dtype: object
```

## How was this patch tested?

Added `test_timestamp_nat`.

Closes apache#24844 from icexelloss/SPARK-28003-arrow-nat.

Authored-by: Li Jin <[email protected]>
Signed-off-by: Bryan Cutler <[email protected]>
1 parent 9df7587 commit d0fbc4d

2 files changed, 10 additions and 2 deletions

python/pyspark/serializers.py

Lines changed: 1 addition & 2 deletions
@@ -296,8 +296,7 @@ def create_array(s, t):
     mask = s.isnull()
     # Ensure timestamp series are in expected form for Spark internal representation
     if t is not None and pa.types.is_timestamp(t):
-        s = _check_series_convert_timestamps_internal(s.fillna(0), self._timezone)
-
+        s = _check_series_convert_timestamps_internal(s, self._timezone)
     try:
         array = pa.Array.from_pandas(s, mask=mask, type=t, safe=self._safecheck)
     except pa.ArrowException as e:

python/pyspark/sql/tests/test_arrow.py

Lines changed: 9 additions & 0 deletions
@@ -383,6 +383,15 @@ def test_timestamp_dst(self):
         assert_frame_equal(pdf, df_from_python.toPandas())
         assert_frame_equal(pdf, df_from_pandas.toPandas())
 
+    # Regression test for SPARK-28003
+    def test_timestamp_nat(self):
+        dt = [pd.NaT, pd.Timestamp('2019-06-11'), None] * 100
+        pdf = pd.DataFrame({'time': dt})
+        df_no_arrow, df_arrow = self._createDataFrame_toggle(pdf)
+
+        assert_frame_equal(pdf, df_no_arrow.toPandas())
+        assert_frame_equal(pdf, df_arrow.toPandas())
+
     def test_toPandas_batch_order(self):
 
         def delay_first_part(partition_index, iterator):
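
The test input mixes `pd.NaT` and `None`. As a small pandas-only sketch (no Spark session involved, variable names chosen here for illustration), the snippet below shows what that input looks like once it is in a DataFrame: both kinds of missing value end up as nulls in a datetime column, which is the shape the Arrow path has to handle.

```python
import pandas as pd

# Same input as the regression test: NaT, a real timestamp, and None, repeated.
dt = [pd.NaT, pd.Timestamp("2019-06-11"), None] * 100
pdf = pd.DataFrame({"time": dt})

# Count the missing entries; None and NaT are both reported as null.
null_count = int(pdf["time"].isnull().sum())
is_datetime = pd.api.types.is_datetime64_any_dtype(pdf["time"])
print(pdf["time"].dtype, null_count, is_datetime)
```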
