
[SPARK-55350][PYTHON][CONNECT] Fix row count loss when creating DataFrame from pandas with 0 columns #54144

Closed
Yicong-Huang wants to merge 4 commits into apache:master from Yicong-Huang:SPARK-55350/fix/arrow-zero-columns-row-count

Conversation

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

This PR fixes the row count loss issue when creating a Spark DataFrame from a pandas DataFrame with 0 columns in Spark Connect.

The issue occurs due to two PyArrow limitations:

  1. `pa.RecordBatch.from_arrays([], [])` loses row count information
  2. `pa.Table.cast()` on a 0-column table resets the row count to 0

Changes:

  1. Handle 0-column pandas DataFrames separately using `pa.Table.from_struct_array()` to preserve row count
  2. Skip the `cast()` operation for 0-column tables as it loses row count

Why are the changes needed?

Before this fix:

```python
import pandas as pd
from pyspark.sql.types import StructType

pdf = pd.DataFrame(index=range(10))  # 10 rows, 0 columns
df = spark.createDataFrame(pdf, schema=StructType([]))
df.count()  # Returns 0 (wrong!)
```

After this fix:

```python
df.count()  # Returns 10 (correct!)
```

Does this PR introduce any user-facing change?

Yes. Creating a DataFrame from a pandas DataFrame with 0 columns now correctly preserves the row count in Spark Connect.

How was this patch tested?

Added unit test `test_from_pandas_dataframe_with_zero_columns` in `test_connect_creation.py`.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions

github-actions bot commented Feb 4, 2026

JIRA Issue Information

=== Bug SPARK-55350 ===
Summary: Convert from pandas to arrow loses row count when schema has 0 columns
Assignee: None
Status: Open
Affected: ["4.1.0","4.2.0"]


This comment was automatically generated by GitHub Actions

@Yicong-Huang
Contributor Author

cc @ueshin

Member

@ueshin ueshin left a comment


Otherwise LGTM, pending tests.

# Handle the 0-column case separately to preserve row count.
if len(data.columns) == 0:
    # For 0 rows, need explicit struct type; otherwise pa.array infers null type
    if len(data) == 0:
Member


nit: just wondering if we need to branch with `len(data) == 0` here?

Contributor Author



I feel it is necessary: `pa.array([{}] * 0)` would become a `NullArray`. I left comments in the code as well.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I can combine them into:

>>> pa.array([{}] * 0, type=pa.struct([]))
<pyarrow.lib.StructArray object at 0x1115cf700>
-- is_valid: all not null

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Removed handling for 0-column case and simplified table creation.
Handle the case where the input DataFrame has no columns by creating an empty Arrow table with preserved row count.
self.assertEqual(cdf.schema, sdf.schema)
self.assertEqual(cdf.schema, schema)
self.assertEqual(cdf.count(), 5)
self.assertEqual(sdf.count(), 5)
Member


Is this waiting for #54125? What's the plan to address it in classic?

Contributor Author


Yes, I will wait for the refactoring PRs (#54125 is one of them), then fix it in classic.

@ueshin
Member

ueshin commented Feb 5, 2026

Thanks! merging to master.

@ueshin ueshin closed this in 73c3513 Feb 5, 2026
rpnkv pushed a commit to rpnkv/spark that referenced this pull request Feb 18, 2026
…rame from pandas with 0 columns

Closes apache#54144 from Yicong-Huang/SPARK-55350/fix/arrow-zero-columns-row-count.

Lead-authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>