
[SPARK-55350][PYTHON][CONNECT] Fix row count loss when creating DataFrame from pandas with 0 columns #54144

Closed
Yicong-Huang wants to merge 4 commits into apache:master from Yicong-Huang:SPARK-55350/fix/arrow-zero-columns-row-count

Conversation

@Yicong-Huang
Contributor

What changes were proposed in this pull request?

This PR fixes the row count loss issue when creating a Spark DataFrame from a pandas DataFrame with 0 columns in Spark Connect.

The issue occurs due to two PyArrow limitations:

  1. `pa.RecordBatch.from_arrays([], [])` loses row count information
  2. `pa.Table.cast()` on a 0-column table resets the row count to 0

Changes:

  1. Handle 0-column pandas DataFrames separately using `pa.Table.from_struct_array()` to preserve row count
  2. Skip the `cast()` operation for 0-column tables as it loses row count

Why are the changes needed?

Before this fix:

```python
import pandas as pd
from pyspark.sql.types import StructType

pdf = pd.DataFrame(index=range(10))  # 10 rows, 0 columns
df = spark.createDataFrame(pdf, schema=StructType([]))
df.count()  # Returns 0 (wrong!)
```

After this fix:

```python
df.count()  # Returns 10 (correct!)
```

Does this PR introduce any user-facing change?

Yes. Creating a DataFrame from a pandas DataFrame with 0 columns now correctly preserves the row count in Spark Connect.

How was this patch tested?

Added unit test `test_from_pandas_dataframe_with_zero_columns` in `test_connect_creation.py`.

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions

github-actions bot commented Feb 4, 2026

JIRA Issue Information

=== Bug SPARK-55350 ===
Summary: Convert from pandas to arrow loses row count when schema has 0 columns
Assignee: None
Status: Open
Affected: ["4.1.0","4.2.0"]


This comment was automatically generated by GitHub Actions

@Yicong-Huang
Contributor Author

cc @ueshin

Member

@ueshin ueshin left a comment


Otherwise LGTM, pending tests.

# Handle the 0-column case separately to preserve row count.
if len(data.columns) == 0:
    # For 0 rows, need explicit struct type; otherwise pa.array infers null type
    if len(data) == 0:
Member


nit: just wondering if we need to branch with `len(data) == 0` here?

Contributor Author



I feel it is necessary: `pa.array([{}] * 0)` would become a `NullArray`. I left comments in the code as well.

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

oh, I can combine them into:

>>> pa.array([{}] * 0, type=pa.struct([]))
<pyarrow.lib.StructArray object at 0x1115cf700>
-- is_valid: all not null

Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

changed

Removed handling for 0-column case and simplified table creation.
Handle the case where the input DataFrame has no columns by creating an empty Arrow table with preserved row count.
self.assertEqual(cdf.schema, sdf.schema)
self.assertEqual(cdf.schema, schema)
self.assertEqual(cdf.count(), 5)
self.assertEqual(sdf.count(), 5)
Member


Is this waiting for #54125? What's the plan to address it in classic?

Contributor Author


Yes, I will wait for the refactoring PRs (#54125 is one of them), then fix it in classic.

@ueshin
Member

ueshin commented Feb 5, 2026

Thanks! merging to master.

@ueshin ueshin closed this in 73c3513 Feb 5, 2026
rpnkv pushed a commit to rpnkv/spark that referenced this pull request Feb 18, 2026
…rame from pandas with 0 columns

Closes apache#54144 from Yicong-Huang/SPARK-55350/fix/arrow-zero-columns-row-count.

Lead-authored-by: Yicong-Huang <17627829+Yicong-Huang@users.noreply.github.com>
Co-authored-by: Yicong Huang <17627829+Yicong-Huang@users.noreply.github.com>
Signed-off-by: Takuya Ueshin <ueshin@databricks.com>