Skip to content

Comments

[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source#54362

Open
Yicong-Huang wants to merge 3 commits intoapache:masterfrom
Yicong-Huang:SPARK-55583/wrap-arrow-error-python-datasource
Open

[SPARK-55583][PYTHON] Validate Arrow schema types in Python data source#54362
Yicong-Huang wants to merge 3 commits intoapache:masterfrom
Yicong-Huang:SPARK-55583/wrap-arrow-error-python-datasource

Conversation

@Yicong-Huang
Copy link
Contributor

@Yicong-Huang Yicong-Huang commented Feb 18, 2026

What changes were proposed in this pull request?

This PR adds Arrow schema type validation for the pa.RecordBatch code path in Python data source reads. The fix adds a pa_schema.equals(first_element.schema) check after the existing column name validation in records_to_arrow_batches(), raising a clear DATA_SOURCE_RETURN_SCHEMA_MISMATCH error with the expected and actual Arrow schemas.

Why are the changes needed?

When a Python data source returns a pa.RecordBatch with data types that don't match the declared schema, the resulting JVM-side errors are confusing and do not indicate the root cause. For example:

  • IllegalArgumentException: not all nodes, buffers and variadicBufferCounts were consumed from VectorLoader.load()
  • UnsupportedOperationException: Cannot call the method "getUTF8String" of ArrowColumnVector$ArrowVectorAccessor

These errors give no indication that the issue is a schema type mismatch in the Python data source's read() method.

Does this PR introduce any user-facing change?

Yes. Previously, returning a pa.RecordBatch with mismatched types from a Python data source would result in cryptic JVM errors. Now it raises a clear DATA_SOURCE_RETURN_SCHEMA_MISMATCH error showing the expected and actual Arrow schemas.

How was this patch tested?

Added a test case in test_python_datasource.py::test_arrow_batch_data_source.

Was this patch authored or co-authored using generative AI tooling?

No

@Yicong-Huang
Copy link
Contributor Author

cc @allisonwang-db

@Yicong-Huang Yicong-Huang changed the title [SPARK-55583][PYTHON] Validate Arrow schema types in Python data source RecordBatch path [SPARK-55583][PYTHON] Validate Arrow schema types in Python data source Feb 18, 2026
@Yicong-Huang Yicong-Huang force-pushed the SPARK-55583/wrap-arrow-error-python-datasource branch from 3e695b7 to 1055807 Compare February 19, 2026 00:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant