H0TB0X420
Contributor

Which issue does this PR close?

Closes #1227

Rationale for this change

(From original issue)
PyArrow is a massive dependency (>100MB unpacked) and the only required dependency for datafusion-python. Many Python Arrow libraries implement the PyCapsule Interface, allowing users to choose lightweight alternatives like nanoarrow (~7MB), arro3, or pass data directly from Polars, DuckDB, etc.

This PR implements the first phase of making PyArrow optional by updating input parameters to accept any Arrow-compatible library via the PyCapsule Interface.

What changes are included in this PR?

  • Add Protocol types for the Arrow PyCapsule Interface (ArrowSchemaExportable)
  • Update schema parameters in register_csv, register_parquet, register_json, register_avro, register_listing_table, and read methods to accept ArrowSchemaExportable
  • Move the pyarrow import into a TYPE_CHECKING block (needed only for type hints, so optional at runtime)
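
As a rough illustration of the pattern these bullets describe, here is a minimal sketch and not the code from this PR. ArrowSchemaExportable and __arrow_c_schema__ are the names used above; FakeSchema and schema_source are hypothetical, added only so the example runs without pyarrow installed.

```python
from __future__ import annotations

from typing import TYPE_CHECKING, Protocol, runtime_checkable

if TYPE_CHECKING:
    # Seen only by type checkers, so pyarrow is not imported at runtime here.
    import pyarrow as pa


@runtime_checkable
class ArrowSchemaExportable(Protocol):
    """Any object that exports its schema via the Arrow PyCapsule Interface."""

    def __arrow_c_schema__(self) -> object: ...


class FakeSchema:
    """Hypothetical stand-in for a schema from nanoarrow, arro3, Polars, etc."""

    def __arrow_c_schema__(self) -> object:
        # A real library returns a PyCapsule named "arrow_schema";
        # a plain object is used here as a placeholder.
        return object()


def schema_source(schema: ArrowSchemaExportable | pa.Schema | None = None) -> str:
    """Hypothetical helper: classify a schema argument without importing pyarrow."""
    if schema is None:
        return "inferred"
    # runtime_checkable protocols only check that the method exists.
    if isinstance(schema, ArrowSchemaExportable):
        return "pycapsule"
    raise TypeError("schema must implement __arrow_c_schema__")
```

Because annotations are postponed, `pa.Schema` in the signature is never evaluated at runtime, which is what lets the import stay inside the TYPE_CHECKING block.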

Note: This PR covers input parameters only. Return types (ToPyArrow conversions) still reference pyarrow and will be addressed in a follow-up PR.

Are there any user-facing changes?

Breaking changes: None. All existing PyArrow usage continues to work.

New functionality: Users can now pass Arrow schemas from any library implementing __arrow_c_schema__() (nanoarrow, arro3, Polars, DuckDB, etc.) to datafusion methods.

Type hints: Schema parameters now show ArrowSchemaExportable | None instead of pa.Schema | None, but accept both.

- Add Protocol types for Arrow PyCapsule Interface
- Update schema parameters to accept any Arrow-compatible library
- Move pyarrow to TYPE_CHECKING (optional at runtime)
Member

@kylebarron kylebarron left a comment

This is a great start

We should add some tests to ensure that we can pass in another library's objects and that they work out of the box.
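
One possible shape for such tests, as a hedged sketch only: wrap a schema in an object that exposes nothing but __arrow_c_schema__, so the consumer cannot rely on any pyarrow-specific attribute. CapsuleOnlySchema, StubSchema, and accept_schema are all hypothetical; in a real test the inner object would be a schema from nanoarrow or a similar library, and the consumer would be one of the datafusion methods touched by this PR.

```python
class StubSchema:
    """Hypothetical inner schema; a real one would return an Arrow PyCapsule."""

    def __arrow_c_schema__(self):
        return "arrow_schema capsule placeholder"


class CapsuleOnlySchema:
    """Wrapper that hides everything except the PyCapsule export method."""

    def __init__(self, inner):
        self._inner = inner

    def __arrow_c_schema__(self):
        return self._inner.__arrow_c_schema__()


def accept_schema(schema):
    """Hypothetical consumer standing in for a method such as register_csv."""
    exporter = getattr(schema, "__arrow_c_schema__", None)
    if exporter is None:
        raise TypeError("schema must implement __arrow_c_schema__")
    return exporter()


def test_foreign_schema_is_accepted():
    wrapped = CapsuleOnlySchema(StubSchema())
    # The wrapper has no pyarrow-style attributes to fall back on.
    assert not hasattr(wrapped, "names")
    assert accept_schema(wrapped) == "arrow_schema capsule placeholder"


def test_non_arrow_object_is_rejected():
    try:
        accept_schema(object())
    except TypeError:
        return
    raise AssertionError("expected TypeError")
```

The wrapper approach keeps the test independent of which third-party Arrow library happens to be installed.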

@timsaucer
Member

I'm not sure we can remove the dependency unless we do something about our auto-literal conversion. I suspect there are other places.

https://github.com/apache/datafusion-python/blob/main/python/datafusion/expr.py#L563

@kylebarron
Member

Perhaps we should add support for the protocol without removing the pyarrow dependency yet

@H0TB0X420
Contributor Author

I've updated the PR based on your feedback. PyArrow will stay required for now. That way people can at least use nanoarrow/Polars schemas, and we can tackle making PyArrow fully optional down the road.

All tests passing! Let me know if you'd like any other changes.

I am also happy to break this into smaller steps and handle the full optional-PyArrow work in follow-up issues.

Development

Successfully merging this pull request may close these issues.

Remove pyarrow as required dependency, relying on Arrow PyCapsule Interface
