Contributor

@kosiew kosiew commented Sep 18, 2025

Which issue does this PR close?


Rationale for this change

This change consolidates and modernizes table provider registration in DataFusion's Python bindings. Previously, there were multiple ad-hoc mechanisms (register_table_provider, Table.from_view(), direct Table or pycapsule usage) that led to confusing APIs, inconsistent behaviors, and fragmented documentation.

This PR introduces a clean, centralized approach using the high-level Table wrapper class and a normalization layer that supports multiple table provider inputs, including:

  • Native DataFusion Table objects
  • FFI-exported providers via pycapsules
  • PyArrow Datasets
  • DataFrame views

By consolidating registration into SessionContext.register_table() and extending Schema.register_table() to match, we simplify the user experience, reduce internal complexity, and align the API more closely with Pythonic expectations.


What changes are included in this PR?

High-level Summary

  • Introduces a new high-level Python API: datafusion.Table

    • Supports .from_capsule(), .from_dataframe(), and .from_dataset()
  • Deprecates SessionContext.register_table_provider() in favor of register_table()

  • Deprecates Table.from_view() in favor of DataFrame.into_view() and Table.from_dataframe()

  • Updates Schema.register_table() to support any object implementing __datafusion_table_provider__ and pyarrow.dataset.Dataset

  • Adds _normalize_table_provider utility to coerce supported input types

  • Centralizes coercion logic in Rust with coerce_table_provider and table_provider_from_pycapsule()

  • Enhances documentation and examples to reflect modern registration idioms

  • Improves test coverage for new usage patterns and coercion logic

  • Introduces datafusion.EXPECTED_PROVIDER_MSG for stable, testable error messages
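The normalization layer listed above can be pictured as a small duck-typing dispatch. What follows is a minimal, stdlib-only sketch under stated assumptions, not the PR's implementation: `FakeTable`, `FakeCapsuleExporter`, `FakeDataFrame`, `normalize_table_provider`, and the message wording are hypothetical stand-ins. Only the `__datafusion_table_provider__` hook name and the rule that an unconverted DataFrame is rejected come from the description. PyArrow datasets are left out to keep the sketch dependency-free.

```python
"""Hypothetical sketch of input normalization for table registration."""

# Assumed wording; the real datafusion.EXPECTED_PROVIDER_MSG may differ.
EXPECTED_PROVIDER_MSG = (
    "Expected a Table or an object exposing __datafusion_table_provider__()"
)


class FakeTable:
    """Stand-in for the high-level Table wrapper."""

    def __init__(self, source):
        self.source = source


class FakeCapsuleExporter:
    """Stand-in for a third-party object that exports a provider capsule."""

    def __datafusion_table_provider__(self):
        return "capsule"  # a real exporter would return a PyCapsule


class FakeDataFrame:
    """Stand-in for a DataFrame, which must be converted explicitly."""

    def into_view(self):
        return FakeTable(self)


def normalize_table_provider(obj):
    """Coerce a supported input into a FakeTable, or raise TypeError."""
    if isinstance(obj, FakeTable):
        return obj  # already normalized
    if isinstance(obj, FakeDataFrame):
        # Mirrors the described behaviour: DataFrames need explicit conversion.
        raise TypeError(EXPECTED_PROVIDER_MSG + "; call into_view() first")
    if hasattr(obj, "__datafusion_table_provider__"):
        return FakeTable(obj.__datafusion_table_provider__())
    raise TypeError(EXPECTED_PROVIDER_MSG)
```

Keeping the message in one module-level constant is what lets tests assert on it without copying the string.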


Are these changes tested?

Yes. This PR includes comprehensive test coverage:

  • Unit tests for new Table methods and error handling

  • Integration tests verifying:

    • Registering with Table.from_dataframe(), from_capsule(), and into_view()
    • Registering pyarrow.dataset.Dataset objects
    • Deprecated paths emit DeprecationWarning
    • SessionContext and Schema registration paths behave identically
    • Custom objects exporting __datafusion_table_provider__ can be used directly
    • Proper error messages are raised when passing invalid types (e.g., DataFrame without conversion)

Are there any user-facing changes?

✅ Additions

  • New public API:

    • datafusion.Table

      • Table.from_dataframe(df)
      • Table.from_capsule(capsule)
      • Table.from_dataset(dataset)
  • DataFrame.into_view() — recommended way to convert to a table provider

  • datafusion.EXPECTED_PROVIDER_MSG — stable constant for validation errors

  • Schema.register_table(...) now accepts all supported inputs (like SessionContext.register_table)

⚠️ Deprecations

  • SessionContext.register_table_provider(...) is deprecated

    • Emits a warning and forwards to register_table
  • Table.from_view() is deprecated

    • Emits a DeprecationWarning; use into_view() or from_dataframe() instead
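The "warn and forward" pattern used for these deprecations can be sketched with the standard `warnings` module. This is an illustration only: the `Registry` class is a hypothetical stand-in for the session context, and the warning text is assumed rather than copied from the PR.

```python
import warnings


class Registry:
    """Hypothetical stand-in for a session context that stores tables by name."""

    def __init__(self):
        self.tables = {}

    def register_table(self, name, table):
        """Preferred entry point."""
        self.tables[name] = table

    def register_table_provider(self, name, provider):
        """Deprecated alias kept for backwards compatibility."""
        warnings.warn(
            "register_table_provider is deprecated; use register_table instead",
            DeprecationWarning,
            stacklevel=2,  # attribute the warning to the caller, not this shim
        )
        self.register_table(name, provider)


# Usage: the old call still works but emits a DeprecationWarning.
registry = Registry()
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    registry.register_table_provider("t", object())
```

With `stacklevel=2` the reported file and line belong to the code that called the deprecated method, which is usually what a user needs in order to find and migrate the call site.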

📋 Documentation & Examples

  • Updated user guide examples to use Table and register_table
  • Clarified deprecation in docstrings and code examples
  • Added inline guidance on converting DataFrame objects

🔁 Compatibility

  • Fully backwards compatible

    • Deprecated methods still function and emit warnings
  • Existing table registration logic continues to work as expected

  • Encourages migration to the new Table API for consistency and future-proofing


Breaking changes?

No. This is a non-breaking refactor that preserves all existing behaviors through shims and deprecation paths. However, users relying on internal or undocumented APIs (e.g., raw table objects or bypassing coercion) may encounter changes.

docs/tests, add DataFrame view support, and improve Send/concurrency
support.

This change migrates the codebase from using `Table` to a
`TableProvider`-based API, refactors registration and access paths to
simplify catalog/context interactions, and updates documentation and
examples. DataFrame view handling is improved (`into_view` is now
public), the test suite is expanded to cover new registration and async
SQL scenarios, and `TableProvider` now supports the `Send` trait across
modules for safer concurrency. Minor import cleanup and utility
adjustments (including a refined `pyany_to_table_provider`) are
included.
DataFrame→TableProvider conversion, plus tests and FFI/pycapsule
improvements.

-- Registration logic & API

* Refactor of table provider registration logic for improved clarity and
  simpler call sites.
* Remove PyTableProvider registration from an internal module (reduces
  surprising side effects).
* Update table registration method to call `register_table` instead of
  `register_table_provider`.
* Extend `register_table` to support `TableProviderExportable` so more
  provider types can be registered uniformly.
* Improve error messages related to registration failures (missing
  PyCapsule name and DataFrame registration errors).

-- DataFrame ↔ TableProvider conversions

* Introduce utility functions to simplify table provider conversions and
  centralize conversion logic.
* Rename `into_view_provider` → `to_view_provider` for clearer intent.
* Fix `from_dataframe` to return the correct type and update
  `DataFrame.into_view` to import the correct `TableProvider`.
* Remove an obsolete `dataframe_into_view` test case after the refactor.

-- FFI / PyCapsule handling

* Update `FFI_TableProvider` initialization to accept an optional
  parameter (improves FFI ergonomics).
* Introduce `table_provider_from_pycapsule` utility to standardize
  pycapsule-based construction.
* Improve the error message when a PyCapsule name is missing to help
  debugging.

-- DeltaTable & specific integrations

* Update TableProvider registration for `DeltaTable` to use the correct
  registration method (matches the new API surface).

-- Tests, docs & minor fixes

* Add tests for registering a `TableProvider` from a `DataFrame` and
  from a capsule to ensure conversion paths are covered.
* Fix a typo in the `register_view` docstring and another typo in the
  error message for unsupported volatility type.
* Simplify version retrieval by removing exception handling around
  `PackageNotFoundError` (streamlines code path).
* Removed unused helpers (`extract_table_provider`, `_wrap`) and dead code to simplify maintenance.
* Consolidated and streamlined table-provider extraction and registration logic; improved error handling and replaced a hardcoded error message with `EXPECTED_PROVIDER_MSG`.
* Marked `from_view` as deprecated; updated deprecation message formatting and adjusted the warning `stacklevel` so it points to caller code.
* Removed the `Send` marker from TableProvider trait objects to increase type flexibility — review threading assumptions.
* Added type hints to `register_schema` and `deregister_table` methods.
* Adjusted tests and exceptions (e.g., changed one test to expect `RuntimeError`) and updated test coverage accordingly.
* Introduced a refactored `TableProvider` class and enhanced Python integration by adding support for extracting `PyDataFrame` in `PySchema`.

Notes:

* Consumers should migrate away from `TableProvider::from_view` to the new TableProvider integration.
* Audit any code relying on `Send` for trait objects passed across threads.
* Update downstream tests and documentation to reflect the changed exception types and deprecation.
utilities, docs, and robustness fixes

* Normalized table-provider handling and simplified registration flow
  across the codebase; multiple commits centralize provider coercion and
  normalization.
* Introduced utility helpers (`coerce_table_provider`,
  `extract_table_provider`, `_normalize_table_provider`) to centralize
  extraction, error handling, and improve clarity.
* Simplified `from_dataframe` / `into_view` behavior: clearer
  implementations, direct returns of DataFrame views where appropriate,
  and added internal tests for DataFrame flows.
* Fixed DataFrame registration semantics: enforce `TypeError` for
  invalid registrations; added handling for `DataFrameWrapper` by
  converting it to a view.
* Added tests, including a schema registration test using a PyArrow
  dataset and internal DataFrame tests to cover new flows.
* Documentation improvements: expanded `from_dataframe` docstrings with
  parameter details, added usage examples for `into_view`, and
  documented deprecations (e.g., `register_table_provider` →
  `register_table`).
* Warning and UX fixes: synchronized deprecation `stacklevel` so
  warnings point to caller code; improved `__dir__` to return sorted,
  unique attributes.
* Cleanup: removed unused imports (including an unused error import from
  `utils.rs`) and other dead code to reduce noise.
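The `__dir__` item in the list above can be illustrated with a delegating wrapper. This is a hypothetical sketch: `Inner` and `Wrapper` are invented for the example, and the PR's actual class and attribute names are not shown in this thread.

```python
class Inner:
    """Hypothetical wrapped object."""

    alpha = 1
    shared = 2


class Wrapper:
    """Hypothetical wrapper that forwards unknown attributes to Inner."""

    shared = 3
    beta = 4

    def __init__(self, inner):
        self._inner = inner

    def __getattr__(self, name):
        # Only called when normal lookup fails; guard against recursion
        # if _inner itself is missing, then delegate to the wrapped object.
        if name == "_inner":
            raise AttributeError(name)
        return getattr(self._inner, name)

    def __dir__(self):
        # Merge own and delegated names; the set removes duplicates
        # and sorted() gives a stable order.
        return sorted(set(super().__dir__()) | set(dir(self._inner)))
```

A name defined on both objects, such as `shared` here, appears once in the result, so tab completion and generated docs do not list it twice.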
@kosiew kosiew force-pushed the table-provider-1239 branch from c47b0f1 to ea2973c Compare September 18, 2025 09:47
@kosiew kosiew force-pushed the table-provider-1239 branch from ea2973c to 1872a7f Compare September 18, 2025 09:51
@kosiew kosiew marked this pull request as ready for review September 20, 2025 06:17
Member

@timsaucer timsaucer left a comment

This is an incredible start!

From a naming perspective I think it's more intuitive to just call these Table instead of TableProvider. I know we have a Table class in datafusion.catalog. It feels like this is a real opportunity to give the user an even more unified experience.

If we are going to be making big changes like this and deprecating some functions, then I really want to make sure we give an extremely pleasant end user experience.

Comment on lines 28 to 31
**Deprecations:**

- Document that `SessionContext.register_table_provider` is deprecated in favor of `SessionContext.register_table`.

Member

These changelogs are automatically generated, so I don't think we want to make changes here. Regardless, these would go into the 51.0.0 release.

Contributor Author

I will revert this change.

Comment on lines 100 to 103
# Skip private members that start with underscore to avoid duplication
if name.split(".")[-1].startswith("_") and what in ("data", "variable"):
    skip = True

Member

So I can understand better, why do we need both this rule and the one above in lines 86-88?

Contributor Author

  • The explicit skip_contents list handles targeted, known problem cases (re-exports, specific deprecated APIs, or particular items that cause duplication or confusion). It’s precise and intentional.
  • The private-name filter is a broad rule to remove many small implementation details (module-level private constants) without listing them all manually. This prevents the docs from listing every private variable.

I'll also add clarifying comments in autoapi_skip_member_fn.
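How the two rules could sit side by side in one handler is sketched below. This is an assumption-laden illustration rather than the project's `conf.py`: the entries in `SKIP_CONTENTS` and the `example_pkg` names are invented, and only the private-name condition is copied from the snippet under review. The parameter list follows the `autoapi-skip-member` event of sphinx-autoapi.

```python
# Illustrative entries only; the project's real list is not shown in this thread.
SKIP_CONTENTS = {
    ("class", "example_pkg.ReExportedClass"),
    ("method", "example_pkg.Thing.old_method"),
}


def autoapi_skip_member_fn(app, what, name, obj, skip, options):
    """Decide whether sphinx-autoapi should omit a member from the docs."""
    # Rule 1: explicit, targeted skips for known problem cases.
    if (what, name) in SKIP_CONTENTS:
        return True
    # Rule 2: broad filter for private module-level data
    # (condition copied from the snippet under review).
    if name.split(".")[-1].startswith("_") and what in ("data", "variable"):
        return True
    # Otherwise keep whatever autoapi had already decided.
    return skip
```

In a Sphinx `conf.py` such a handler would typically be attached inside `setup(app)` with `app.connect("autoapi-skip-member", autoapi_skip_member_fn)`. Rule 2 deliberately ignores private functions and classes, which is why it cannot replace the explicit list.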

Comment on lines 158 to 159
provider = TableProvider.from_capsule(delta_table.__datafusion_table_provider__())
ctx.register_table("my_delta_table", provider)
Member

This feels like a worse experience than before. Why can we not just call register_table("my_delta_table", delta_table)?

Comment on lines 71 to 82
capsule = provider.__datafusion_table_provider__()
capsule_provider = TableProvider.from_capsule(capsule)
df = ctx.from_pydict({"a": [1]})
view_provider = TableProvider.from_dataframe(df)
# or: view_provider = df.into_view()
ctx.register_table("capsule_table", capsule_provider)
ctx.register_table("view_table", view_provider)
ctx.table("capsule_table").show()
ctx.table("view_table").show()
Member

This example takes a bit of cognitive load to understand what we're demonstrating.

First off, similar to my comments above, I don't think we want our users to have to think about whether they're using something that comes from a PyCapsule interface. Suppose I am a library user and I get a delta table object that implements PyCapsule. As a user of that library, I shouldn't have to understand how the interfacing works. I should just be able to use it directly. So I want to be able to just pass those objects directly to TableProvider or register_table without having to think about or understand the mechanics behind the scenes.

Comment on lines 24 to 25
# isort: skip_file # Prevent import-sorting linter errors (I001)
# ruff: noqa: I001
Member

Is this ruff lint causing a problem?

Contributor Author

I'll remove them.

Comment on lines 321 to 322
This is the preferred way to obtain a view for
:py:meth:`~datafusion.context.SessionContext.register_table`.
Member

I don't understand this statement.

Contributor Author

Here are the reasons:
1. Direct API: Most efficient path - directly calls the underlying Rust
   DataFrame.into_view() method without intermediate delegations.
2. Clear semantics: The into_ prefix follows Rust conventions,
   indicating conversion from one type to another.
3. Canonical method: Other approaches like TableProvider.from_dataframe
   delegate to this method internally, making this the single source of truth.
4. Deprecated alternatives: The older TableProvider.from_view helper
   is deprecated and issues warnings when used.

I will add the above to the comment in def to_view too

>>> from datafusion import SessionContext
>>> ctx = SessionContext()
>>> df = ctx.sql("SELECT 1 AS value")
>>> provider = df.into_view()
Member

From an end user's perspective, they turn a dataframe into a view, which they then register so they can use it later. I don't think this end user needs to understand the concept of TableProvider at all. In the example I would change the variable name provider to view.

Contributor Author

This makes sense, given that we're moving away from 'provider'

@kosiew kosiew force-pushed the table-provider-1239 branch from 245e89f to 918b1ce Compare September 22, 2025 10:16
Contributor Author

kosiew commented Sep 23, 2025

From a naming perspective I think it's more intuitive to just call these Table instead of TableProvider. I know we have a Table class...

I removed the TableProvider class in Python.
Instead, catalog.Table is used (the input can be a Table or an object that implements __datafusion_table_provider__()).

@kosiew kosiew requested a review from timsaucer September 23, 2025 01:09
@kosiew kosiew changed the title from "Introduce TableProvider wrapper & unified register_table API; deprecate register_table_provider" to "Introduce Table wrapper, unify table registration via register_table; deprecate legacy APIs" Sep 23, 2025


Development

Successfully merging this pull request may close these issues.

A single common PyTableProvider that can be created either via a pycapsule or into_view

2 participants