Allow SessionContext.read_table to accept objects exposing __datafusion_table_provider__ (PyCapsule)
#1246
## Which issue does this PR close?

## Rationale for this change
`SessionContext.read_table` previously required a `datafusion.catalog.Table` (the Python `Table` wrapper) and forwarded its `.table` member into the Rust binding. That meant objects exposing a `__datafusion_table_provider__()` API returning a PyCapsule (a `TableProvider` exported via the FFI) could not be passed directly to `read_table`; they first had to be registered in the catalog. This added an unnecessary registration round-trip and prevented ergonomic use of PyCapsule-backed/custom table providers.

This PR makes `read_table` accept either a `datafusion.catalog.Table` or any Python object that implements `__datafusion_table_provider__()` and returns a properly validated PyCapsule. The change removes the need to register a provider just to obtain a `DataFrame` and unifies the behavior with other places that already accept PyCapsule-backed providers.

## What changes are included in this PR?
High-level summary of the changes applied across Python and Rust layers:
**Python documentation**

- `docs/source/user-guide/io/table_provider.rst`: document `SessionContext.read_table(provider)` usage.

**Python bindings**

- `python/datafusion/catalog.py`: add `Table.__datafusion_table_provider__` to expose the underlying PyCapsule from the Python `Table` wrapper so it can be treated as a TableProvider-exportable object by other Python code.
- `python/datafusion/context.py`: update the `SessionContext.read_table` typing and docstring to accept either `Table` or a `TableProviderExportable` object (an object implementing `__datafusion_table_provider__`). Both `Table` instances and provider objects are supported.

**Rust core**

- `src/utils.rs`: add `foreign_table_provider_from_capsule` and `try_table_provider_from_object` helpers to centralize validation and extraction of `FFI_TableProvider` from a PyCapsule and its conversion into an `Arc<dyn TableProvider>`.
- `src/catalog.rs`: use `try_table_provider_from_object` to detect and accept provider objects that expose `__datafusion_table_provider__` when registering tables into the catalog. Add `PyTable::__datafusion_table_provider__` so `Table` can export an `FFI_TableProvider` PyCapsule (this is what `python/catalog.py` calls through the Python layer). Update `register_table` and the schema provider lookup to prefer direct `PyTable` extraction, then `try_table_provider_from_object`, then fall back to constructing a `Dataset` as before.
- `src/context.rs`: update `PySessionContext::register_table` to accept PyCapsule-backed provider objects by using `try_table_provider_from_object`. Update `PySessionContext::read_table` to accept a generic `PyAny` bound and detect either `PyTable` (native, avoiding an FFI round-trip) or any object that exposes `__datafusion_table_provider__`. An error is returned if neither condition is met.
- `src/udtf.rs`: use `try_table_provider_from_object` when calling Python table functions, so UDTFs that return a provider object via `__datafusion_table_provider__` are accepted.

**Tests**

- `python/tests/test_catalog.py`: add `test_register_raw_table_without_capsule` to ensure raw `RawTable` objects can be registered (a monkeypatch ensures the capsule path is not invoked), queried, and deregistered.
- `python/tests/test_context.py`: add `test_read_table_accepts_table_provider` to verify that `ctx.read_table(provider)` works when `provider` is a PyCapsule-backed object, and that `ctx.read_table(table)` still works for regular `Table` objects. Also moved the `uuid4` import to module level where appropriate.
- Other smaller maintenance changes: imports reorganized and some helper functions added to centralize PyCapsule validation and conversion.
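The lookup preference described above (native `PyTable` first, then the capsule protocol, otherwise an error) can be sketched in plain Python. All class names and the dispatcher below are illustrative stand-ins, not the datafusion API:

```python
# Hypothetical sketch of the duck-typing the new read_table performs.
# FakeTable/FakeProvider/read_table_dispatch are illustrative names only.

class FakeCapsule:
    """Stand-in for the PyCapsule a real provider would return."""

class FakeProvider:
    """An object exporting a table provider via the dunder protocol."""
    def __datafusion_table_provider__(self):
        return FakeCapsule()

class FakeTable:
    """Stand-in for datafusion.catalog.Table (wraps a native table)."""
    def __init__(self):
        self.table = "native-table-handle"

def read_table_dispatch(obj):
    # Prefer the native Table wrapper to avoid an FFI round-trip.
    if isinstance(obj, FakeTable):
        return ("native", obj.table)
    # Otherwise accept any object exporting __datafusion_table_provider__.
    if hasattr(obj, "__datafusion_table_provider__"):
        return ("capsule", obj.__datafusion_table_provider__())
    raise TypeError(
        "expected a Table or an object implementing __datafusion_table_provider__"
    )

print(read_table_dispatch(FakeTable())[0])     # native
print(read_table_dispatch(FakeProvider())[0])  # capsule
```

Checking the native wrapper before the dunder protocol matters because `PyTable` also implements `__datafusion_table_provider__`; dispatching on the capsule first would force an avoidable FFI round-trip.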
## Are these changes tested?
Yes — new unit tests have been added to validate the new behavior and to guard against regressions:
- `test_read_table_accepts_table_provider` (in `python/tests/test_context.py`) exercises reading from a registered provider and from a provider object directly.
- `test_register_raw_table_without_capsule` (in `python/tests/test_catalog.py`) verifies that the raw table registration path does not trigger the capsule-based extraction and that queries against the registered table return the expected results.

Existing tests were left intact, and the new tests exercise both the Python- and Rust-side changes.
## Are there any user-facing changes?
Yes — API behavior and documentation are updated:
- `SessionContext.read_table` now accepts either a `datafusion.catalog.Table` or any object that implements `__datafusion_table_provider__()` and returns a `datafusion_table_provider` PyCapsule. Users can now call `ctx.read_table(provider)` on provider objects without registering them first.
- `docs/source/user-guide/io/table_provider.rst` now shows the direct-use pattern via `ctx.read_table(provider)`.

This is backwards-compatible: previously accepted inputs (the Python `Table` wrapper and `Dataset`-like objects) continue to work. No public API-breaking changes were made to function signatures on the Rust side; the changes are additive and focus on extending the accepted input types and centralizing the provider-extraction logic.
## Notes / Caveats
"datafusion_table_provider". Provider objects must implement__datafusion_table_provider__()that returns a PyCapsule with that name.PyTable(the native PythonTablewrapper) still exposes its provider via__datafusion_table_provider__(); however, the Rustread_tablepath prefers directPyTableusage to avoid unnecessary FFI round-trips when the object is already aRawTable.FFI_TableProvider::new(..., Some(runtime))call means the created FFI wrapper captures a Tokio runtime handle — ensure that embedding contexts keep compatible runtimes available.