Changes from 10 commits
23 commits
9b4f144
Migrate Table → TableProvider; refactor registration and access, update
kosiew Sep 15, 2025
3da3f93
Refactors and bug fixes around TableProvider registration and
kosiew Sep 16, 2025
512442b
TableProvider refactor & PyDataFrame integration
kosiew Sep 16, 2025
a8275dc
Normalize & simplify TableProvider/DataFrame registration; add
kosiew Sep 16, 2025
00bd445
refactor: update documentation for DataFrame to Table Provider conver…
kosiew Sep 18, 2025
6869919
refactor: replace to_view_provider with inner_df for DataFrame access
kosiew Sep 18, 2025
6e46d43
refactor: streamline TableProvider creation from DataFrame by consoli…
kosiew Sep 18, 2025
38af2b5
Merge branch 'main' into table-provider-1239
kosiew Sep 18, 2025
1872a7f
fix ruff errors
kosiew Sep 18, 2025
5948fb4
refactor: enhance autoapi_skip_member_fn to skip private variables an…
kosiew Sep 18, 2025
b9851d8
revert main 49.0.0 md
kosiew Sep 22, 2025
586c2cf
refactor: add comment in autoapi_skip_member_fn
kosiew Sep 22, 2025
d4ff136
refactor: remove isort and ruff comments to clean up import section
kosiew Sep 22, 2025
29203c6
docs: enhance docstring for DataFrame.into_view method to clarify usa…
kosiew Sep 22, 2025
ae8c1dd
docs: update example in DataFrame.into_view docstring for clarity
kosiew Sep 22, 2025
0c5eb17
docs: update example for registering Delta Lake tables to simplify usage
kosiew Sep 22, 2025
f9a3a22
docs: update table provider documentation for clarity and deprecate o…
kosiew Sep 22, 2025
f930181
docs: update documentation to reflect removal of TableProvider and us…
kosiew Sep 22, 2025
afc9b4e
remove TableProvider in Python, update missing_exports function, doc
kosiew Sep 22, 2025
918b1ce
Fix Ruff errors
kosiew Sep 22, 2025
93f0a31
Refactor test_table_loading to use Table instead of TableProvider
kosiew Sep 22, 2025
7bc303d
Refactor aggregate tests to simplify result assertions and improve re…
kosiew Sep 22, 2025
4429614
Add comments to clarify table normalization in aggregate tests
kosiew Sep 22, 2025
4 changes: 4 additions & 0 deletions dev/changelog/49.0.0.md
@@ -25,6 +25,10 @@ This release consists of 16 commits from 7 contributors. See credits at the end

- fix(build): Include build.rs in published crates [#1199](https://github.com/apache/datafusion-python/pull/1199) (colinmarc)

**Deprecations:**

- Document that `SessionContext.register_table_provider` is deprecated in favor of `SessionContext.register_table`.

Member

These changelogs are automatically generated, so I don't think we want to make changes here. Regardless, these would go into the 51.0.0 release.

Contributor Author

I will revert this change.

**Other:**

- 48.0.0 Release [#1175](https://github.com/apache/datafusion-python/pull/1175) (timsaucer)
7 changes: 7 additions & 0 deletions docs/source/conf.py
@@ -83,6 +83,9 @@ def autoapi_skip_member_fn(app, what, name, obj, skip, options) -> bool: # noqa
# Duplicate modules (skip module-level docs to avoid duplication)
("module", "datafusion.col"),
("module", "datafusion.udf"),
# Private variables causing duplicate documentation
("data", "datafusion.utils._PYARROW_DATASET_TYPES"),
("variable", "datafusion.utils._PYARROW_DATASET_TYPES"),
# Deprecated
("class", "datafusion.substrait.serde"),
("class", "datafusion.substrait.plan"),
@@ -94,6 +97,10 @@ def autoapi_skip_member_fn(app, what, name, obj, skip, options) -> bool: # noqa
if (what, name) in skip_contents:
skip = True

# Skip private members that start with underscore to avoid duplication
if name.split(".")[-1].startswith("_") and what in ("data", "variable"):
skip = True

Member

So I can understand better, why do we need both this rule and the one above in lines 86-88?

Contributor Author

  • The explicit skip_contents list handles targeted, known problem cases (re-exports, specific deprecated APIs, or particular items that cause duplication or confusion). It’s precise and intentional.
  • The private-name filter is a broad rule to remove many small implementation details (module-level private constants) without listing them all manually. This prevents the docs from listing every private variable.

I'll also add clarifying comments in autoapi_skip_member_fn
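
To make the interaction of the two rules concrete, here is a minimal standalone sketch. It is not the actual `conf.py`: `SKIP_CONTENTS` and `should_skip` are illustrative names, and the set is trimmed to a few entries visible in this diff.

```python
# Illustrative subset of the explicit skip list shown in the diff.
SKIP_CONTENTS = {
    ("module", "datafusion.col"),
    ("module", "datafusion.udf"),
    ("class", "datafusion.substrait.serde"),
}


def should_skip(what: str, name: str, skip: bool = False) -> bool:
    """Combine the targeted list with the broad private-name filter."""
    # Rule 1: precise, hand-maintained entries.
    if (what, name) in SKIP_CONTENTS:
        skip = True

    # Rule 2: any module-level data/variable whose last path
    # component starts with an underscore.
    if name.split(".")[-1].startswith("_") and what in ("data", "variable"):
        skip = True

    return skip
```

Under this sketch, the two explicit `_PYARROW_DATASET_TYPES` entries would already be covered by rule 2, which may be what the question above is getting at.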

return skip


2 changes: 1 addition & 1 deletion docs/source/contributor-guide/ffi.rst
@@ -34,7 +34,7 @@ as performant as possible and to utilize the features of DataFusion, you may dec
your source in Rust and then expose it through `PyO3 <https://pyo3.rs>`_ as a Python library.

At first glance, it may appear the best way to do this is to add the ``datafusion-python``
crate as a dependency, provide a ``PyTable``, and then to register it with the
crate as a dependency, produce a DataFusion table in Rust, and then register it with the
``SessionContext``. Unfortunately, this will not work.

When you produce your code as a Python library and it needs to interact with the DataFusion
13 changes: 11 additions & 2 deletions docs/source/user-guide/data-sources.rst
@@ -152,13 +152,22 @@ as Delta Lake. This will require a recent version of
.. code-block:: python

from deltalake import DeltaTable
from datafusion import TableProvider

delta_table = DeltaTable("path_to_table")
ctx.register_table_provider("my_delta_table", delta_table)
provider = TableProvider.from_capsule(delta_table.__datafusion_table_provider__())
ctx.register_table("my_delta_table", provider)
Member

This feels like a worse experience than before. Why can we not just call register_table("my_delta_table", delta_table)?
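
One way the direct call could be supported is plain duck typing on the `__datafusion_table_provider__` method. The sketch below is only an assumption about how such a check might look: it defines its own stand-in protocol (the diff imports a `TableProviderExportable` from `datafusion.context`, but its definition is not shown here), and `FakeDeltaTable` and `is_exportable` are hypothetical names.

```python
from typing import Any, Protocol, runtime_checkable


@runtime_checkable
class TableProviderExportable(Protocol):
    """Stand-in protocol: anything exposing the capsule method."""

    def __datafusion_table_provider__(self) -> Any: ...


class FakeDeltaTable:
    """Hypothetical third-party table object."""

    def __datafusion_table_provider__(self) -> object:
        # A real implementation would return a PyCapsule.
        return object()


def is_exportable(obj: Any) -> bool:
    """True if obj can be handed to a registration call as-is."""
    return isinstance(obj, TableProviderExportable)
```

With a check like this inside `register_table`, the caller would not need to fetch the capsule or wrap it manually.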

df = ctx.table("my_delta_table")
df.show()

On older versions of ``deltalake`` (prior to 0.22) you can use the
.. note::

:py:meth:`~datafusion.context.SessionContext.register_table_provider` is
deprecated. Use
:py:meth:`~datafusion.context.SessionContext.register_table` with a
:py:class:`~datafusion.TableProvider` instead.

On older versions of ``deltalake`` (prior to 0.22) you can use the
`Arrow DataSet <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html>`_
interface to import to DataFusion, but this does not support features such as filter push down
which can lead to a significant performance difference.
41 changes: 34 additions & 7 deletions docs/source/user-guide/io/table_provider.rst
@@ -39,20 +39,47 @@ A complete example can be found in the `examples folder <https://github.com/apac
) -> PyResult<Bound<'py, PyCapsule>> {
let name = CString::new("datafusion_table_provider").unwrap();

let provider = Arc::new(self.clone())
.map_err(|e| PyRuntimeError::new_err(e.to_string()))?;
let provider = FFI_TableProvider::new(Arc::new(provider), false);
let provider = Arc::new(self.clone());
let provider = FFI_TableProvider::new(provider, false, None);

PyCapsule::new_bound(py, provider, Some(name.clone()))
}
}

Once you have this library available, in python you can register your table provider
to the ``SessionContext``.
Once you have this library available, you can construct a
:py:class:`~datafusion.TableProvider` in Python and register it with the
``SessionContext``. Table providers can be created either from the PyCapsule exposed by
your Rust provider or from an existing :py:class:`~datafusion.dataframe.DataFrame`.
Call the provider's ``__datafusion_table_provider__()`` method to obtain the capsule
before constructing a ``TableProvider``. The ``TableProvider.from_view()`` helper is
deprecated; instead use ``TableProvider.from_dataframe()`` or ``DataFrame.into_view()``.

.. note::

:py:meth:`~datafusion.context.SessionContext.register_table_provider` is
deprecated. Use
:py:meth:`~datafusion.context.SessionContext.register_table` with the
resulting :py:class:`~datafusion.TableProvider` instead.

.. code-block:: python

from datafusion import SessionContext, TableProvider

ctx = SessionContext()
provider = MyTableProvider()
ctx.register_table_provider("my_table", provider)

ctx.table("my_table").show()
capsule = provider.__datafusion_table_provider__()
capsule_provider = TableProvider.from_capsule(capsule)

df = ctx.from_pydict({"a": [1]})
view_provider = TableProvider.from_dataframe(df)
# or: view_provider = df.into_view()

ctx.register_table("capsule_table", capsule_provider)
ctx.register_table("view_table", view_provider)

ctx.table("capsule_table").show()
ctx.table("view_table").show()
Comment on lines 71 to 82
Member

This example takes a bit of cognitive load to understand what we're demonstrating.

First off, similar to my comments above, I don't think we want our users to have to think about whether they're using something that comes from a PyCapsule interface or not. Suppose I am a library user and I get a delta table object that implements PyCapsule. As a user of that library, I shouldn't have to understand how the interfacing works. I should just be able to use it directly. So I want to be able to just pass those objects directly to TableProvider or register_table without having to think about or understand the mechanics behind the scenes.
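
A rough sketch of the kind of normalization step that would hide these mechanics from the user. This is not the `_normalize_table_provider` helper from the PR, whose body is not shown in this diff; the classes below are stand-ins and `normalize_table_provider` is a hypothetical name.

```python
from typing import Any


class Table:
    """Stand-in wrapper that holds a raw table object."""

    def __init__(self, raw: Any) -> None:
        self.table = raw


class DataFrame:
    """Stand-in DataFrame; not a table provider by itself."""


class CapsuleSource:
    """Stand-in for any object exposing the capsule method."""

    def __datafusion_table_provider__(self) -> str:
        return "capsule"


def normalize_table_provider(obj: Any) -> Any:
    """Return something a registration call could accept, or raise."""
    if isinstance(obj, Table):
        # Unwrap the Python-level wrapper.
        return obj.table
    if isinstance(obj, DataFrame):
        raise TypeError("DataFrame is not a table; call into_view() first")
    if hasattr(obj, "__datafusion_table_provider__"):
        # Pass capsule-exporting objects through untouched.
        return obj
    raise TypeError(f"unsupported table object: {type(obj).__name__}")
```

The point of the design is that the caller passes whatever object they have, and the decision about unwrapping or capsule handling lives in one place.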


Both ``TableProvider.from_capsule()`` and ``TableProvider.from_dataframe()`` create
table providers that can be registered with the SessionContext using ``register_table()``.
@@ -53,7 +53,7 @@ def test_ffi_table_function_call_directly():
table_udtf = udtf(table_func, "my_table_func")

my_table = table_udtf()
ctx.register_table_provider("t", my_table)
ctx.register_table("t", my_table)
result = ctx.table("t").collect()

assert len(result) == 2
@@ -18,14 +18,16 @@
from __future__ import annotations

import pyarrow as pa
from datafusion import SessionContext
from datafusion import SessionContext, TableProvider
from datafusion_ffi_example import MyTableProvider


def test_table_loading():
ctx = SessionContext()
table = MyTableProvider(3, 2, 4)
ctx.register_table_provider("t", table)
ctx.register_table(
"t", TableProvider.from_capsule(table.__datafusion_table_provider__())
)
result = ctx.table("t").collect()

assert len(result) == 4
20 changes: 11 additions & 9 deletions python/datafusion/__init__.py
@@ -21,24 +21,26 @@
See https://datafusion.apache.org/python for more information.
"""

# isort: skip_file # Prevent import-sorting linter errors (I001)
# ruff: noqa: I001
Member

Is this ruff lint causing a problem?

Contributor Author

I'll remove them.


from __future__ import annotations

from typing import Any

try:
import importlib.metadata as importlib_metadata
except ImportError:
import importlib_metadata
import importlib_metadata # type: ignore[import]

# Public submodules
from . import functions, object_store, substrait, unparser

# The following imports are okay to remain as opaque to the user.
from ._internal import Config
from ._internal import Config, EXPECTED_PROVIDER_MSG
from .catalog import Catalog, Database, Table
from .col import col, column
from .common import (
DFSchema,
)
from .common import DFSchema
from .context import (
RuntimeEnvBuilder,
SessionConfig,
@@ -47,13 +49,11 @@
)
from .dataframe import DataFrame, ParquetColumnOptions, ParquetWriterOptions
from .dataframe_formatter import configure_formatter
from .expr import (
Expr,
WindowFrame,
)
from .expr import Expr, WindowFrame
from .io import read_avro, read_csv, read_json, read_parquet
from .plan import ExecutionPlan, LogicalPlan
from .record_batch import RecordBatch, RecordBatchStream
from .table_provider import TableProvider
from .user_defined import (
Accumulator,
AggregateUDF,
@@ -69,6 +69,7 @@
__version__ = importlib_metadata.version(__name__)

__all__ = [
"EXPECTED_PROVIDER_MSG",
"Accumulator",
"AggregateUDF",
"Catalog",
@@ -90,6 +91,7 @@
"SessionContext",
"Table",
"TableFunction",
"TableProvider",
"WindowFrame",
"WindowUDF",
"catalog",
36 changes: 27 additions & 9 deletions python/datafusion/catalog.py
@@ -23,10 +23,14 @@
from typing import TYPE_CHECKING, Protocol

import datafusion._internal as df_internal
from datafusion.utils import _normalize_table_provider

if TYPE_CHECKING:
import pyarrow as pa

from datafusion import TableProvider
from datafusion.context import TableProviderExportable

try:
from warnings import deprecated # Python 3.13+
except ImportError:
@@ -82,7 +86,11 @@ def database(self, name: str = "public") -> Schema:
"""Returns the database with the given ``name`` from this catalog."""
return self.schema(name)

def register_schema(self, name, schema) -> Schema | None:
def register_schema(
self,
name: str,
schema: Schema | SchemaProvider | SchemaProviderExportable,
) -> Schema | None:
"""Register a schema with this catalog."""
if isinstance(schema, Schema):
return self.catalog.register_schema(name, schema._raw_schema)
@@ -122,11 +130,16 @@ def table(self, name: str) -> Table:
"""Return the table with the given ``name`` from this schema."""
return Table(self._raw_schema.table(name))

def register_table(self, name, table) -> None:
"""Register a table provider in this schema."""
if isinstance(table, Table):
return self._raw_schema.register_table(name, table.table)
return self._raw_schema.register_table(name, table)
def register_table(
self, name: str, table: Table | TableProvider | TableProviderExportable
) -> None:
"""Register a table or table provider in this schema.

Objects implementing ``__datafusion_table_provider__`` are also supported
and treated as :class:`TableProvider` instances.
"""
provider = _normalize_table_provider(table)
return self._raw_schema.register_table(name, provider)

def deregister_table(self, name: str) -> None:
"""Deregister a table provider from this schema."""
@@ -219,14 +232,19 @@ def table(self, name: str) -> Table | None:
"""Retrieve a specific table from this schema."""
...

def register_table(self, name: str, table: Table) -> None: # noqa: B027
"""Add a table from this schema.
def register_table( # noqa: B027
self, name: str, table: Table | TableProvider | TableProviderExportable
) -> None:
"""Add a table to this schema.

This method is optional. If your schema provides a fixed list of tables, you do
not need to implement this method.

Objects implementing ``__datafusion_table_provider__`` are also supported
and treated as :class:`TableProvider` instances.
"""

def deregister_table(self, name, cascade: bool) -> None: # noqa: B027
def deregister_table(self, name: str, cascade: bool) -> None: # noqa: B027
"""Remove a table from this schema.

This method is optional. If your schema provides a fixed list of tables, you do
55 changes: 42 additions & 13 deletions python/datafusion/context.py
@@ -29,11 +29,11 @@

import pyarrow as pa

from datafusion.catalog import Catalog, CatalogProvider, Table
from datafusion.catalog import Catalog
from datafusion.dataframe import DataFrame
from datafusion.expr import SortKey, sort_list_to_raw_sort_list
from datafusion.expr import sort_list_to_raw_sort_list
from datafusion.record_batch import RecordBatchStream
from datafusion.user_defined import AggregateUDF, ScalarUDF, TableFunction, WindowUDF
from datafusion.utils import _normalize_table_provider

from ._internal import RuntimeEnvBuilder as RuntimeEnvBuilderInternal
from ._internal import SessionConfig as SessionConfigInternal
@@ -48,7 +48,16 @@
import pandas as pd
import polars as pl # type: ignore[import]

from datafusion import TableProvider
from datafusion.catalog import CatalogProvider, Table
from datafusion.expr import SortKey
from datafusion.plan import ExecutionPlan, LogicalPlan
from datafusion.user_defined import (
AggregateUDF,
ScalarUDF,
TableFunction,
WindowUDF,
)


class ArrowStreamExportable(Protocol):
@@ -733,7 +742,7 @@ def from_polars(self, data: pl.DataFrame, name: str | None = None) -> DataFrame:
# https://github.com/apache/datafusion-python/pull/1016#discussion_r1983239116
# is the discussion on how we arrived at adding register_view
def register_view(self, name: str, df: DataFrame) -> None:
"""Register a :py:class: `~datafusion.detaframe.DataFrame` as a view.
"""Register a :py:class:`~datafusion.dataframe.DataFrame` as a view.

Args:
name (str): The name to register the view under.
@@ -742,16 +751,29 @@ def register_view(self, name: str, df: DataFrame) -> None:
view = df.into_view()
self.ctx.register_table(name, view)

def register_table(self, name: str, table: Table) -> None:
"""Register a :py:class: `~datafusion.catalog.Table` as a table.
def register_table(
self, name: str, table: Table | TableProvider | TableProviderExportable
) -> None:
"""Register a Table or TableProvider.

The registered table can be referenced from SQL statement executed against.
The registered table can be referenced from SQL statements executed against
this context.

Plain :py:class:`~datafusion.dataframe.DataFrame` objects are not supported;
convert them first with :meth:`datafusion.dataframe.DataFrame.into_view` or
:meth:`datafusion.TableProvider.from_dataframe`.

Objects implementing ``__datafusion_table_provider__`` are also supported
and treated as :py:class:`~datafusion.TableProvider` instances.

Args:
name: Name of the resultant table.
table: DataFusion table to add to the session context.
table: DataFusion :class:`Table`, :class:`TableProvider`, or any object
implementing ``__datafusion_table_provider__`` to add to the session
context.
"""
self.ctx.register_table(name, table.table)
provider = _normalize_table_provider(table)
self.ctx.register_table(name, provider)

def deregister_table(self, name: str) -> None:
"""Remove a table from the session."""
@@ -771,14 +793,21 @@ def register_catalog_provider(
self.ctx.register_catalog_provider(name, provider)

def register_table_provider(
self, name: str, provider: TableProviderExportable
self, name: str, provider: Table | TableProvider | TableProviderExportable
) -> None:
"""Register a table provider.

This table provider must have a method called ``__datafusion_table_provider__``
which returns a PyCapsule that exposes a ``FFI_TableProvider``.
Deprecated: use :meth:`register_table` instead.

Objects implementing ``__datafusion_table_provider__`` are also supported
and treated as :py:class:`~datafusion.TableProvider` instances.
"""
self.ctx.register_table_provider(name, provider)
warnings.warn(
"register_table_provider is deprecated; use register_table",
DeprecationWarning,
stacklevel=2,
)
self.register_table(name, provider)

def register_udtf(self, func: TableFunction) -> None:
"""Register a user defined table function."""
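
For reference, the deprecate-and-forward pattern used by `register_table_provider` in the hunk above can be exercised in isolation. The sketch below uses a hypothetical `Registry` class rather than `SessionContext`, so it runs without DataFusion installed; only the `warnings` calls mirror the diff.

```python
import warnings


class Registry:
    """Hypothetical stand-in for a context that stores tables by name."""

    def __init__(self) -> None:
        self.tables: dict[str, object] = {}

    def register_table(self, name: str, table: object) -> None:
        self.tables[name] = table

    def register_table_provider(self, name: str, provider: object) -> None:
        # stacklevel=2 attributes the warning to the caller's line.
        warnings.warn(
            "register_table_provider is deprecated; use register_table",
            DeprecationWarning,
            stacklevel=2,
        )
        self.register_table(name, provider)


# Capture warnings so the deprecation can be inspected.
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    reg = Registry()
    reg.register_table_provider("t", object())
```

The old entry point keeps working, and a test can assert on `caught` to confirm that the warning is emitted exactly once per call.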