-
Notifications
You must be signed in to change notification settings - Fork 128
Introduce Table wrapper, unify table registration via register_table; deprecate legacy APIs #1243
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Changes from 10 commits
9b4f144
3da3f93
512442b
a8275dc
00bd445
6869919
6e46d43
38af2b5
1872a7f
5948fb4
b9851d8
586c2cf
d4ff136
29203c6
ae8c1dd
0c5eb17
f9a3a22
f930181
afc9b4e
918b1ce
93f0a31
7bc303d
4429614
File filter
Filter by extension
Conversations
Jump to
Diff view
Diff view
There are no files selected for viewing
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -83,6 +83,9 @@ def autoapi_skip_member_fn(app, what, name, obj, skip, options) -> bool: # noqa | |
# Duplicate modules (skip module-level docs to avoid duplication) | ||
("module", "datafusion.col"), | ||
("module", "datafusion.udf"), | ||
# Private variables causing duplicate documentation | ||
("data", "datafusion.utils._PYARROW_DATASET_TYPES"), | ||
("variable", "datafusion.utils._PYARROW_DATASET_TYPES"), | ||
# Deprecated | ||
("class", "datafusion.substrait.serde"), | ||
("class", "datafusion.substrait.plan"), | ||
|
@@ -94,6 +97,10 @@ def autoapi_skip_member_fn(app, what, name, obj, skip, options) -> bool: # noqa | |
if (what, name) in skip_contents: | ||
skip = True | ||
|
||
# Skip private members that start with underscore to avoid duplication | ||
if name.split(".")[-1].startswith("_") and what in ("data", "variable"): | ||
skip = True | ||
|
||
|
||
return skip | ||
|
||
|
||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -152,13 +152,22 @@ as Delta Lake. This will require a recent version of | |
.. code-block:: python | ||
|
||
from deltalake import DeltaTable | ||
from datafusion import TableProvider | ||
|
||
delta_table = DeltaTable("path_to_table") | ||
ctx.register_table_provider("my_delta_table", delta_table) | ||
provider = TableProvider.from_capsule(delta_table.__datafusion_table_provider__()) | ||
ctx.register_table("my_delta_table", provider) | ||
|
||
df = ctx.table("my_delta_table") | ||
df.show() | ||
|
||
On older versions of ``deltalake`` (prior to 0.22) you can use the | ||
.. note:: | ||
|
||
:py:meth:`~datafusion.context.SessionContext.register_table_provider` is | ||
deprecated. Use | ||
:py:meth:`~datafusion.context.SessionContext.register_table` with a | ||
:py:class:`~datafusion.TableProvider` instead. | ||
|
||
On older versions of ``deltalake`` (prior to 0.22) you can use the | ||
`Arrow DataSet <https://arrow.apache.org/docs/python/generated/pyarrow.dataset.Dataset.html>`_ | ||
interface to import to DataFusion, but this does not support features such as filter push down | ||
which can lead to a significant performance difference. | ||
|
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -39,20 +39,47 @@ A complete example can be found in the `examples folder <https://github.com/apac | |
) -> PyResult<Bound<'py, PyCapsule>> { | ||
let name = CString::new("datafusion_table_provider").unwrap(); | ||
|
||
let provider = Arc::new(self.clone()) | ||
.map_err(|e| PyRuntimeError::new_err(e.to_string()))?; | ||
let provider = FFI_TableProvider::new(Arc::new(provider), false); | ||
let provider = Arc::new(self.clone()); | ||
let provider = FFI_TableProvider::new(provider, false, None); | ||
|
||
PyCapsule::new_bound(py, provider, Some(name.clone())) | ||
} | ||
} | ||
|
||
Once you have this library available, in python you can register your table provider | ||
to the ``SessionContext``. | ||
Once you have this library available, you can construct a | ||
:py:class:`~datafusion.TableProvider` in Python and register it with the | ||
``SessionContext``. Table providers can be created either from the PyCapsule exposed by | ||
your Rust provider or from an existing :py:class:`~datafusion.dataframe.DataFrame`. | ||
Call the provider's ``__datafusion_table_provider__()`` method to obtain the capsule | ||
before constructing a ``TableProvider``. The ``TableProvider.from_view()`` helper is | ||
deprecated; instead use ``TableProvider.from_dataframe()`` or ``DataFrame.into_view()``. | ||
|
||
.. note:: | ||
|
||
:py:meth:`~datafusion.context.SessionContext.register_table_provider` is | ||
deprecated. Use | ||
:py:meth:`~datafusion.context.SessionContext.register_table` with the | ||
resulting :py:class:`~datafusion.TableProvider` instead. | ||
|
||
.. code-block:: python | ||
|
||
from datafusion import SessionContext, TableProvider | ||
|
||
ctx = SessionContext() | ||
provider = MyTableProvider() | ||
ctx.register_table_provider("my_table", provider) | ||
|
||
ctx.table("my_table").show() | ||
capsule = provider.__datafusion_table_provider__() | ||
capsule_provider = TableProvider.from_capsule(capsule) | ||
|
||
df = ctx.from_pydict({"a": [1]}) | ||
view_provider = TableProvider.from_dataframe(df) | ||
# or: view_provider = df.into_view() | ||
|
||
ctx.register_table("capsule_table", capsule_provider) | ||
ctx.register_table("view_table", view_provider) | ||
|
||
ctx.table("capsule_table").show() | ||
ctx.table("view_table").show() | ||
Comment on lines
71
to
82
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. This example takes a bit of cognitive load to understand what we're demonstrating. First off, similar to my comments above I don't think we want our users to have to think about if they're using something that comes from a PyCapsule interface or not. Suppose I am a library user and I get a delta table object that implements PyCapsule. As a user of that library, I shouldn't have to understand how the interfacing works. I should just be able to use it directly. So I want to be able to just pass those objects directly to |
||
|
||
Both ``TableProvider.from_capsule()`` and ``TableProvider.from_dataframe()`` create | ||
table providers that can be registered with the SessionContext using ``register_table()``. |
Original file line number | Diff line number | Diff line change |
---|---|---|
|
@@ -21,24 +21,26 @@ | |
See https://datafusion.apache.org/python for more information. | ||
""" | ||
|
||
# isort: skip_file # Prevent import-sorting linter errors (I001) | ||
# ruff: noqa: I001 | ||
|
||
|
||
from __future__ import annotations | ||
|
||
from typing import Any | ||
|
||
try: | ||
import importlib.metadata as importlib_metadata | ||
except ImportError: | ||
import importlib_metadata | ||
import importlib_metadata # type: ignore[import] | ||
|
||
# Public submodules | ||
from . import functions, object_store, substrait, unparser | ||
|
||
# The following imports are okay to remain as opaque to the user. | ||
from ._internal import Config | ||
from ._internal import Config, EXPECTED_PROVIDER_MSG | ||
from .catalog import Catalog, Database, Table | ||
from .col import col, column | ||
from .common import ( | ||
DFSchema, | ||
) | ||
from .common import DFSchema | ||
from .context import ( | ||
RuntimeEnvBuilder, | ||
SessionConfig, | ||
|
@@ -47,13 +49,11 @@ | |
) | ||
from .dataframe import DataFrame, ParquetColumnOptions, ParquetWriterOptions | ||
from .dataframe_formatter import configure_formatter | ||
from .expr import ( | ||
Expr, | ||
WindowFrame, | ||
) | ||
from .expr import Expr, WindowFrame | ||
from .io import read_avro, read_csv, read_json, read_parquet | ||
from .plan import ExecutionPlan, LogicalPlan | ||
from .record_batch import RecordBatch, RecordBatchStream | ||
from .table_provider import TableProvider | ||
from .user_defined import ( | ||
Accumulator, | ||
AggregateUDF, | ||
|
@@ -69,6 +69,7 @@ | |
__version__ = importlib_metadata.version(__name__) | ||
|
||
__all__ = [ | ||
"EXPECTED_PROVIDER_MSG", | ||
"Accumulator", | ||
"AggregateUDF", | ||
"Catalog", | ||
|
@@ -90,6 +91,7 @@ | |
"SessionContext", | ||
"Table", | ||
"TableFunction", | ||
"TableProvider", | ||
"WindowFrame", | ||
"WindowUDF", | ||
"catalog", | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These changelogs are automatically generated, so I don't think we want to make changes here. Regardless, these would go into the 51.0.0 release.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will revert this change.