SessionContext: automatically register Python (Arrow/Pandas/Polars) objects referenced in SQL #1247
@@ -228,6 +228,43 @@ Core Classes
 * :py:meth:`~datafusion.SessionContext.from_pandas` - Create from Pandas DataFrame
 * :py:meth:`~datafusion.SessionContext.from_arrow` - Create from Arrow data
 
+``SessionContext`` can automatically resolve SQL table names that match
+in-scope Python data objects. When automatic lookup is enabled, a query
+such as ``ctx.sql("SELECT * FROM pdf")`` will register a pandas or
+PyArrow object named ``pdf`` without calling
+:py:meth:`~datafusion.SessionContext.from_pandas` or
+:py:meth:`~datafusion.SessionContext.from_arrow` explicitly. This uses
+the Arrow PyCapsule Interface, so the corresponding library (``pandas``
+for pandas objects, ``pyarrow`` for Arrow objects) must be installed.
+
+.. code-block:: python
+
+    import pandas as pd
+    import pyarrow as pa
+    from datafusion import SessionContext
+
+    ctx = SessionContext(auto_register_python_objects=True)

Review comment: This is a long parameter; what do we think about turning it on by default and/or choosing a shorter name?

Reply: Flipping this on by default would change long-standing failure modes: queries that currently raise “table not found” would start consulting the caller’s scope, which could mask mistakes or introduce non-deterministic behavior when multiple similarly named objects exist.

+
+    # pandas dataframe - requires pandas to be installed
+    pdf = pd.DataFrame({"value": [1, 2, 3]})
+
+    # or pyarrow object - requires pyarrow to be installed
+    arrow_table = pa.table({"value": [1, 2, 3]})
+
+    # If automatic registration is enabled, then we can query these objects directly
+    df = ctx.sql("SELECT SUM(value) AS total FROM pdf")
+    # or
+    df = ctx.sql("SELECT SUM(value) AS total FROM arrow_table")
+
+    # without calling ctx.from_pandas() or ctx.from_arrow() explicitly
+
+Automatic lookup is disabled by default. Enable it by passing
+``auto_register_python_objects=True`` when constructing the session or by
+configuring :py:class:`~datafusion.SessionConfig` with
+:py:meth:`~datafusion.SessionConfig.with_python_table_lookup`. Use
+:py:meth:`~datafusion.SessionContext.set_python_table_lookup` to toggle the
+behaviour at runtime.
+
 See: :py:class:`datafusion.SessionContext`
 
 Expression Classes
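For readers following the thread, here is a minimal sketch of what "consulting the caller's scope" can look like in plain Python. It is an illustration only, not the code in this PR: `find_in_caller_scope` and `run_query` are hypothetical names, and the sketch assumes a CPython-style interpreter where `inspect.currentframe()` returns a frame.

```python
import inspect


def find_in_caller_scope(name):
    """Return the object bound to `name` in the caller's scope, or None.

    Hypothetical helper for illustration; not part of datafusion.
    """
    frame = inspect.currentframe()
    if frame is None or frame.f_back is None:
        # Some interpreters do not expose stack frames.
        return None
    try:
        caller = frame.f_back
        # Locals shadow globals, mirroring normal Python name resolution.
        if name in caller.f_locals:
            return caller.f_locals[name]
        return caller.f_globals.get(name)
    finally:
        # Drop the frame reference to avoid creating a reference cycle.
        del frame


def run_query():
    # `pdf` only exists as a local variable of this function.
    pdf = {"value": [1, 2, 3]}
    found = find_in_caller_scope("pdf")
    not_found = find_in_caller_scope("missing")
    return found, not_found
```

A lookup like this depends on where the call is made from, which is the concern raised in the reply above: a name that would otherwise produce "table not found" can silently resolve to whatever object happens to share that name in the calling scope.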
Review comment: Is the registration temporary? Or after the query ends is `pdf` now still bound to the specific object?

Reply: Registrations persist: once a variable is bound we cache a weak reference plus its id in `_python_table_bindings`. On every subsequent SQL call we refresh that cache, dropping the registration if the object has been garbage collected, reassigned, or otherwise moved, but as long as the original object is still alive the table name remains usable across queries.
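
The caching behaviour described in the reply can be sketched with the standard library alone. This is a rough model under stated assumptions, not the PR's implementation: `BindingCache`, `Table`, and the `scope` dictionary are hypothetical stand-ins, and only the idea of storing a weak reference plus an `id` per name comes from the thread.

```python
import weakref


class Table:
    """Stand-in for a DataFrame; plain dicts and lists cannot be weakly referenced."""

    def __init__(self, rows):
        self.rows = rows


class BindingCache:
    """Hypothetical cache mapping a table name to (weak reference, object id)."""

    def __init__(self):
        self._bindings = {}

    def bind(self, name, obj):
        # A weak reference does not keep the object alive on its own.
        self._bindings[name] = (weakref.ref(obj), id(obj))

    def refresh(self, scope):
        """Drop bindings whose object died or whose name now points elsewhere."""
        for name in list(self._bindings):
            ref, obj_id = self._bindings[name]
            target = ref()  # None once the object has been garbage collected
            current = scope.get(name)
            if target is None or current is None or id(current) != obj_id:
                del self._bindings[name]

    def names(self):
        return sorted(self._bindings)


cache = BindingCache()
scope = {"pdf": Table([1, 2, 3])}
cache.bind("pdf", scope["pdf"])

cache.refresh(scope)
still_bound = cache.names()  # the original object is alive and unchanged

scope["pdf"] = Table([4, 5, 6])  # reassign the name to a new object
cache.refresh(scope)
after_reassign = cache.names()  # the stale binding has been dropped
```

In this model the `id` is only compared while the weak reference is still alive, so a recycled `id` from a collected object cannot be mistaken for the original.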