SessionContext: automatically register Python (Arrow/Pandas/Polars) objects referenced in SQL #1247
Conversation
the corresponding library (``pandas`` for pandas objects, ``pyarrow`` for
Arrow objects) to be installed.
For both, this should use the Arrow PyCapsule Interface, no? Then you don't need any specific dependency, because a pandas object can only exist in an environment that already has pandas installed.
This section of index.rst was trying to illustrate this:
pdf = ...  # pandas dataframe  <-- this requires pandas to be installed
# or
pdf = ...  # pyarrow object  <-- this requires pyarrow to be installed

# if automatic registration is enabled, then we can query pdf like this,
ctx.sql("SELECT * FROM pdf")
# without calling ctx.from_pandas or ctx.from_arrow
I will amend the section to convey this better.
``SessionContext`` can automatically resolve SQL table names that match
in-scope Python data objects. When automatic lookup is enabled, a query
such as ``ctx.sql("SELECT * FROM pdf")`` will register a pandas or …
Is the registration temporary? Or after the query ends, is `pdf` still bound to the specific object?
Registrations persist: once a variable is bound we cache a weak reference plus its id in `_python_table_bindings`. On every subsequent SQL call we refresh that cache, dropping the registration if the object has been garbage collected, reassigned, or otherwise moved; but as long as the original object is still alive, the table name remains usable across queries.
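For context, a minimal sketch of what that refresh pass could look like; only `_python_table_bindings` and `_lookup_python_object` are names from this PR, the rest is assumed:

import weakref

# Hedged sketch, not the PR's actual code: binding stores a weak reference
# plus the object's id; the refresh pass drops stale bindings before each
# query so dead or reassigned names stop resolving.
def _bind_python_table(self, name, obj):
    self._python_table_bindings[name] = (weakref.ref(obj), id(obj))

def _refresh_python_table_bindings(self):
    for name, (ref, obj_id) in list(self._python_table_bindings.items()):
        current = self._lookup_python_object(name)
        if ref() is None or current is None or id(current) != obj_id:
            # Collected, reassigned, or moved out of scope: drop the
            # cached binding and deregister the session table.
            self.deregister_table(name)
            del self._python_table_bindings[name]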
import pandas as pd
from datafusion import SessionContext

ctx = SessionContext(auto_register_python_objects=True)
This is a long parameter; what do we think about turning it on by default and/or choosing a shorter name?
Flipping this on by default would change long-standing failure modes: queries that currently raise "table not found" would start consulting the caller's scope, which could mask mistakes or introduce non-deterministic behavior when multiple similarly named objects exist.
Because the feature walks Python stack frames, leaving it opt-in keeps that overhead and surprise factor away from existing users. I'm open to amending the flag name later; I chose `auto_register_python_objects` to make the opt-in explicit, but we can follow up if we find a cleaner alias that still differentiates it from the existing `python_table_lookup` config switch.
if isinstance(obj, DataFrame):
    self.register_view(name, obj)
    registered = True
elif (
    obj.__class__.__module__.startswith("polars.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_polars(obj, name=name)
    registered = True
elif (
    obj.__class__.__module__.startswith("pandas.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_pandas(obj, name=name)
    registered = True
elif isinstance(obj, (pa.Table, pa.RecordBatch, pa.RecordBatchReader)) or (
    hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")
):
    self.from_arrow(obj, name=name)
    registered = True
IMO all of this should (or at least could) be replaced with
hasattr(obj, "__arrow_c_stream__")
to use the PyCapsule Interface. Unless we want to support old versions of Pandas and Polars?
Good point.
I will invert the if comparison to check for
hasattr(obj, "__arrow_c_stream__")
before falling back to the module-name checks, since there are older versions of pandas (and maybe Polars) that don't support `__arrow_c_stream__`.
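Roughly, the reordered dispatch would look like this (a sketch of the intended change, not the final diff):

# Sketch: prefer the Arrow PyCapsule Interface, then fall back to
# module-name sniffing for pandas/Polars versions that predate it.
if isinstance(obj, DataFrame):  # native DataFusion DataFrame
    self.register_view(name, obj)
elif hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__"):
    # Covers modern pandas, Polars, and PyArrow without any import checks.
    self.from_arrow(obj, name=name)
elif (
    obj.__class__.__module__.startswith("polars.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_polars(obj, name=name)  # fallback for older Polars
elif (
    obj.__class__.__module__.startswith("pandas.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_pandas(obj, name=name)  # fallback for older pandas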
This feels like a lot of work that is brittle to changes upstream, trying to intercept the error and provide an alternative. It feels like it would be simpler to provide an alternate schema provider. What do you think?
If we do switch to my suggestion, then we do need to make sure that users know that if they provide their own catalog provider then they will not get this behavior. Also, I think some custom catalog providers might break this current approach if their error handling doesn't match what you're expecting here.
Thanks for the suggestion. I prototyped a SchemaProvider hook, but it ran into a hard limitation: the auto-registration logic needs to inspect the caller's Python frames to discover in-scope variables, which we do today in `_lookup_python_object` by walking the active stack.
We mitigated the brittleness by moving the fragile string matching into a single helper that first looks for a `missing_table_names` attribute the Rust binding now attaches whenever DataFusion raises a missing-table error, so the retry loop only triggers when the engine itself told us which names were unresolved.
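A hedged sketch of that helper's shape (the function name and the fallback regex here are illustrative assumptions):

import re

# Sketch only: prefer the structured attribute attached by the Rust
# binding, and fall back to scraping the error message otherwise.
def _extract_missing_table_names(err: Exception) -> list[str]:
    names = getattr(err, "missing_table_names", None)
    if names:
        return list(names)
    # Fallback, e.g. for messages like "table 'foo' not found".
    return re.findall(r"table '([^']+)' not found", str(err))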
I recommend we put this PR on hold until we get resolution in issue #513 about which way we want to go - parameterized queries or auto injection. |
Based on the conversation in #513, I suggest we close this PR in favor of a method where the user specifies the objects as placeholders. I have a PR nearly ready for that.
Which issue does this PR close?
Rationale for this change
Users currently must explicitly register any in-memory Arrow/Pandas/Polars tables before running SQL. This makes quick exploratory workflows (where users create a DataFrame in Python and immediately query it with SQL) awkward because the user must call `from_pandas`/`from_arrow` or similar register helpers. The change implements an opt-in replacement-scan style lookup that inspects Python scopes to find variables whose names match missing table identifiers and automatically registers them if they expose Arrow-compatible data.

The behaviour is safe-by-default (disabled), but can be enabled either at session construction time or via `SessionConfig`. It improves ergonomics for REPL/notebook workflows while preserving existing semantics for applications that require explicit registration.

What changes are included in this PR?
Summary of functional changes:
Python API

- Added `SessionConfig.with_python_table_lookup(enabled: bool = True)` to configure default behaviour.
- The `SessionContext` constructor accepts `auto_register_python_objects: bool | None` to opt into automatic lookup at construction time. If omitted, it uses the `SessionConfig` setting (default `False`).
- Added `SessionContext.set_python_table_lookup(enabled: bool = True)` to toggle behaviour at runtime (the three entry points are combined in the sketch below).
- `SessionContext.sql(...)` will, when the feature is enabled, attempt to introspect missing table names from DataFusion errors, look up variables in the calling Python stack, and automatically register matching objects: DataFusion `DataFrame` views, Polars DataFrames, pandas DataFrames, and Arrow `Table`/`RecordBatch`/`RecordBatchReader` objects or objects exposing Arrow C data interfaces. Registration uses the existing `from_pandas`, `from_arrow`, `from_polars`, or `register_view` helpers.
- An internal binding cache (`_python_table_bindings`) detects reassignment or garbage collection of Python objects and refreshes/deregisters session tables appropriately.
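Taken together, these entry points can be combined roughly as follows (passing the config positionally to the constructor is an assumption):

from datafusion import SessionConfig, SessionContext

# Via the session config...
config = SessionConfig().with_python_table_lookup(True)
ctx = SessionContext(config)

# ...or directly at construction time...
ctx = SessionContext(auto_register_python_objects=True)

# ...or toggled later at runtime.
ctx.set_python_table_lookup(False)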
Error handling (Rust <-> Python bridge)

- The Rust binding attaches a `missing_table_names` attribute on the Python exception object when available. This enables robust detection of which table names caused planning failures.
- Added a parsing helper (`collect_missing_table_names`) in Rust to extract table names from common message formats, including nested `DataFusionError::Context`/`Diagnostic` errors.

Documentation
- Updated `docs/source/user-guide/dataframe/index.rst` and `docs/source/user-guide/sql.rst`, demonstrating usage with pandas/pyarrow and how to enable the feature.

Tests
Added many unit tests in `python/tests/test_context.py` covering:

- test_sql_missing_table_without_auto_register
- test_sql_missing_table_exposes_missing_table_names
- test_extract_missing_table_names_from_attribute
- test_sql_auto_register_arrow_table
- test_sql_auto_register_multiple_tables_single_query
- test_sql_auto_register_arrow_outer_scope
- test_sql_auto_register_skips_none_shadowing
- test_sql_auto_register_case_insensitive_lookup
- test_sql_auto_register_pandas_dataframe
- test_sql_auto_register_refreshes_reassigned_dataframe
- test_sql_auto_register_polars_dataframe
- test_sql_from_local_arrow_table
- test_sql_from_local_pandas_dataframe
- test_sql_from_local_polars_dataframe
- test_sql_from_local_unsupported_object
- test_session_config_python_table_lookup_enables_auto_registration
- test_sql_auto_register_arrow
- test_sql_auto_register_disabled

One of these is sketched below for illustration.
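A hedged reconstruction of what test_sql_auto_register_arrow_table might look like (the exact assertions are assumptions):

import pyarrow as pa
from datafusion import SessionContext

def test_sql_auto_register_arrow_table():
    ctx = SessionContext(auto_register_python_objects=True)
    tbl = pa.table({"a": [1, 2, 3]})  # local variable doubles as table name
    batches = ctx.sql("SELECT a FROM tbl").collect()
    assert sum(batch.num_rows for batch in batches) == 3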
Implementation notes / design decisions
- Opt-in by default: The feature is off unless the user either passes `auto_register_python_objects=True` to `SessionContext(...)` or calls `SessionConfig.with_python_table_lookup(True)` when creating the session config.
- Call-stack introspection: We walk Python frames (using `inspect`) to find variables that match missing table names. Lookup is case-insensitive and prefers exact name matches; it skips `None` shadowing to avoid registering unintentionally shadowed values. (A sketch follows below.)
- Caching & refresh: A `weakref` reference to the registered Python object and its `id()` are stored so we can detect reassignment or object collection and refresh session bindings when needed.
- Robust missing-table extraction: Because DataFusion error messages vary and the Python bindings may receive nested errors, we attempt to extract missing table names from a `missing_table_names` attribute (added by Rust) and fall back to regex-based extraction from the error message.
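A sketch of what that introspection can look like; `_lookup_python_object` is the PR's helper name, but the body here is an illustrative reconstruction:

import inspect

# Sketch only: walk outward from the caller's frame, preferring an exact
# name match in each scope before trying a case-insensitive one.
def _lookup_python_object(name: str):
    target = name.lower()
    frame = inspect.currentframe().f_back  # start from the caller
    while frame is not None:
        for scope in (frame.f_locals, frame.f_globals):
            if scope.get(name) is not None:
                return scope[name]  # exact match wins
            for var, value in scope.items():
                # Skip None so shadowed values are not registered.
                if var.lower() == target and value is not None:
                    return value
        frame = frame.f_back
    return None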
Are these changes tested?
Yes. Added tests in `python/tests/test_context.py` to exercise both the registration flow and the failure modes. The Rust-side changes are exercised indirectly via the Python tests, which assert the presence of `missing_table_names` in raised exceptions and the successful registration behaviour. If additional Rust unit tests are desired for the `collect_missing_table_names` parsing helper, they can be added (not included in this PR).

Are there any user-facing changes?
Yes. New optional behaviour that automatically registers Python objects referenced in SQL when enabled. This is an opt-in feature and is disabled by default.
New configuration options & methods:
- `SessionConfig.with_python_table_lookup(enabled: bool)`
- `SessionContext(auto_register_python_objects=...)`
- `SessionContext.set_python_table_lookup(enabled: bool)`
Documentation updated with examples demonstrating the feature.
Backwards compatibility
No breaking API changes to existing functions. Default behaviour is unchanged (feature disabled) so existing applications that rely on explicit registration will not be affected.
Example usage
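A minimal end-to-end sketch assembled from the snippets above:

import pandas as pd
from datafusion import SessionContext

ctx = SessionContext(auto_register_python_objects=True)

pdf = pd.DataFrame({"a": [1, 2, 3]})

# No explicit ctx.from_pandas(...) call: the unresolved table name "pdf"
# is looked up in the calling scope and registered automatically.
result = ctx.sql("SELECT a FROM pdf").collect()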