SessionContext: automatically register Python (Arrow/Pandas/Polars) objects referenced in SQL #1247
Conversation
    … the corresponding library (``pandas`` for pandas objects, ``pyarrow`` for
    Arrow objects) to be installed.
For both, this should use the Arrow PyCapsule Interface, no? Then you don't need any specific dependency because Pandas objects will by definition already have Pandas installed.
This section of index.rst was trying to illustrate this:

```python
pdf = ...  # pandas dataframe  <-- this requires pandas to be installed
# or
pdf = ...  # pyarrow object  <-- this requires pyarrow to be installed

# if automatic registration is enabled, then we can query pdf like this
ctx.sql("SELECT * FROM pdf")
# without calling ctx.from_pandas or ctx.from_arrow
```

I will amend the section to convey this better.
    ``SessionContext`` can automatically resolve SQL table names that match
    in-scope Python data objects. When automatic lookup is enabled, a query
    such as ``ctx.sql("SELECT * FROM pdf")`` will register a pandas or …
Is the registration temporary? Or after the query ends is pdf now still bound to the specific object?
Registrations persist: once a variable is bound we cache a weak reference plus its `id` in `_python_table_bindings`. On every subsequent SQL call we refresh that cache, dropping the registration if the object has been garbage collected, reassigned, or otherwise moved; as long as the original object is still alive, the table name remains usable across queries.
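Roughly, the refresh works like this (a minimal sketch with hypothetical helper names, not the exact code in this PR):

```python
import weakref

def _bind_python_table(ctx, name, obj):
    # Cache a weak reference plus the object's id at registration time.
    ctx._python_table_bindings[name] = (weakref.ref(obj), id(obj))

def _refresh_python_table_bindings(ctx, lookup):
    # Run before each SQL call; `lookup` re-scans the caller's frames
    # for a variable with the given name (frame walk not shown here).
    for name, (ref, obj_id) in list(ctx._python_table_bindings.items()):
        alive = ref() is not None
        current = lookup(name)
        if not alive or current is None or id(current) != obj_id:
            # Garbage collected, deleted, or rebound to a new object:
            # drop the stale table so this query re-resolves the name.
            ctx.deregister_table(name)
            del ctx._python_table_bindings[name]
```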
```python
import pandas as pd
from datafusion import SessionContext

ctx = SessionContext(auto_register_python_objects=True)
```
This is a long parameter; what do we think about turning it on by default and/or choosing a shorter name?
Flipping this on by default would change long-standing failure modes: queries that currently raise "table not found" would start consulting the caller's scope, which could mask mistakes or introduce non-deterministic behavior when multiple similarly named objects exist.
Because the feature walks Python stack frames, leaving it opt-in keeps that overhead and surprise factor away from existing users. I'm open to amending the flag name later; I chose `auto_register_python_objects` to make the opt-in explicit, but we can follow up if we find a cleaner alias that still differentiates it from the existing `python_table_lookup` config switch.
```python
if isinstance(obj, DataFrame):
    self.register_view(name, obj)
    registered = True
elif (
    obj.__class__.__module__.startswith("polars.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_polars(obj, name=name)
    registered = True
elif (
    obj.__class__.__module__.startswith("pandas.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_pandas(obj, name=name)
    registered = True
elif isinstance(obj, (pa.Table, pa.RecordBatch, pa.RecordBatchReader)) or (
    hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")
):
    self.from_arrow(obj, name=name)
    registered = True
```
IMO all of this should (or at least could) be replaced with `hasattr(obj, "__arrow_c_stream__")` to use the PyCapsule Interface. Unless we want to support old versions of Pandas and Polars?
Good point.
I will invert the if comparison to check for `hasattr(obj, "__arrow_c_stream__")` before falling back to checking for modules, as there are older versions of Pandas (and maybe Polars) that don't support `__arrow_c_stream__`.
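Sketched out, the inverted check would look something like this (illustrative only; the helper names follow the existing `from_*` methods):

```python
import pyarrow as pa
from datafusion import DataFrame

def _register_python_object(self, name, obj):
    # Prefer the Arrow PyCapsule Interface so any library that speaks the
    # Arrow C data interface works without a per-library dependency check.
    if isinstance(obj, DataFrame):
        self.register_view(name, obj)
    elif hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__"):
        self.from_arrow(obj, name=name)
    elif isinstance(obj, (pa.Table, pa.RecordBatch, pa.RecordBatchReader)):
        self.from_arrow(obj, name=name)
    elif (
        obj.__class__.__module__.startswith("polars.")
        and obj.__class__.__name__ == "DataFrame"
    ):
        # Fallback for older Polars versions without __arrow_c_stream__.
        self.from_polars(obj, name=name)
    elif (
        obj.__class__.__module__.startswith("pandas.")
        and obj.__class__.__name__ == "DataFrame"
    ):
        # Fallback for older pandas versions without __arrow_c_stream__.
        self.from_pandas(obj, name=name)
    else:
        raise TypeError(f"cannot auto-register object bound to {name!r}")
```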
This feels like a lot of work that is brittle to upstream changes, trying to intercept the error and provide an alternative. It feels like it would be simpler to provide an alternate catalog provider. What do you think?
If we do switch to my suggestion, then we do need to make sure that users know that if they provide their own catalog provider then they will not get this behavior. Also, I think some custom catalog providers might break this current approach if their error handling doesn't match what you're expecting here.
Thanks for the suggestion. I prototyped a `SchemaProvider` hook, but it ran into a hard limitation: the auto-registration logic needs to inspect the caller's Python frames to discover in-scope variables, which we do today in `_lookup_python_object` by walking the active stack.
We mitigated the brittleness by moving the fragile string matching into a single helper that first looks for a `missing_table_names` attribute, which the Rust binding now attaches whenever DataFusion raises a missing-table error, so the retry loop only triggers when the engine itself told us which names were unresolved.
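The helper's shape, roughly (a sketch: the regex is illustrative of the common DataFusion message format, not exhaustive):

```python
import re

# Matches messages like: Error during planning: table 'foo' not found
_MISSING_TABLE_RE = re.compile(r"table '([^']+)' not found", re.IGNORECASE)

def _extract_missing_table_names(err: Exception) -> list[str]:
    # Prefer the structured attribute attached by the Rust bindings.
    names = getattr(err, "missing_table_names", None)
    if names:
        return list(names)
    # Fall back to best-effort parsing of the error message.
    return _MISSING_TABLE_RE.findall(str(err))
```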
I recommend we put this PR on hold until we get resolution in issue #513 about which way we want to go - parameterized queries or auto injection. |
Based on the conversation in #513 I suggest we close this PR in favor of using a method where the user specifies the tables as placeholders. I have a PR nearly ready for that.
Agreed.
Which issue does this PR close?
Rationale for this change
Users currently must explicitly register any in-memory Arrow/Pandas/Polars tables before running SQL. This makes quick exploratory workflows (where users create a DataFrame in Python and immediately query it with SQL) awkward because the user must call `from_pandas`/`from_arrow` or similar register helpers. The change implements an opt-in replacement-scan style lookup that inspects Python scopes to find variables whose names match missing table identifiers and automatically registers them if they expose Arrow-compatible data.

The behaviour is safe-by-default (disabled), but can be enabled either at session construction time or via `SessionConfig`. It improves ergonomics for REPL/notebook workflows while preserving existing semantics for applications that require explicit registration.

What changes are included in this PR?
Summary of functional changes:

Python API

- `SessionConfig.with_python_table_lookup(enabled: bool = True)` to configure default behaviour.
- The `SessionContext` constructor accepts `auto_register_python_objects: bool | None` to opt into automatic lookup at construction time. If omitted, it uses the `SessionConfig` setting (default `False`).
- `SessionContext.set_python_table_lookup(enabled: bool = True)` to toggle behaviour at runtime.
- `SessionContext.sql(...)` will, when the feature is enabled, attempt to introspect missing table names from DataFusion errors, look up variables in the calling Python stack, and automatically register matching objects: DataFusion `DataFrame` views, Polars DataFrames, pandas DataFrames, and Arrow `Table`/`RecordBatch`/`RecordBatchReader` objects or objects exposing the Arrow C data interfaces. Registration uses the existing `from_pandas`, `from_arrow`, `from_polars`, or `register_view` helpers.
- An internal weak-reference cache (`_python_table_bindings`) to detect reassignment or garbage collection of Python objects and refresh/deregister session tables appropriately.

A sketch of these entry points follows this list.
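For orientation, the three entry points in one sketch (usage per the descriptions above; illustrative, not verbatim from the diff):

```python
from datafusion import SessionConfig, SessionContext

# 1. Opt in for one context at construction time.
ctx = SessionContext(auto_register_python_objects=True)

# 2. Opt in through the session config; the constructor falls back to
#    this setting when auto_register_python_objects is omitted.
config = SessionConfig().with_python_table_lookup(True)
ctx_from_config = SessionContext(config)

# 3. Toggle the behaviour at runtime on an existing context.
ctx_from_config.set_python_table_lookup(False)
```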
Error handling (Rust <-> Python bridge)

- The Rust bindings attach a `missing_table_names` attribute on the Python exception object when available. This enables robust detection of which table names caused planning failures.
- A parsing helper (`collect_missing_table_names`) in Rust extracts table names from common message formats, including nested `DataFusionError::Context`/`Diagnostic` errors.

Documentation
Updated `docs/source/user-guide/dataframe/index.rst` and `docs/source/user-guide/sql.rst` demonstrating usage with pandas/pyarrow and how to enable the feature.

Tests
Added many unit tests in `python/tests/test_context.py` covering:

- test_sql_missing_table_without_auto_register
- test_sql_missing_table_exposes_missing_table_names
- test_extract_missing_table_names_from_attribute
- test_sql_auto_register_arrow_table
- test_sql_auto_register_multiple_tables_single_query
- test_sql_auto_register_arrow_outer_scope
- test_sql_auto_register_skips_none_shadowing
- test_sql_auto_register_case_insensitive_lookup
- test_sql_auto_register_pandas_dataframe
- test_sql_auto_register_refreshes_reassigned_dataframe
- test_sql_auto_register_polars_dataframe
- test_sql_from_local_arrow_table
- test_sql_from_local_pandas_dataframe
- test_sql_from_local_polars_dataframe
- test_sql_from_local_unsupported_object
- test_session_config_python_table_lookup_enables_auto_registration
- test_sql_auto_register_arrow
- test_sql_auto_register_disabled

Implementation notes / design decisions
- Opt-in by default: The feature is off unless the user either passes `auto_register_python_objects=True` to `SessionContext(...)` or calls `SessionConfig.with_python_table_lookup(True)` when creating the session config.
- Call-stack introspection: We walk Python frames (using `inspect`) to find variables that match missing table names. Lookup is case-insensitive and prefers exact name matches; it skips `None` shadowing to avoid registering unintentionally shadowed values. (See the sketch after this list.)
- Caching & refresh: A `weakref` reference to the registered Python object and its `id()` are stored so we can detect reassignment or object collection and refresh session bindings when needed.
- Robust missing-table extraction: Because DataFusion error messages vary and the Python bindings may receive nested errors, we attempt to extract missing table names from a `missing_table_names` attribute (added by Rust) and fall back to regex-based extraction from the error message.
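A simplified sketch of that frame walk (illustrative, not the exact `_lookup_python_object` implementation):

```python
import inspect

def _lookup_python_object(name: str):
    """Search the calling stack for a variable matching `name`.

    Prefers an exact match, falls back to case-insensitive, and skips
    None values so shadowed names don't get registered by accident.
    """
    frame = inspect.currentframe().f_back
    try:
        while frame is not None:
            for scope in (frame.f_locals, frame.f_globals):
                # Prefer an exact match before the case-insensitive scan.
                if scope.get(name) is not None:
                    return scope[name]
                for var, value in scope.items():
                    if var.lower() == name.lower() and value is not None:
                        return value
            frame = frame.f_back
    finally:
        del frame  # avoid keeping frames alive via reference cycles
    return None
```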
Are these changes tested?

Tests were added in `python/tests/test_context.py` to exercise both the registration flow and the failure modes. The Rust-side changes are exercised indirectly via the Python tests, which assert the presence of `missing_table_names` in raised exceptions and the successful registration behaviour. If additional Rust unit tests are desired for the `collect_missing_table_names` parsing helper, they can be added (not included in this PR).

Are there any user-facing changes?
Yes. New optional behaviour that automatically registers Python objects referenced in SQL when enabled. This is an opt-in feature and is disabled by default.
New configuration options & methods:
- `SessionConfig.with_python_table_lookup(enabled: bool)`
- `SessionContext(auto_register_python_objects=...)`
- `SessionContext.set_python_table_lookup(enabled: bool)`

Documentation updated with examples demonstrating the feature.
Backwards compatibility
No breaking API changes to existing functions. Default behaviour is unchanged (feature disabled) so existing applications that rely on explicit registration will not be affected.
Example usage
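A minimal sketch, assuming pandas is installed:

```python
import pandas as pd
from datafusion import SessionContext

ctx = SessionContext(auto_register_python_objects=True)

df = pd.DataFrame({"a": [1, 2, 3], "b": [10.0, 20.0, 30.0]})

# No explicit ctx.from_pandas(df, name="df") call is needed: the missing
# table name "df" is resolved against in-scope Python variables and the
# DataFrame is registered automatically on first use.
result = ctx.sql("SELECT a, b FROM df WHERE a > 1")
print(result.to_pandas())
```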