SessionContext: automatically register Python (Arrow/Pandas/Polars) objects referenced in SQL #1247
Conversation
the corresponding library (``pandas`` for pandas objects, ``pyarrow`` for
Arrow objects) to be installed.
For both, this should use the Arrow PyCapsule Interface, no? Then you don't need any specific dependency, because a pandas object can only exist in an environment that already has pandas installed.
This section of index.rst was trying to illustrate this:
pdf = ...  # pandas dataframe  <-- this requires pandas to be installed
# or
pdf = ...  # pyarrow object  <-- this requires pyarrow to be installed

# if automatic registration is enabled, then we can query pdf like this,
ctx.sql("SELECT * FROM pdf")
# without calling ctx.from_pandas or ctx.from_arrow
I will amend the section to convey this better.
``SessionContext`` can automatically resolve SQL table names that match
in-scope Python data objects. When automatic lookup is enabled, a query
such as ``ctx.sql("SELECT * FROM pdf")`` will register a pandas or …
Is the registration temporary? Or after the query ends, is `pdf` still bound to the specific object?
Registrations persist: once a variable is bound we cache a weak reference plus its id in `_python_table_bindings`. On every subsequent SQL call we refresh that cache, dropping the registration if the object has been garbage collected, reassigned, or otherwise moved; but as long as the original object is still alive, the table name remains usable across queries.
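For context, a minimal sketch of what that refresh pass could look like; only `_python_table_bindings` and `_lookup_python_object` are names from this PR, the rest is assumed:

import weakref

# Hedged sketch, not the PR's actual code: binding stores a weak reference
# plus the object's id; the refresh pass drops stale bindings before each
# query so dead or reassigned names stop resolving.
def _bind_python_table(self, name, obj):
    self._python_table_bindings[name] = (weakref.ref(obj), id(obj))

def _refresh_python_table_bindings(self):
    for name, (ref, obj_id) in list(self._python_table_bindings.items()):
        current = self._lookup_python_object(name)
        if ref() is None or current is None or id(current) != obj_id:
            # Collected, reassigned, or moved out of scope: drop the
            # cached binding and deregister the session table.
            self.deregister_table(name)
            del self._python_table_bindings[name]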
import pandas as pd
from datafusion import SessionContext

ctx = SessionContext(auto_register_python_objects=True)
This is a long parameter; what do we think about turning it on by default and/or choosing a shorter name?
Flipping this on by default would change long-standing failure modes: queries that currently raise "table not found" would start consulting the caller's scope, which could mask mistakes or introduce non-deterministic behavior when multiple similarly named objects exist.
Because the feature walks Python stack frames, leaving it opt-in keeps that overhead and surprise factor away from existing users. I'm open to amending the flag name later; I chose `auto_register_python_objects` to make the opt-in explicit, but we can follow up if we find a cleaner alias that still differentiates it from the existing `python_table_lookup` config switch.
if isinstance(obj, DataFrame):
    self.register_view(name, obj)
    registered = True
elif (
    obj.__class__.__module__.startswith("polars.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_polars(obj, name=name)
    registered = True
elif (
    obj.__class__.__module__.startswith("pandas.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_pandas(obj, name=name)
    registered = True
elif isinstance(obj, (pa.Table, pa.RecordBatch, pa.RecordBatchReader)) or (
    hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")
):
    self.from_arrow(obj, name=name)
    registered = True
IMO all of this should (or at least could) be replaced with
hasattr(obj, "__arrow_c_stream__")
to use the PyCapsule Interface. Unless we want to support old versions of Pandas and Polars?
Good point.
I will invert the if comparison to check for
hasattr(obj, "__arrow_c_stream__")
before falling back to the module-name checks, since there are older versions of pandas (and maybe Polars) that don't support `__arrow_c_stream__`.
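Roughly, the reordered dispatch would look like this (a sketch of the intended change, not the final diff):

# Sketch: prefer the Arrow PyCapsule Interface, then fall back to
# module-name sniffing for pandas/Polars versions that predate it.
if isinstance(obj, DataFrame):  # native DataFusion DataFrame
    self.register_view(name, obj)
elif hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__"):
    # Covers modern pandas, Polars, and PyArrow without any import checks.
    self.from_arrow(obj, name=name)
elif (
    obj.__class__.__module__.startswith("polars.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_polars(obj, name=name)  # fallback for older Polars
elif (
    obj.__class__.__module__.startswith("pandas.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_pandas(obj, name=name)  # fallback for older pandas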
This feels like a lot of work that is brittle to changes upstream, trying to intercept the error and provide an alternative. It feels like it would be simpler to provide an alternate schema provider. What do you think?
If we do switch to my suggestion, then we do need to make sure that users know that if they provide their own catalog provider then they will not get this behavior. Also, I think some custom catalog providers might break this current approach if their error handling doesn't match what you're expecting here.
Thanks for the suggestion. I prototyped a SchemaProvider hook, but it ran into a hard limitation: the auto-registration logic needs to inspect the caller's Python frames to discover in-scope variables, which we do today in `_lookup_python_object` by walking the active stack.
We mitigated the brittleness by moving the fragile string matching into a single helper that first looks for a `missing_table_names` attribute the Rust binding now attaches whenever DataFusion raises a missing-table error, so the retry loop only triggers when the engine itself told us which names were unresolved.
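A hedged sketch of that helper's shape (the function name and the fallback regex here are illustrative assumptions):

import re

# Sketch only: prefer the structured attribute attached by the Rust
# binding, and fall back to scraping the error message otherwise.
def _extract_missing_table_names(err: Exception) -> list[str]:
    names = getattr(err, "missing_table_names", None)
    if names:
        return list(names)
    # Fallback, e.g. for messages like "table 'foo' not found".
    return re.findall(r"table '([^']+)' not found", str(err))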
I recommend we put this PR on hold until we get resolution in issue #513 about which way we want to go - parameterized queries or auto injection. |
Based on the conversation in #513, I suggest we close this PR in favor of a method where the user specifies the objects as placeholders. I have a PR nearly ready for that.
Which issue does this PR close?
Rationale for this change
Users currently must explicitly register any in-memory Arrow/Pandas/Polars tables before running SQL. This makes quick exploratory workflows (where users create a DataFrame in Python and immediately query it with SQL) awkward because the user must call `from_pandas`/`from_arrow` or similar register helpers. The change implements an opt-in replacement-scan style lookup that inspects Python scopes to find variables whose names match missing table identifiers and automatically registers them if they expose Arrow-compatible data.

The behaviour is safe-by-default (disabled), but can be enabled either at session construction time or via `SessionConfig`. It improves ergonomics for REPL/notebook workflows while preserving existing semantics for applications that require explicit registration.

What changes are included in this PR?
Summary of functional changes:
Python API

- Added `SessionConfig.with_python_table_lookup(enabled: bool = True)` to configure default behaviour.
- The `SessionContext` constructor accepts `auto_register_python_objects: bool | None` to opt into automatic lookup at construction time. If omitted, it uses the `SessionConfig` setting (default `False`).
- Added `SessionContext.set_python_table_lookup(enabled: bool = True)` to toggle behaviour at runtime (the three entry points are combined in the sketch below).
- `SessionContext.sql(...)` will, when the feature is enabled, attempt to introspect missing table names from DataFusion errors, look up variables in the calling Python stack, and automatically register matching objects: DataFusion `DataFrame` views, Polars DataFrames, pandas DataFrames, and Arrow `Table`/`RecordBatch`/`RecordBatchReader` objects or objects exposing Arrow C data interfaces. Registration uses the existing `from_pandas`, `from_arrow`, `from_polars`, or `register_view` helpers.
- An internal binding cache (`_python_table_bindings`) detects reassignment or garbage collection of Python objects and refreshes/deregisters session tables appropriately.
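Taken together, these entry points can be combined roughly as follows (passing the config positionally to the constructor is an assumption):

from datafusion import SessionConfig, SessionContext

# Via the session config...
config = SessionConfig().with_python_table_lookup(True)
ctx = SessionContext(config)

# ...or directly at construction time...
ctx = SessionContext(auto_register_python_objects=True)

# ...or toggled later at runtime.
ctx.set_python_table_lookup(False)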
Error handling (Rust <-> Python bridge)

- The Rust binding attaches a `missing_table_names` attribute on the Python exception object when available. This enables robust detection of which table names caused planning failures.
- Added a parsing helper (`collect_missing_table_names`) in Rust to extract table names from common message formats, including nested `DataFusionError::Context`/`Diagnostic` errors.

Documentation
- Updated `docs/source/user-guide/dataframe/index.rst` and `docs/source/user-guide/sql.rst`, demonstrating usage with pandas/pyarrow and how to enable the feature.

Tests
Added many unit tests in `python/tests/test_context.py` covering:

- test_sql_missing_table_without_auto_register
- test_sql_missing_table_exposes_missing_table_names
- test_extract_missing_table_names_from_attribute
- test_sql_auto_register_arrow_table
- test_sql_auto_register_multiple_tables_single_query
- test_sql_auto_register_arrow_outer_scope
- test_sql_auto_register_skips_none_shadowing
- test_sql_auto_register_case_insensitive_lookup
- test_sql_auto_register_pandas_dataframe
- test_sql_auto_register_refreshes_reassigned_dataframe
- test_sql_auto_register_polars_dataframe
- test_sql_from_local_arrow_table
- test_sql_from_local_pandas_dataframe
- test_sql_from_local_polars_dataframe
- test_sql_from_local_unsupported_object
- test_session_config_python_table_lookup_enables_auto_registration
- test_sql_auto_register_arrow
- test_sql_auto_register_disabled

One of these is sketched below for illustration.
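A hedged reconstruction of what test_sql_auto_register_arrow_table might look like (the exact assertions are assumptions):

import pyarrow as pa
from datafusion import SessionContext

def test_sql_auto_register_arrow_table():
    ctx = SessionContext(auto_register_python_objects=True)
    tbl = pa.table({"a": [1, 2, 3]})  # local variable doubles as table name
    batches = ctx.sql("SELECT a FROM tbl").collect()
    assert sum(batch.num_rows for batch in batches) == 3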
Implementation notes / design decisions
- Opt-in by default: The feature is off unless the user either passes `auto_register_python_objects=True` to `SessionContext(...)` or calls `SessionConfig.with_python_table_lookup(True)` when creating the session config.
- Call-stack introspection: We walk Python frames (using `inspect`) to find variables that match missing table names. Lookup is case-insensitive and prefers exact name matches; it skips `None` shadowing to avoid registering unintentionally shadowed values. (A sketch follows below.)
- Caching & refresh: A `weakref` reference to the registered Python object and its `id()` are stored so we can detect reassignment or object collection and refresh session bindings when needed.
- Robust missing-table extraction: Because DataFusion error messages vary and the Python bindings may receive nested errors, we attempt to extract missing table names from a `missing_table_names` attribute (added by Rust) and fall back to regex-based extraction from the error message.
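A sketch of what that introspection can look like; `_lookup_python_object` is the PR's helper name, but the body here is an illustrative reconstruction:

import inspect

# Sketch only: walk outward from the caller's frame, preferring an exact
# name match in each scope before trying a case-insensitive one.
def _lookup_python_object(name: str):
    target = name.lower()
    frame = inspect.currentframe().f_back  # start from the caller
    while frame is not None:
        for scope in (frame.f_locals, frame.f_globals):
            if scope.get(name) is not None:
                return scope[name]  # exact match wins
            for var, value in scope.items():
                # Skip None so shadowed values are not registered.
                if var.lower() == target and value is not None:
                    return value
        frame = frame.f_back
    return None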
Are these changes tested?
Yes. Added tests in `python/tests/test_context.py` to exercise both the registration flow and the failure modes. The Rust-side changes are exercised indirectly via the Python tests, which assert the presence of `missing_table_names` in raised exceptions and the successful registration behaviour. If additional Rust unit tests are desired for the `collect_missing_table_names` parsing helper, they can be added (not included in this PR).

Are there any user-facing changes?
Yes. New optional behaviour that automatically registers Python objects referenced in SQL when enabled. This is an opt-in feature and is disabled by default.
New configuration options & methods:
- `SessionConfig.with_python_table_lookup(enabled: bool)`
- `SessionContext(auto_register_python_objects=...)`
- `SessionContext.set_python_table_lookup(enabled: bool)`
Documentation updated with examples demonstrating the feature.
Backwards compatibility
No breaking API changes to existing functions. Default behaviour is unchanged (feature disabled) so existing applications that rely on explicit registration will not be affected.
Example usage
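A minimal end-to-end sketch assembled from the snippets above:

import pandas as pd
from datafusion import SessionContext

ctx = SessionContext(auto_register_python_objects=True)

pdf = pd.DataFrame({"a": [1, 2, 3]})

# No explicit ctx.from_pandas(...) call: the unresolved table name "pdf"
# is looked up in the calling scope and registered automatically.
result = ctx.sql("SELECT a FROM pdf").collect()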