Conversation

kosiew (Contributor) commented Sep 21, 2025

Which issue does this PR close?

Rationale for this change

Users currently must explicitly register any in-memory Arrow/pandas/Polars tables before running SQL. This makes quick exploratory workflows, where a user creates a DataFrame in Python and immediately queries it with SQL, awkward: the user must first call from_pandas, from_arrow, or a similar register helper. This change implements an opt-in, replacement-scan-style lookup that inspects Python scopes for variables whose names match missing table identifiers and automatically registers any that expose Arrow-compatible data.

The behaviour is safe-by-default (disabled), but can be enabled either at session construction time or via SessionConfig. It improves ergonomics for REPL/notebook workflows while preserving existing semantics for applications that require explicit registration.

What changes are included in this PR?

Summary of functional changes:

  • Python API

    • Added SessionConfig.with_python_table_lookup(enabled: bool = True) to configure default behaviour.
    • SessionContext constructor accepts auto_register_python_objects: bool | None to opt into automatic lookup at construction time. If omitted, it uses the SessionConfig setting (default False).
    • Added SessionContext.set_python_table_lookup(enabled: bool = True) to toggle behaviour at runtime.
    • SessionContext.sql(...) will, when the feature is enabled, introspect missing table names from DataFusion errors, look up matching variables in the calling Python stack, and automatically register them. Supported objects are DataFusion DataFrame views, Polars DataFrames, pandas DataFrames, and Arrow Table / RecordBatch / RecordBatchReader instances or any object exposing the Arrow C data interfaces. Registration uses the existing from_pandas, from_arrow, from_polars, or register_view helpers.
    • Implemented weakref-based bindings cache (_python_table_bindings) to detect reassignment or garbage collection of Python objects and refresh/deregister session tables appropriately.
  • Error handling (Rust <-> Python bridge)

    • Enhanced the Rust wrapper so DataFusion errors that indicate missing tables have a missing_table_names attribute on the Python exception object when available. This enables robust detection of which table names caused planning failures.
    • Implemented a more robust parser (collect_missing_table_names) in Rust to extract table names from common message formats, including nested DataFusionError::Context / Diagnostic errors.
  • Documentation

    • Added documentation and examples for automatic variable registration to docs/source/user-guide/dataframe/index.rst and docs/source/user-guide/sql.rst demonstrating usage with pandas/pyarrow and how to enable the feature.
  • Tests

    • Added many unit tests in python/tests/test_context.py covering:

      • test_sql_missing_table_without_auto_register
      • test_sql_missing_table_exposes_missing_table_names
      • test_extract_missing_table_names_from_attribute
      • test_sql_auto_register_arrow_table
      • test_sql_auto_register_multiple_tables_single_query
      • test_sql_auto_register_arrow_outer_scope
      • test_sql_auto_register_skips_none_shadowing
      • test_sql_auto_register_case_insensitive_lookup
      • test_sql_auto_register_pandas_dataframe
      • test_sql_auto_register_refreshes_reassigned_dataframe
      • test_sql_auto_register_polars_dataframe
      • test_sql_from_local_arrow_table
      • test_sql_from_local_pandas_dataframe
      • test_sql_from_local_polars_dataframe
      • test_sql_from_local_unsupported_object
      • test_session_config_python_table_lookup_enables_auto_registration
      • test_sql_auto_register_arrow
      • test_sql_auto_register_disabled

Implementation notes / design decisions

  • Opt-in by default: The feature is off unless the user either passes auto_register_python_objects=True to SessionContext(...) or calls SessionConfig.with_python_table_lookup(True) when creating the session config.

  • Call-stack introspection: We walk Python frames (using inspect) to find variables that match missing table names. Lookup is case-insensitive and prefers exact name matches; it skips None shadowing to avoid registering unintentionally shadowed values.

  • Caching & refresh: A weakref reference to the registered Python object and its id() are stored so we can detect reassignment or object collection and refresh session bindings when needed.

  • Robust missing-table extraction: Because DataFusion error messages vary and the Python bindings may receive nested errors, we attempt to extract missing table names from a missing_table_names attribute (added by Rust) and fall back to regex-based extraction from the error message.
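The call-stack introspection described above can be sketched in plain Python. This is an illustrative standalone helper, not the PR's actual _lookup_python_object implementation; the function name and exact precedence rules are assumptions, but it shows the frame walk, the preference for exact name matches, and the None-shadowing skip.

```python
import inspect

_SENTINEL = object()


def lookup_python_object(name: str):
    """Search the calling stack for a variable matching ``name``.

    Exact matches win over case-insensitive ones, and values bound to
    None are skipped so shadowed names are not registered by accident.
    """
    target = name.lower()
    fallback = _SENTINEL
    frame = inspect.currentframe().f_back  # start at the caller's frame
    while frame is not None:
        for scope in (frame.f_locals, frame.f_globals):
            value = scope.get(name)
            if value is not None:
                return value  # exact name match wins immediately
            if fallback is _SENTINEL:
                for var, val in scope.items():
                    if var.lower() == target and val is not None:
                        fallback = val  # remember first case-insensitive hit
                        break
        frame = frame.f_back
    return None if fallback is _SENTINEL else fallback
```

A query for table "myframe" would thus find a local named MyFrame, while a local `pdf = None` would not mask a real object further up the stack.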

Are these changes tested?

  • Yes — multiple unit tests were added to python/tests/test_context.py to exercise both the registration flow and the failure modes. The Rust side changes are exercised indirectly via the Python tests which assert the presence of missing_table_names in raised exceptions and the successful registration behaviour.

If additional Rust unit tests are desired for the collect_missing_table_names parsing helper, they can be added (not included in this PR).

Are there any user-facing changes?

  • Yes. New optional behaviour that automatically registers Python objects referenced in SQL when enabled. This is an opt-in feature and is disabled by default.

  • New configuration options & methods:

    • SessionConfig.with_python_table_lookup(enabled: bool)
    • SessionContext(auto_register_python_objects=...)
    • SessionContext.set_python_table_lookup(enabled: bool)
  • Documentation updated with examples demonstrating the feature.

Backwards compatibility

No breaking API changes to existing functions. Default behaviour is unchanged (feature disabled) so existing applications that rely on explicit registration will not be affected.

Example usage

from datafusion import SessionContext, SessionConfig
import pandas as pd

# construct with session-level default enabled
ctx = SessionContext(config=SessionConfig().with_python_table_lookup(True))
pdf = pd.DataFrame({"value": [1, 2, 3]})
res = ctx.sql("SELECT SUM(value) AS total FROM pdf").to_pandas()

# or enable per-session at construction time
ctx2 = SessionContext(auto_register_python_objects=True)

@kosiew kosiew marked this pull request as ready for review September 21, 2025 13:04
Comment on lines 237 to 238
the corresponding library (``pandas`` for pandas objects, ``pyarrow`` for
Arrow objects) to be installed.
Member
For both, this should use the Arrow PyCapsule Interface, no? Then you don't need any specific dependency because Pandas objects will by definition already have Pandas installed.

kosiew (Contributor Author)

This section of index.rst was trying to illustrate this:

   pdf = ...  # pandas DataFrame  <-- this requires pandas to be installed
   # or
   pdf = ...  # pyarrow object    <-- this requires pyarrow to be installed

   # if automatic registration is enabled, then we can query pdf like this
   ctx.sql("SELECT * FROM pdf")
   # without calling ctx.from_pandas or ctx.from_arrow

I will amend the section to convey this better.


``SessionContext`` can automatically resolve SQL table names that match
in-scope Python data objects. When automatic lookup is enabled, a query
such as ``ctx.sql("SELECT * FROM pdf")`` will register a pandas or
Member

Is the registration temporary? Or after the query ends is pdf now still bound to the specific object?

kosiew (Contributor Author)

Registrations persist: once a variable is bound we cache a weak reference plus its id in _python_table_bindings. On every subsequent SQL call we refresh that cache—dropping the registration if the object has been garbage collected, reassigned, or otherwise moved—but as long as the original object is still alive the table name remains usable across queries.
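The refresh cycle described here can be sketched as a small bookkeeping class. This is a minimal illustration of the weakref-plus-id approach, assuming hypothetical names (TableBindings, refresh, and the register/deregister callbacks); it is not the PR's _python_table_bindings API.

```python
import weakref


class TableBindings:
    """Weakref-plus-id bookkeeping for auto-registered tables."""

    def __init__(self):
        self._bindings = {}  # table name -> (weakref to object, id())

    def bind(self, name, obj):
        self._bindings[name] = (weakref.ref(obj), id(obj))

    def is_current(self, name, obj):
        """True if ``name`` is still bound to the same live object."""
        entry = self._bindings.get(name)
        if entry is None:
            return False
        ref, obj_id = entry
        target = ref()
        # stale if collected, or the variable now names a different object
        return target is not None and target is obj and id(obj) == obj_id

    def refresh(self, name, obj, register, deregister):
        """Re-register on reassignment; deregister a stale binding first."""
        if self.is_current(name, obj):
            return  # binding unchanged: table stays usable across queries
        if name in self._bindings:
            deregister(name)
            del self._bindings[name]
        if obj is not None:
            register(name, obj)
            self.bind(name, obj)
```

As long as the original object stays alive, refresh is a no-op and the table name keeps working; reassignment triggers a deregister/register pair.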

import pandas as pd
from datafusion import SessionContext
ctx = SessionContext(auto_register_python_objects=True)
Member

This is a long parameter; what do we think about turning it on by default and/or choosing a shorter name?

kosiew (Contributor Author)

Flipping this on by default would change long-standing failure modes—queries that currently raise “table not found” would start consulting the caller’s scope, which could mask mistakes or introduce non-deterministic behavior when multiple similarly named objects exist.
Because the feature walks Python stack frames, leaving it opt-in keeps that overhead and surprise factor away from existing users. I’m open to amending the flag name later; I chose auto_register_python_objects to make the opt-in explicit, but we can follow up if we find a cleaner alias that still differentiates it from the existing python_table_lookup config switch.

Comment on lines 797 to 816
if isinstance(obj, DataFrame):
    self.register_view(name, obj)
    registered = True
elif (
    obj.__class__.__module__.startswith("polars.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_polars(obj, name=name)
    registered = True
elif (
    obj.__class__.__module__.startswith("pandas.")
    and obj.__class__.__name__ == "DataFrame"
):
    self.from_pandas(obj, name=name)
    registered = True
elif isinstance(obj, (pa.Table, pa.RecordBatch, pa.RecordBatchReader)) or (
    hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__")
):
    self.from_arrow(obj, name=name)
    registered = True
Member

IMO all of this should (or at least could) be replaced with

hasattr(obj, "__arrow_c_stream__")

to use the PyCapsule Interface. Unless we want to support old versions of Pandas and Polars?

kosiew (Contributor Author) commented Sep 23, 2025

Good point.

I will invert the if comparison to check for

hasattr(obj, "__arrow_c_stream__")

first, before falling back to the module-name checks, since older versions of pandas (and maybe Polars) don't implement __arrow_c_stream__.
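The inverted order can be sketched as a small dispatch helper. This is an illustrative standalone function (the name registration_strategy and the string return values are assumptions, not the PR's code); it shows the PyCapsule protocols taking precedence, with module-name sniffing kept only as a legacy fallback.

```python
def registration_strategy(obj):
    """Choose how to register ``obj``, preferring the PyCapsule Interface."""
    # capsule path first: works for any library, no specific dependency
    if hasattr(obj, "__arrow_c_stream__") or hasattr(obj, "__arrow_c_array__"):
        return "from_arrow"
    module = type(obj).__module__
    cls_name = type(obj).__name__
    if module.startswith("polars.") and cls_name == "DataFrame":
        return "from_polars"  # fallback for pre-capsule polars releases
    if module.startswith("pandas.") and cls_name == "DataFrame":
        return "from_pandas"  # fallback for pre-capsule pandas releases
    return None  # unsupported object
```

A modern pandas or polars DataFrame exposes __arrow_c_stream__ and takes the capsule path; only older releases fall through to the module checks.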

@timsaucer (Member)

This feels like a lot of work that is brittle to upstream changes, trying to intercept the error and provide an alternate.

It feels like it would be simpler to provide an alternate SchemaProvider which would perform like the default in memory schema provider and if the table requested doesn't exist then to do the checks you have for variables in memory that could fit. To me that would be both (1) in line with what I think the real goal here is about searching for tables and (2) far less brittle to changes in the error handling that we're intercepting.

What do you think?

@timsaucer (Member)

If we do switch to my suggestion, then we do need to make sure that users know that if they provide their own catalog provider then they will not get this behavior.

Also, I think some custom catalog providers might break this current approach if their error handling doesn't match what you're expecting here.

kosiew (Contributor Author) commented Oct 1, 2025

@timsaucer

simpler to provide an alternate SchemaProvider which would perform like the default in memory schema provider

Thanks for the suggestion.

I prototyped a SchemaProvider hook, but it ran into a hard limitation: the auto-registration logic needs to inspect the caller’s Python frames to discover in-scope variables, which we do today in _lookup_python_object by walking the active stack.
During planning we hand control to wait_for_future, which releases the GIL and executes the resolver on Tokio worker threads; by the time DataFusion asks a SchemaProvider for a table, we’re no longer running on the original Python stack, so there’s nothing to inspect from inside the provider.
Catching the missing-table error inside SessionContext.sql keeps us on the initiating thread, which is the only place we can reliably enumerate the user’s variables.
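The constraint can be demonstrated without DataFusion at all: frames belong to the thread that created them, so a resolver running on a worker thread simply has no caller frames to walk. The sketch below is a generic illustration of that point, not PR code.

```python
import inspect
import threading


def can_see_local(var_name):
    """Check whether any frame on the current thread's stack binds ``var_name``."""
    return any(var_name in f.frame.f_locals for f in inspect.stack())


def demonstrate():
    pdf = [1, 2, 3]  # stands in for a user's in-scope DataFrame
    same_thread = can_see_local("pdf")  # visible: we share the stack

    result = {}

    def worker():
        # a fresh thread starts its own stack; the caller's frames,
        # and therefore `pdf`, are simply not there to inspect
        result["cross_thread"] = can_see_local("pdf")

    t = threading.Thread(target=worker)
    t.start()
    t.join()
    return same_thread, result["cross_thread"]
```

Running demonstrate() shows the asymmetry: the same-thread lookup finds pdf, the cross-thread lookup does not, which is exactly the situation a SchemaProvider invoked from a Tokio worker thread would be in.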

kosiew (Contributor Author) commented Oct 1, 2025

brittle to upstream changes, trying to intercept the error and provide an alternate.

We mitigated the brittleness by moving the fragile string matching into a single helper that first looks for a missing_table_names attribute the Rust binding now attaches whenever DataFusion raises a missing-table error, so the retry loop only triggers when the engine itself told us which names were unresolved.
The regex fallback is kept solely for backward-compatibility with older providers that don’t populate that attribute, and the test suite locks in both code paths so we notice if either signal ever changes upstream.
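The two-signal approach can be sketched as a single helper. The regex pattern and function name below are illustrative assumptions (DataFusion emits several message shapes, and the real collect_missing_table_names helper in Rust covers more of them); the structure mirrors what is described above: structured attribute first, message scraping second.

```python
import re

# illustrative pattern; the real helper handles more message formats
_MISSING_TABLE_RE = re.compile(r"table '([^']+)' not found", re.IGNORECASE)


def missing_table_names(exc):
    """Return unresolved table names for a planning error.

    Prefers the structured ``missing_table_names`` attribute attached by
    the Rust binding; falls back to scraping the message text.
    """
    names = getattr(exc, "missing_table_names", None)
    if names:
        return list(names)  # engine told us exactly which names failed
    return _MISSING_TABLE_RE.findall(str(exc))
```

Keeping both paths behind one function means the retry loop never parses strings itself, and a change in either signal shows up as a single test failure.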

@kosiew kosiew requested a review from kylebarron October 1, 2025 14:46
@timsaucer (Member)

I recommend we put this PR on hold until we get resolution in issue #513 about which way we want to go - parameterized queries or auto injection.

@timsaucer (Member)

Based on the conversation in #513 I suggest we close this PR in favor of using a method where the user specifies as placeholders. I have a PR nearly ready for that.

kosiew (Contributor Author) commented Oct 12, 2025

Agreed.

kosiew closed this Oct 12, 2025


Successfully merging this pull request may close these issues.

Support for automatic replacement scans