@cpsievert cpsievert commented Jan 14, 2026

Summary

  • Add IbisSource DataSource that accepts Ibis Tables and keeps queries lazy
  • Returns Ibis Tables from execute_query(), allowing users to chain additional operations
  • Works with any Ibis backend (DuckDB, PostgreSQL, BigQuery, Snowflake, etc.)

Usage

```python
import ibis
from querychat import QueryChat

# Connect to any Ibis-supported backend
conn = ibis.duckdb.connect("my_database.duckdb")
table = conn.table("my_table")

# Pass Ibis Table directly to QueryChat
qc = QueryChat(table, "my_table")
app = qc.app()
```

When building a custom app, the df() method returns a lazy Ibis Table that can be chained with additional operations:

```python
# Get lazy result from LLM-generated query
result = qc.df()

# Chain additional Ibis operations
filtered = result.filter(result.status == "active")

# Materialize when ready
df = filtered.to_pandas()
```

Changes

  • Dependencies: Added ibis as an optional dependency, along with pandas for the render boundary
  • IbisSource class: New DataSource with lazy execution semantics
  • Integration: Updated type signatures, render boundary, detection in normalize_data_source()
  • Tests: 15 unit tests + 1 integration test (all passing)
  • Documentation: Added Ibis Tables section to data-sources.qmd

Test Plan

  • Unit tests for IbisSource (15 tests)
  • Integration test with QueryChat
  • All tests pass: `uv run pytest pkg-py/tests/test_ibis_source.py pkg-py/tests/test_querychat.py -v`

Note: This PR depends on #191 (PolarsLazySource) and should be merged after it.

🤖 Generated with Claude Code

cpsievert and others added 6 commits January 14, 2026 10:51
Add a new DataSource implementation that keeps Polars LazyFrames lazy
until the render boundary. Key changes:

- Add `AnyFrame` type alias (`Union[nw.DataFrame, nw.LazyFrame]`)
- Widen DataSource ABC return types to support lazy frames
- Implement `PolarsLazySource` using Polars SQLContext for lazy SQL
- Update `normalize_data_source()` to detect and route LazyFrames
- Collect LazyFrames at render boundary in `app()` method
- Update type hints throughout

Usage:
```python
import polars as pl
from querychat import QueryChat

lf = pl.scan_parquet("large_data.parquet")
qc = QueryChat(data_source=lf, table_name="data")
```

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PolarsLazySource._polars_dtype_to_sql was mapping pl.Time to "TIMESTAMP"
but it should map to "TIME". Time-only values are not timestamps.

Also added noqa comment for PLR0911 (too many return statements) since
the function now has 7 return statements after the fix.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Previously, test_query only validated schema structure via collect_schema()
without executing the query. This meant runtime errors (e.g., invalid casts)
wouldn't surface until actual collection.

Now test_query collects one row to catch runtime errors, matching the behavior
of DataFrameSource.test_query. The return type changes from LazyFrame to
DataFrame since we've already done the work.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
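
The reasoning in the commit above (schema-only validation misses errors that only appear on execution) can be illustrated without Polars. This is a minimal sketch that uses plain Python generators as a stand-in for a lazy frame; `lazy_cast` and `probe_first_row` are hypothetical names for illustration, not functions from querychat.

```python
from itertools import islice


def lazy_cast(rows, column, cast):
    """Lazily apply `cast` to one column; nothing runs until rows are consumed."""
    for row in rows:
        yield {**row, column: cast(row[column])}


def probe_first_row(lazy_rows):
    """Materialize at most one row so runtime errors surface immediately."""
    return next(islice(lazy_rows, 1), None)


rows = [{"id": "1", "qty": "not-a-number"}]

# Building the pipeline succeeds even though the cast cannot work...
pipeline = lazy_cast(rows, "qty", int)

# ...the failure only appears once a row is actually pulled through it.
try:
    probe_first_row(pipeline)
    error = None
except ValueError as exc:
    error = exc
```

Pulling a single row keeps the check cheap, although it can only catch errors that occur while producing that first row.
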
The noqa: A005 comment was accidentally removed from types/__init__.py.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert marked this pull request as draft January 14, 2026 19:03
cpsievert and others added 8 commits January 14, 2026 13:06
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add IbisSource class to _datasource.py that wraps Ibis Tables for use
with QueryChat. Key features:

- Accepts ibis.Table and table_name, extracts backend and column names
- get_db_type() returns the backend name (e.g., "duckdb", "postgres")
- execute_query() uses check_query() for SQL injection protection and
  returns ibis.Table (lazy) for chaining additional operations
- get_data() returns the original table
- cleanup() is a no-op since Ibis manages connection lifecycle
- Stores _colnames for use by test_query() (to be implemented later)

Note: get_schema() and test_query() raise NotImplementedError for now;
they will be implemented in separate tasks.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement the get_schema() method for IbisSource that generates schema
information for the LLM prompt. The implementation:

- Classifies columns by type using Ibis dtype methods (is_numeric,
  is_string, is_date, is_timestamp)
- Uses a single aggregate query for efficiency to get min/max for
  numeric/date columns and nunique for text columns
- Shows categorical values for text columns with unique count below
  the threshold
- Includes _ibis_dtype_to_sql() helper to convert Ibis dtypes to
  SQL type names

The output format matches other DataSource implementations
(DataFrameSource, PolarsLazySource, SQLAlchemySource).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
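
The "single aggregate query" idea described above can be sketched with the standard-library `sqlite3` module rather than Ibis. The function name `summarize_columns` and the `__min` / `__max` / `__nunique` alias suffixes are assumptions made for this illustration; the actual IbisSource code is not shown in this PR conversation.

```python
import sqlite3


def summarize_columns(conn, table, numeric_cols, text_cols):
    """Fetch min/max for numeric columns and distinct counts for text
    columns with one query, so the table is scanned only once."""
    exprs = []
    for col in numeric_cols:
        exprs.append(f'MIN("{col}") AS "{col}__min"')
        exprs.append(f'MAX("{col}") AS "{col}__max"')
    for col in text_cols:
        exprs.append(f'COUNT(DISTINCT "{col}") AS "{col}__nunique"')
    if not exprs:
        return {}
    select_list = ", ".join(exprs)
    cursor = conn.execute(f'SELECT {select_list} FROM "{table}"')
    row = cursor.fetchone()
    names = [desc[0] for desc in cursor.description]
    return dict(zip(names, row))


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (amount REAL, status TEXT)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [(10.0, "active"), (25.5, "closed"), (3.0, "active")],
)
stats = summarize_columns(conn, "orders", ["amount"], ["status"])
```

Identifiers are interpolated directly here for brevity; that is only reasonable because the column names come from the table's own schema and not from user input.
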
Replace NotImplementedError stub with real implementation that:
- Uses check_query() for SQL injection protection
- Wraps query in LIMIT 1 subquery to test without full execution
- Always collects (calls .execute()) to catch runtime errors
- Returns nw.DataFrame via nw.from_native() on executed result
- Validates all original columns present when require_all_columns=True
- Raises MissingColumnsError when columns are missing

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
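
The LIMIT 1 probe described above might look roughly like the following, shown here against the standard-library `sqlite3` module instead of an Ibis backend. `probe_query` is a hypothetical helper, and the `MissingColumnsError` class below is a local stand-in for the exception named in the commit message; the `check_query()` step is omitted.

```python
import sqlite3


class MissingColumnsError(ValueError):
    """Local stand-in for the exception named in the commit message."""


def probe_query(conn, query, original_columns, require_all_columns=False):
    """Run `query` inside a LIMIT 1 subquery and return (columns, first_row)."""
    inner = query.strip().rstrip(";")
    cursor = conn.execute(f"SELECT * FROM ({inner}) AS _probe LIMIT 1")
    first_row = cursor.fetchone()  # forces execution so runtime errors surface
    returned = [desc[0] for desc in cursor.description]
    if require_all_columns:
        missing = [c for c in original_columns if c not in returned]
        if missing:
            raise MissingColumnsError(f"Query result is missing columns: {missing}")
    return returned, first_row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE people (id INTEGER, name TEXT)")
conn.executemany("INSERT INTO people VALUES (?, ?)", [(1, "a"), (2, "b")])

columns, first_row = probe_query(
    conn, "SELECT id, name FROM people;", ["id", "name"], require_all_columns=True
)

try:
    probe_query(conn, "SELECT id FROM people", ["id", "name"], require_all_columns=True)
    missing_error = None
except MissingColumnsError as exc:
    missing_error = exc
```
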
- Update AnyFrame type with TYPE_CHECKING guard to include ibis.Table
- Add Ibis Table detection in normalize_data_source()
- Update render boundary in app() to handle Ibis Tables via to_pandas()
- Export IbisSource and other DataSource classes from __init__.py

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test_querychat_with_ibis_table() that verifies QueryChat correctly
accepts an Ibis Table as a data source, creating an IbisSource and
executing queries that return ibis.Table objects.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add documentation for using Ibis Tables as a data source in querychat,
including examples for DuckDB, PostgreSQL, and BigQuery backends.
The section explains Ibis's value proposition (lazy evaluation, backend
flexibility, chainable operations) and provides guidance on when to
choose Ibis vs SQLAlchemy.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@cpsievert cpsievert force-pushed the feat/py-ibis-source branch from bdf7d54 to 1321c4e Compare January 14, 2026 19:06
cpsievert and others added 14 commits January 14, 2026 18:20
- Remove lazy_frame_demo.py example script
- Fix empty LazyFrame handling in get_schema to prevent .row() failure
- Add .head() limit when collecting unique values to reduce memory usage

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The abstract test_query() method was declared to return AnyFrame but all
concrete implementations return nw.DataFrame. This is intentional since
test_query collects data to catch runtime errors.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…LazyFrame support

- Batch categorical value collection into single scan using implode()
  instead of N separate scans (one per categorical column)
- Extract _get_categorical_values() helper method for clarity
- Rename AnyFrame to LazyOrDataFrame for better readability
- Store native Polars LazyFrame internally instead of narwhals wrapper
- Simplify df_to_html() implementation
- Improve error messages for unsupported LazyFrame backends

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…eta dataclass

Introduce a ColumnMeta dataclass to consolidate column metadata into a
single data structure, replacing multiple parallel lists and dicts.

Changes:
- Add ColumnMeta dataclass with name, sql_type, kind, min_val, max_val, categories
- Refactor get_schema() into three clear steps: classify, add stats, format
- Extract static helper methods: _make_column_meta, _add_column_stats, _format_schema
- Use .row(0, named=True) consistently for extracting aggregate results
- Fix test to check native LazyFrame identity instead of wrapper identity

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
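
A rough sketch of what consolidating parallel lists into one `ColumnMeta` record could look like. The field names come from the commit message above; the field types, the `kind` values, and the output layout of `format_schema` are assumptions for illustration and may differ from the real implementation.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class ColumnMeta:
    name: str
    sql_type: str
    kind: str  # assumed values: "numeric", "date", "text", "other"
    min_val: object = None
    max_val: object = None
    categories: list[str] = field(default_factory=list)


def format_schema(table_name, columns):
    """Render column metadata as text (layout is illustrative only)."""
    lines = [f"Table: {table_name}", "Columns:"]
    for col in columns:
        lines.append(f"- {col.name} ({col.sql_type})")
        if col.kind in ("numeric", "date"):
            # Range is always emitted, even when both bounds are None
            lines.append(f"  Range: {col.min_val} to {col.max_val}")
        elif col.categories:
            quoted = ", ".join(f"'{v}'" for v in col.categories)
            lines.append(f"  Categorical values: {quoted}")
    return "\n".join(lines)


columns = [
    ColumnMeta("amount", "FLOAT", "numeric", min_val=3.0, max_val=25.5),
    ColumnMeta("status", "TEXT", "text", categories=["active", "closed"]),
]
schema_text = format_schema("orders", columns)
```

Keeping every per-column fact on one object means the classify, add-stats, and format steps can each take and return the same list, instead of keeping several lists aligned by index.
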
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update DataFrameSource to require narwhals DataFrame as input, removing
implicit conversion from raw pandas/polars DataFrames. Update all tests
to wrap DataFrames with nw.from_native().

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace specific benchmark numbers with qualitative explanation of lazy
evaluation benefits (deferred loading, query optimization, reduced memory).

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…Source

Remove conditional that skipped range output when both min/max were None,
matching DataFrameSource behavior of always showing range info.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts and integrate LazyFrame support with new multi-framework
architecture:
- Update _querychat_base.py with PolarsLazySource support in normalize_data_source
- Add LazyFrame handling in _shiny.py (collect before render)
- Update _shiny_module.py with LazyOrDataFrame type
- Keep GT-based df_to_html in _utils.py
- Combine dev dependencies (polars) with new docs dependencies

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Conflicts:
#	pkg-py/CHANGELOG.md
#	pkg-py/src/querychat/_shiny.py
#	pkg-py/src/querychat/_shiny_module.py
#	pyproject.toml
- Update polars tests to wrap DataFrames with nw.from_native()
- Fix df_to_html test to match actual truncation message format

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… parameter

- Add overloads for proper type inference based on lazy parameter
- lazy=False (default) returns nw.DataFrame (collected)
- lazy=True returns nw.LazyFrame

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
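
The overload pattern described above can be sketched as follows. The commit title is truncated, so the method that actually received the overloads is not known here; `Source.fetch`, `EagerFrame`, and `LazyFrame` are hypothetical stand-ins for the real method and the narwhals frame types.

```python
from __future__ import annotations

from typing import Literal, overload


class EagerFrame:
    """Stand-in for a collected DataFrame."""

    def __init__(self, rows):
        self.rows = list(rows)


class LazyFrame:
    """Stand-in for a lazy frame that is only materialized on collect()."""

    def __init__(self, rows):
        self._rows = rows

    def collect(self) -> EagerFrame:
        return EagerFrame(self._rows)


class Source:
    def __init__(self, rows):
        self._lazy = LazyFrame(rows)

    @overload
    def fetch(self, lazy: Literal[False] = ...) -> EagerFrame: ...

    @overload
    def fetch(self, lazy: Literal[True]) -> LazyFrame: ...

    def fetch(self, lazy: bool = False) -> EagerFrame | LazyFrame:
        # lazy=False (the default) collects; lazy=True hands back the lazy frame
        return self._lazy if lazy else self._lazy.collect()


source = Source([1, 2, 3])
eager = source.fetch()
deferred = source.fetch(lazy=True)
```

With the two `Literal` overloads, a type checker can infer the eager type for `source.fetch()` and the lazy type for `source.fetch(lazy=True)` without the caller needing a cast.
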
- Add defensive check for empty stats_row in _add_column_stats
- Remove redundant _backend.sql() call in test_query

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add is_pandas_df() TypeGuard and runtime validation that ibis.Table.execute()
returns a pandas DataFrame, with informative error message linking to issue tracker.

Also simplify as_narwhals() and fix st.dataframe formatting.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
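
One way a TypeGuard such as `is_pandas_df()` could be written without importing pandas eagerly is sketched below. The `sys.modules` lookup is an assumption about the approach, `require_pandas_df` is a hypothetical wrapper, and the error text is a placeholder (the commit says the real message links to the issue tracker; that link is not reproduced here).

```python
from __future__ import annotations

import sys
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    from typing import TypeGuard  # available in Python 3.10+

    import pandas as pd


def is_pandas_df(obj: Any) -> TypeGuard[pd.DataFrame]:
    """True only if pandas is already imported and obj is one of its DataFrames."""
    pandas_module = sys.modules.get("pandas")
    return pandas_module is not None and isinstance(obj, pandas_module.DataFrame)


def require_pandas_df(obj: Any) -> Any:
    """Runtime validation for a result that is expected to be a pandas DataFrame."""
    if not is_pandas_df(obj):
        raise TypeError(
            f"Expected a pandas DataFrame but got {type(obj).__name__}."
        )
    return obj


try:
    require_pandas_df({"a": [1, 2]})
    validation_error = None
except TypeError as exc:
    validation_error = exc
```

If pandas has never been imported in the process, no object can be a pandas DataFrame, so the guard returns False without triggering the import.
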
- Use as_narwhals() in _streamlit.py to properly handle Ibis Tables
- Clarify df() docstring in _shiny.py to describe return type matches input

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Avoid duplicate SQL parsing in IbisSource.test_query() by reusing
  the parsed result instead of calling _backend.sql() twice
- Handle Ibis Tables in Shiny app() dt() render using as_narwhals()
- Move nw import to TYPE_CHECKING block in _shiny.py
- Refactor DataFrameSource.get_schema() to use format_schema() helper
  for consistency with IbisSource and PolarsLazySource

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace pandas-specific methods with narwhals equivalents in
IbisSource._add_column_stats to avoid assuming execute() returns
a pandas DataFrame.

- Use as_narwhals() helper instead of direct pandas methods
- Replace .empty/.iloc[0].to_dict() with narwhals shape/row/zip
- Replace pandas indexing with narwhals filter/get_column
- Move nw import to TYPE_CHECKING in _streamlit.py (lint fix)
- Improve df() and cleanup() docstrings

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests for critical utility functions that handle Ibis Tables:

- test_utils_ibis.py: New file with 18 tests covering:
  - is_ibis_table() TypeGuard validation
  - as_narwhals() eager/lazy collection from Ibis Tables
  - df_to_html() HTML generation from Ibis Tables
  - Empty table edge cases

- test_ibis_source.py: Add TestIbisSourceValidation class with:
  - test_rejects_non_sql_backend: Validates TypeError for non-SQL backends

These tests cover the integration points where Ibis data flows through
utilities used by tools.py and all framework renderers.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add TestIbisSourceEdgeCases class with 9 new tests:

- test_empty_table_schema: Verify get_schema works with empty tables
- test_empty_table_execute_query: Verify queries on empty tables
- test_multiple_categorical_columns: Test UNION query path in _add_column_stats
- test_no_categorical_columns: Test early return path (numeric-only tables)
- test_column_with_all_nulls: Verify NULL handling doesn't crash
- test_high_cardinality_text_not_categorical: Test threshold exclusion
- test_categorical_at_threshold_boundary: Test exact boundary behavior
- test_cleanup_is_safe_noop: Verify cleanup() contract
- test_get_data_after_execute_query: Verify get_data returns original

These tests cover the complex _add_column_stats logic paths and edge
cases that could cause runtime failures.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
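
The commit above mentions a "UNION query path" for multiple categorical columns but does not show the query. The following is only a guess at the general shape, using the standard-library `sqlite3` module: one `SELECT DISTINCT` per column, combined with `UNION ALL` so all distinct values arrive in a single round trip. `categorical_values` is a hypothetical helper name.

```python
import sqlite3


def categorical_values(conn, table, columns):
    """Collect distinct non-NULL values for several columns in one query."""
    if not columns:
        return {}  # early return when there are no categorical columns
    selects, params = [], []
    for col in columns:
        selects.append(
            f'SELECT DISTINCT ? AS column_name, CAST("{col}" AS TEXT) AS value '
            f'FROM "{table}" WHERE "{col}" IS NOT NULL'
        )
        params.append(col)
    rows = conn.execute(" UNION ALL ".join(selects), params).fetchall()
    values = {col: [] for col in columns}
    for column_name, value in rows:
        values[column_name].append(value)
    return {col: sorted(vals) for col, vals in values.items()}


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE accounts (status TEXT, region TEXT)")
conn.executemany(
    "INSERT INTO accounts VALUES (?, ?)",
    [("active", "east"), ("closed", "west"), ("active", None)],
)
categories = categorical_values(conn, "accounts", ["status", "region"])
```
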
- Use as_narwhals() in test_query for consistent column access
- Improve error message for schema names type validation
- Move DType import to TYPE_CHECKING block

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

This comment was marked as resolved.

@cpsievert cpsievert merged commit ae5a07d into main Jan 16, 2026
7 checks passed
@cpsievert cpsievert deleted the feat/py-ibis-source branch January 16, 2026 22:41
