feat(pkg-py): Add IbisSource for Ibis Table support #193
Merged
Add a new DataSource implementation that keeps Polars LazyFrames lazy
until the render boundary. Key changes:
- Add `AnyFrame` type alias (`Union[nw.DataFrame, nw.LazyFrame]`)
- Widen DataSource ABC return types to support lazy frames
- Implement `PolarsLazySource` using Polars SQLContext for lazy SQL
- Update `normalize_data_source()` to detect and route LazyFrames
- Collect LazyFrames at render boundary in `app()` method
- Update type hints throughout
Usage:
```python
import polars as pl
from querychat import QueryChat
lf = pl.scan_parquet("large_data.parquet")
qc = QueryChat(data_source=lf, table_name="data")
```
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
PolarsLazySource._polars_dtype_to_sql was mapping pl.Time to "TIMESTAMP" but it should map to "TIME". Time-only values are not timestamps. Also added noqa comment for PLR0911 (too many return statements) since the function now has 7 return statements after the fix. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
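The distinction behind this fix can be sketched with standard-library types. This is an analogy only: it maps Python types rather than Polars dtypes, and the name `python_type_to_sql` and the chosen SQL type names are assumptions for illustration, not the helper in this PR.

```python
import datetime as dt

# Checked in order. bool subclasses int and datetime subclasses date,
# so the more specific type has to come first in each pair.
_SQL_TYPE_BY_PYTHON_TYPE = [
    (bool, "BOOLEAN"),
    (int, "INTEGER"),
    (float, "FLOAT"),
    (dt.datetime, "TIMESTAMP"),
    (dt.date, "DATE"),
    (dt.time, "TIME"),  # time-only values are not timestamps
    (str, "TEXT"),
]


def python_type_to_sql(tp: type) -> str:
    """Return an SQL type name for a Python type (hypothetical helper)."""
    for py_type, sql_type in _SQL_TYPE_BY_PYTHON_TYPE:
        if issubclass(tp, py_type):
            return sql_type
    return "TEXT"  # assumed fallback for unknown types
```

A table-driven lookup like this also keeps the number of return statements low, which is one way to sidestep PLR0911 instead of suppressing it.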
Previously, test_query only validated schema structure via collect_schema() without executing the query. This meant runtime errors (e.g., invalid casts) wouldn't surface until actual collection. Now test_query collects one row to catch runtime errors, matching the behavior of DataFrameSource.test_query. The return type changes from LazyFrame to DataFrame since we've already done the work. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The noqa: A005 comment was accidentally removed from types/__init__.py. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add IbisSource class to _datasource.py that wraps Ibis Tables for use with QueryChat. Key features:
- Accepts ibis.Table and table_name, extracts backend and column names
- get_db_type() returns the backend name (e.g., "duckdb", "postgres")
- execute_query() uses check_query() for SQL injection protection and returns ibis.Table (lazy) for chaining additional operations
- get_data() returns the original table
- cleanup() is a no-op since Ibis manages connection lifecycle
- Stores _colnames for use by test_query() (to be implemented later)
Note: get_schema() and test_query() raise NotImplementedError for now; they will be implemented in separate tasks.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Implement the get_schema() method for IbisSource that generates schema information for the LLM prompt. The implementation:
- Classifies columns by type using Ibis dtype methods (is_numeric, is_string, is_date, is_timestamp)
- Uses a single aggregate query for efficiency to get min/max for numeric/date columns and nunique for text columns
- Shows categorical values for text columns with unique count below the threshold
- Includes _ibis_dtype_to_sql() helper to convert Ibis dtypes to SQL type names
The output format matches other DataSource implementations (DataFrameSource, PolarsLazySource, SQLAlchemySource).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
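The single-aggregate-query idea can be sketched without Ibis. The example below uses the standard-library `sqlite3` module; `build_stats_query`, the `__min`/`__max`/`__nunique` alias suffixes, and the `cars` table are assumptions for illustration and are not taken from the PR.

```python
import sqlite3
from typing import List


def build_stats_query(table: str, numeric_cols: List[str], text_cols: List[str]) -> str:
    """Build one SELECT that computes every column statistic in a single pass.

    Assumes at least one column is passed; an empty select list is invalid SQL.
    """
    parts = []
    for col in numeric_cols:
        parts.append(f'MIN("{col}") AS "{col}__min"')
        parts.append(f'MAX("{col}") AS "{col}__max"')
    for col in text_cols:
        parts.append(f'COUNT(DISTINCT "{col}") AS "{col}__nunique"')
    return f'SELECT {", ".join(parts)} FROM "{table}"'


conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets the result row be read by column name
conn.execute("CREATE TABLE cars (mpg REAL, cyl INTEGER, make TEXT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [(21.0, 6, "Mazda"), (22.8, 4, "Datsun"), (18.7, 8, "Mazda")],
)

query = build_stats_query("cars", ["mpg", "cyl"], ["make"])
stats = dict(conn.execute(query).fetchone())
```

The backend scans the table once regardless of how many columns are summarized, which is the efficiency argument made above.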
Replace NotImplementedError stub with real implementation that:
- Uses check_query() for SQL injection protection
- Wraps query in LIMIT 1 subquery to test without full execution
- Always collects (calls .execute()) to catch runtime errors
- Returns nw.DataFrame via nw.from_native() on executed result
- Validates all original columns present when require_all_columns=True
- Raises MissingColumnsError when columns are missing
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
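The steps listed above can be sketched against `sqlite3` instead of an Ibis backend. Everything here is illustrative: `probe_query` is a hypothetical name, `MissingColumnsError` is a local stand-in for the error the commit mentions, and the `check_query()` validation step is omitted.

```python
import sqlite3
from typing import List, Optional, Tuple


class MissingColumnsError(ValueError):
    """Local stand-in for the error named in the commit message."""


def probe_query(
    conn: sqlite3.Connection,
    query: str,
    expected_columns: List[str],
    require_all_columns: bool = False,
) -> Tuple[List[str], Optional[tuple]]:
    """Execute `query` wrapped in a LIMIT 1 subquery and return (columns, first row)."""
    inner = query.strip().rstrip(";")  # a trailing semicolon would break the subquery
    cursor = conn.execute(f"SELECT * FROM ({inner}) AS probe LIMIT 1")
    row = cursor.fetchone()  # actually fetch, so execution errors surface here
    returned = [desc[0] for desc in cursor.description]
    if require_all_columns:
        missing = [c for c in expected_columns if c not in returned]
        if missing:
            raise MissingColumnsError(f"Query result is missing columns: {missing}")
    return returned, row


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (mpg REAL, cyl INTEGER, make TEXT)")
conn.execute("INSERT INTO cars VALUES (21.0, 6, 'Mazda')")
```

Fetching a single row keeps the probe cheap while still exercising the query, which is the trade-off the commit describes.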
- Update AnyFrame type with TYPE_CHECKING guard to include ibis.Table
- Add Ibis Table detection in normalize_data_source()
- Update render boundary in app() to handle Ibis Tables via to_pandas()
- Export IbisSource and other DataSource classes from __init__.py
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add test_querychat_with_ibis_table() that verifies QueryChat correctly accepts an Ibis Table as a data source, creating an IbisSource and executing queries that return ibis.Table objects. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add documentation for using Ibis Tables as a data source in querychat, including examples for DuckDB, PostgreSQL, and BigQuery backends. The section explains Ibis's value proposition (lazy evaluation, backend flexibility, chainable operations) and provides guidance on when to choose Ibis vs SQLAlchemy. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Branch force-pushed from bdf7d54 to 1321c4e.
- Remove lazy_frame_demo.py example script
- Fix empty LazyFrame handling in get_schema to prevent .row() failure
- Add .head() limit when collecting unique values to reduce memory usage
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
The abstract test_query() method was declared to return AnyFrame but all concrete implementations return nw.DataFrame. This is intentional since test_query collects data to catch runtime errors. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…LazyFrame support
- Batch categorical value collection into single scan using implode() instead of N separate scans (one per categorical column)
- Extract _get_categorical_values() helper method for clarity
- Rename AnyFrame to LazyOrDataFrame for better readability
- Store native Polars LazyFrame internally instead of narwhals wrapper
- Simplify df_to_html() implementation
- Improve error messages for unsupported LazyFrame backends
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…eta dataclass
Introduce a ColumnMeta dataclass to consolidate column metadata into a single data structure, replacing multiple parallel lists and dicts. Changes:
- Add ColumnMeta dataclass with name, sql_type, kind, min_val, max_val, categories
- Refactor get_schema() into three clear steps: classify, add stats, format
- Extract static helper methods: _make_column_meta, _add_column_stats, _format_schema
- Use .row(0, named=True) consistently for extracting aggregate results
- Fix test to check native LazyFrame identity instead of wrapper identity
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
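A rough sketch of what such a dataclass and its formatting step might look like follows. The field names come from the commit message; the field types, the `kind` values, and the output layout of `format_schema` are assumptions and may differ from the PR.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class ColumnMeta:
    name: str
    sql_type: str
    kind: str  # assumed values: "numeric", "date", "text", "other"
    min_val: Any = None
    max_val: Any = None
    categories: List[str] = field(default_factory=list)


def format_schema(table_name: str, columns: List[ColumnMeta]) -> str:
    """Render column metadata as text (the layout is illustrative only)."""
    lines = [f"Table: {table_name}", "Columns:"]
    for col in columns:
        lines.append(f"- {col.name} ({col.sql_type})")
        if col.kind in ("numeric", "date"):
            # The range is printed even when min/max are None
            lines.append(f"  Range: {col.min_val} to {col.max_val}")
        elif col.categories:
            values = ", ".join(repr(c) for c in col.categories)
            lines.append(f"  Categorical values: {values}")
    return "\n".join(lines)


columns = [
    ColumnMeta("mpg", "FLOAT", "numeric", min_val=10.4, max_val=33.9),
    ColumnMeta("make", "TEXT", "text", categories=["Datsun", "Mazda"]),
]
schema_text = format_schema("cars", columns)
```

Keeping one record per column means the classify, add-stats, and format steps all pass the same list along, rather than keeping several parallel lists and dicts in sync.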
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Update DataFrameSource to require narwhals DataFrame as input, removing implicit conversion from raw pandas/polars DataFrames. Update all tests to wrap DataFrames with nw.from_native(). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace specific benchmark numbers with qualitative explanation of lazy evaluation benefits (deferred loading, query optimization, reduced memory). Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
…Source
Remove conditional that skipped range output when both min/max were None, matching DataFrameSource behavior of always showing range info.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Resolve conflicts and integrate LazyFrame support with new multi-framework architecture:
- Update _querychat_base.py with PolarsLazySource support in normalize_data_source
- Add LazyFrame handling in _shiny.py (collect before render)
- Update _shiny_module.py with LazyOrDataFrame type
- Keep GT-based df_to_html in _utils.py
- Combine dev dependencies (polars) with new docs dependencies
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
# Conflicts:
#   pkg-py/CHANGELOG.md
#   pkg-py/src/querychat/_shiny.py
#   pkg-py/src/querychat/_shiny_module.py
#   pyproject.toml
- Update polars tests to wrap DataFrames with nw.from_native()
- Fix df_to_html test to match actual truncation message format
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
… parameter
- Add overloads for proper type inference based on lazy parameter
- lazy=False (default) returns nw.DataFrame (collected)
- lazy=True returns nw.LazyFrame
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
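The overload pattern described above can be sketched with `typing.overload` and `Literal`. The classes `EagerFrame` and `LazyFrame` and the function `get_frame` are hypothetical stand-ins, since the commit subject is truncated and does not name the method that gained the parameter.

```python
from typing import List, Literal, Union, overload


class EagerFrame:
    """Stand-in for a collected frame such as nw.DataFrame."""

    def __init__(self, rows: List[int]) -> None:
        self.rows = rows


class LazyFrame:
    """Stand-in for a deferred frame such as nw.LazyFrame."""

    def __init__(self, rows: List[int]) -> None:
        self._rows = rows

    def collect(self) -> EagerFrame:
        return EagerFrame(list(self._rows))


@overload
def get_frame(source: LazyFrame, lazy: Literal[False] = ...) -> EagerFrame: ...
@overload
def get_frame(source: LazyFrame, lazy: Literal[True]) -> LazyFrame: ...
def get_frame(source: LazyFrame, lazy: bool = False) -> Union[EagerFrame, LazyFrame]:
    """Return the lazy frame untouched, or collect it when lazy is False."""
    return source if lazy else source.collect()


lf = LazyFrame([1, 2, 3])
eager = get_frame(lf)            # a type checker infers EagerFrame
still_lazy = get_frame(lf, lazy=True)  # a type checker infers LazyFrame
```

With the overloads in place, callers do not need a cast or an isinstance check after the call, because the literal value of `lazy` selects the return type.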
- Add defensive check for empty stats_row in _add_column_stats
- Remove redundant _backend.sql() call in test_query
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add is_pandas_df() TypeGuard and runtime validation that ibis.Table.execute() returns a pandas DataFrame, with informative error message linking to issue tracker. Also simplify as_narwhals() and fix st.dataframe formatting. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
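A minimal sketch of a TypeGuard plus runtime validation follows. The function name `is_pandas_df` matches the commit message, but the class-name-based check, the `ensure_pandas` helper, and the error text are assumptions; the PR may implement the check differently, and its error message also links to the issue tracker, which is omitted here.

```python
from typing import TYPE_CHECKING, Any

if TYPE_CHECKING:
    # Imported for type checkers only, so pandas is not needed at runtime.
    import pandas as pd
    from typing import TypeGuard


def is_pandas_df(obj: Any) -> "TypeGuard[pd.DataFrame]":
    """Check for a pandas DataFrame without importing pandas (assumed approach)."""
    cls = type(obj)
    return cls.__name__ == "DataFrame" and cls.__module__.split(".")[0] == "pandas"


def ensure_pandas(result: Any) -> Any:
    """Fail early with a readable message if the result is not a pandas DataFrame."""
    if not is_pandas_df(result):
        raise TypeError(
            f"Expected a pandas DataFrame from execute(), got {type(result).__name__}."
        )
    return result
```

Inspecting the class name and module keeps pandas an optional dependency: nothing is imported until a type checker reads the annotations.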
- Use as_narwhals() in _streamlit.py to properly handle Ibis Tables
- Clarify df() docstring in _shiny.py to describe return type matches input
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Avoid duplicate SQL parsing in IbisSource.test_query() by reusing the parsed result instead of calling _backend.sql() twice
- Handle Ibis Tables in Shiny app() dt() render using as_narwhals()
- Move nw import to TYPE_CHECKING block in _shiny.py
- Refactor DataFrameSource.get_schema() to use format_schema() helper for consistency with IbisSource and PolarsLazySource
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Replace pandas-specific methods with narwhals equivalents in IbisSource._add_column_stats to avoid assuming execute() returns a pandas DataFrame.
- Use as_narwhals() helper instead of direct pandas methods
- Replace .empty/.iloc[0].to_dict() with narwhals shape/row/zip
- Replace pandas indexing with narwhals filter/get_column
- Move nw import to TYPE_CHECKING in _streamlit.py (lint fix)
- Improve df() and cleanup() docstrings
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add tests for critical utility functions that handle Ibis Tables:
- test_utils_ibis.py: New file with 18 tests covering:
  - is_ibis_table() TypeGuard validation
  - as_narwhals() eager/lazy collection from Ibis Tables
  - df_to_html() HTML generation from Ibis Tables
  - Empty table edge cases
- test_ibis_source.py: Add TestIbisSourceValidation class with:
  - test_rejects_non_sql_backend: Validates TypeError for non-SQL backends
These tests cover the integration points where Ibis data flows through utilities used by tools.py and all framework renderers.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Add TestIbisSourceEdgeCases class with 9 new tests:
- test_empty_table_schema: Verify get_schema works with empty tables
- test_empty_table_execute_query: Verify queries on empty tables
- test_multiple_categorical_columns: Test UNION query path in _add_column_stats
- test_no_categorical_columns: Test early return path (numeric-only tables)
- test_column_with_all_nulls: Verify NULL handling doesn't crash
- test_high_cardinality_text_not_categorical: Test threshold exclusion
- test_categorical_at_threshold_boundary: Test exact boundary behavior
- test_cleanup_is_safe_noop: Verify cleanup() contract
- test_get_data_after_execute_query: Verify get_data returns original
These tests cover the complex _add_column_stats logic paths and edge cases that could cause runtime failures.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
- Use as_narwhals() in test_query for consistent column access
- Improve error message for schema names type validation
- Move DType import to TYPE_CHECKING block
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Summary
- Add `IbisSource`, a DataSource that accepts Ibis Tables and keeps queries lazy
- `execute_query()` returns a lazy Ibis Table, allowing users to chain additional operations
Usage
When building a custom app, the `df()` method returns a lazy Ibis Table that can be chained with additional operations.
Changes
- `ibis` optional dependency with pandas for render boundary
- Ibis Table detection in `normalize_data_source()`
Test Plan
- `uv run pytest pkg-py/tests/test_ibis_source.py pkg-py/tests/test_querychat.py -v`
Note: This PR depends on #191 (PolarsLazySource) and should be merged after it.
🤖 Generated with Claude Code