Commit 53c9872

cpsievert and claude authored
feat(pkg-py): Replace pandas with narwhals (#175)
* feat(pkg-py): Replace pandas dependency with narwhals abstraction layer

  Remove pandas as a required dependency in favor of narwhals, which provides a unified DataFrame interface supporting both pandas and polars backends.

  Changes:
  - Add _df_compat.py module with read_csv, read_sql, and duckdb_result_to_nw helpers
  - Update DataSource classes to return narwhals DataFrames
  - Update df_to_html to generate HTML without pandas dependency
  - Make pandas and polars optional dependencies
  - Add comprehensive tests for DataFrameSource and df_compat module

  Users can now install with either `pip install querychat[pandas]` or `pip install querychat[polars]`. Use `.to_native()` on returned DataFrames to get the underlying pandas or polars DataFrame.

  🤖 Generated with [Claude Code](https://claude.com/claude-code)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* Update CLAUDE.md

* feat(pkg-py): Use great_tables for DataFrame HTML rendering

  Replace manual HTML table construction in df_to_html() with great_tables GT class for richer, styled table output in chat messages.

  - Add great-tables>=0.16.0 as a dependency
  - Simplify df_to_html() to use GT().as_raw_html()
  - Remove manual _escape_html helper (great_tables handles escaping)

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

* fix(pkg-py): Pass client method reference to mod_server

  Passing self.client() (calling the method) resulted in tools being registered twice - once in client() and again in mod_server. Instead, pass self.client (the method reference) so mod_server can call it with the update_dashboard and reset_dashboard callbacks.

  Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.5 <noreply@anthropic.com>
1 parent 9ecc34a commit 53c9872

17 files changed, +632 -118 lines changed
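Reviewer note: the last fix in the commit message (passing `self.client` instead of `self.client()`) can be illustrated with a minimal sketch. Everything here is a hypothetical stand-in: `ToyClient`, `ToyQueryChat`, `buggy_server`, and `fixed_server` are not querychat classes or functions, and the real `mod_server` signature is not shown in this excerpt.

```python
class ToyClient:
    """Hypothetical chat client that records every tool registration."""

    def __init__(self):
        self.tools = []

    def register_tool(self, name):
        self.tools.append(name)


class ToyQueryChat:
    """Hypothetical owner of the client factory method."""

    def client(self, update_dashboard=None, reset_dashboard=None):
        # Builds a fresh client and registers both tools exactly once.
        c = ToyClient()
        c.register_tool("update_dashboard")
        c.register_tool("reset_dashboard")
        return c


def buggy_server(client):
    # Receives an already-built client, then registers the tools again.
    client.register_tool("update_dashboard")
    client.register_tool("reset_dashboard")
    return client


def fixed_server(client_factory):
    # Receives the method itself and calls it once with the callbacks.
    return client_factory(
        update_dashboard=lambda: None,
        reset_dashboard=lambda: None,
    )


qc = ToyQueryChat()
buggy_tools = buggy_server(qc.client()).tools   # method called by the caller
fixed_tools = fixed_server(qc.client).tools     # method reference passed in
print(buggy_tools)
print(fixed_tools)
```

In this sketch the buggy path ends up with each tool listed twice, while the fixed path registers each tool once, because only the server decides when the factory runs.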

CLAUDE.md

Lines changed: 8 additions & 0 deletions
@@ -69,6 +69,14 @@ make py-build
 make py-docs
 ```

+Before finishing your implementation or committing any code, you should run:
+
+```bash
+uv run ruff check --fix pkg-py --config pyproject.toml
+```
+
+To get help with making sure code adheres to project standards.
+
 ### R Package

 ```bash

pkg-py/CHANGELOG.md

Lines changed: 6 additions & 0 deletions
@@ -7,6 +7,12 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

 ## [UNRELEASED]

+### Breaking Changes
+
+* Methods like `execute_query()`, `get_data()`, and `df()` now return a `narwhals.DataFrame` instead of a `pandas.DataFrame`. This allows querychat to drop its `pandas` dependency, and for you to use any `narwhals`-compatible dataframe of your choosing.
+* If this breaks existing code, note you can call `.to_native()` on the new dataframe value to get your `pandas` dataframe back.
+* Note that `polars` or `pandas` will be needed to realize a `sqlalchemy` connection query as a dataframe. Install with `pip install querychat[pandas]` or `pip install querychat[polars]`
+
 ### New features

 * `QueryChat.sidebar()`, `QueryChat.ui()`, and `QueryChat.server()` now support an optional `id` parameter to create multiple chat instances from a single `QueryChat` object. (#172)
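Reviewer note: the changelog says `polars` or `pandas` is needed to realize query results, and the `execute_query()` docstrings in this commit state that polars is used if available, with pandas as the fallback. The `_df_compat.py` module itself is not shown here, so the following is only a sketch of how such a preference order could be implemented with the standard library; `pick_backend` and `_installed` are hypothetical names.

```python
from importlib.util import find_spec


def _installed(module_name):
    """Return True if a top-level module can be found without importing it."""
    return find_spec(module_name) is not None


def pick_backend(is_installed=_installed):
    """Prefer polars, fall back to pandas, otherwise fail with install hints.

    `is_installed` is injectable so the preference order can be exercised
    without changing the environment (an assumption made for this sketch).
    """
    for name in ("polars", "pandas"):
        if is_installed(name):
            return name
    raise ImportError(
        "Neither polars nor pandas is installed. Install with "
        "`pip install querychat[pandas]` or `pip install querychat[polars]`."
    )


# Simulate three environments instead of relying on what is installed here.
print(pick_backend(lambda name: True))              # both available
print(pick_backend(lambda name: name == "pandas"))  # pandas only
```

Checking availability with `find_spec` avoids importing either library until one is actually chosen.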

pkg-py/docs/build.qmd

Lines changed: 1 addition & 1 deletion
@@ -220,7 +220,7 @@ with ui.layout_columns():

 @render_plotly
 def survival_plot():
-    d = qc.df()
+    d = qc.df().to_native() # Convert for pandas groupby()
     summary = d.groupby('pclass')['survived'].mean().reset_index()
     return px.bar(summary, x='pclass', y='survived')
 ```

pkg-py/docs/data-sources.qmd

Lines changed: 1 addition & 1 deletion
@@ -63,7 +63,7 @@ app = qc.app()

 :::

-If you're [building an app](build.qmd), note you can read the queried data frame reactively using the `df()` method, which returns a `pandas.DataFrame` by default.
+If you're [building an app](build.qmd), note you can read the queried data frame reactively using the `df()` method, which returns a `narwhals.DataFrame`. Call `.to_native()` on the result to get the underlying pandas or polars DataFrame.

 ## Databases

pkg-py/src/querychat/_datasource.py

Lines changed: 41 additions & 51 deletions
@@ -5,14 +5,13 @@

 import duckdb
 import narwhals.stable.v1 as nw
-import pandas as pd
 from sqlalchemy import inspect, text
 from sqlalchemy.sql import sqltypes

+from ._df_compat import duckdb_result_to_nw, read_sql
 from ._utils import check_query

 if TYPE_CHECKING:
-    from narwhals.stable.v1.typing import IntoFrame
     from sqlalchemy.engine import Connection, Engine


@@ -59,7 +58,7 @@ def get_schema(self, *, categorical_threshold: int) -> str:
         ...

     @abstractmethod
-    def execute_query(self, query: str) -> pd.DataFrame:
+    def execute_query(self, query: str) -> nw.DataFrame:
         """
         Execute SQL query and return results as DataFrame.

@@ -71,15 +70,15 @@ def execute_query(self, query: str) -> pd.DataFrame:
         Returns
         -------
         :
-            Query results as a pandas DataFrame
+            Query results as a narwhals DataFrame

         """
         ...

     @abstractmethod
     def test_query(
         self, query: str, *, require_all_columns: bool = False
-    ) -> pd.DataFrame:
+    ) -> nw.DataFrame:
         """
         Test SQL query by fetching only one row.

@@ -94,7 +93,7 @@ def test_query(
         Returns
         -------
         :
-            Query results as a pandas DataFrame with at most one row
+            Query results as a narwhals DataFrame with at most one row

         Raises
         ------
@@ -105,14 +104,14 @@ def test_query(
         ...

     @abstractmethod
-    def get_data(self) -> pd.DataFrame:
+    def get_data(self) -> nw.DataFrame:
         """
         Return the unfiltered data as a DataFrame.

         Returns
         -------
         :
-            The complete dataset as a pandas DataFrame
+            The complete dataset as a narwhals DataFrame

         """
         ...
@@ -133,28 +132,27 @@ def cleanup(self) -> None:


 class DataFrameSource(DataSource):
-    """A DataSource implementation that wraps a pandas DataFrame using DuckDB."""
+    """A DataSource implementation that wraps a DataFrame using DuckDB."""

-    _df: nw.DataFrame | nw.LazyFrame
+    _df: nw.DataFrame

-    def __init__(self, df: IntoFrame, table_name: str):
+    def __init__(self, df: nw.DataFrame, table_name: str):
         """
-        Initialize with a pandas DataFrame.
+        Initialize with a DataFrame.

         Parameters
         ----------
         df
-            The DataFrame to wrap
+            The DataFrame to wrap (pandas, polars, or any narwhals-compatible frame)
         table_name
             Name of the table in SQL queries

         """
-        self._df = nw.from_native(df)
+        self._df = nw.from_native(df) if not isinstance(df, nw.DataFrame) else df
         self.table_name = table_name

         self._conn = duckdb.connect(database=":memory:")
-        # TODO(@gadenbuie): What if the data frame is already SQL-backed?
-        self._conn.register(table_name, self._df.lazy().collect().to_pandas())
+        self._conn.register(table_name, self._df.to_native())
         self._conn.execute("""
 -- extensions: lock down supply chain + auto behaviors
 SET allow_community_extensions = false;
@@ -203,16 +201,8 @@ def get_schema(self, *, categorical_threshold: int) -> str:
         """
         schema = [f"Table: {self.table_name}", "Columns:"]

-        # Ensure we're working with a DataFrame, not a LazyFrame
-        ndf = (
-            self._df.head(10).collect()
-            if isinstance(self._df, nw.LazyFrame)
-            else self._df
-        )
-
-        for column in ndf.columns:
-            # Map pandas dtypes to SQL-like types
-            dtype = ndf[column].dtype
+        for column in self._df.columns:
+            dtype = self._df[column].dtype
             if dtype.is_integer():
                 sql_type = "INTEGER"
             elif dtype.is_float():
@@ -228,17 +218,14 @@ def get_schema(self, *, categorical_threshold: int) -> str:

             column_info = [f"- {column} ({sql_type})"]

-            # For TEXT columns, check if they're categorical
             if sql_type == "TEXT":
-                unique_values = ndf[column].drop_nulls().unique()
+                unique_values = self._df[column].drop_nulls().unique()
                 if unique_values.len() <= categorical_threshold:
                     categories = unique_values.to_list()
                     categories_str = ", ".join([f"'{c}'" for c in categories])
                     column_info.append(f" Categorical values: {categories_str}")
-
-            # For numeric columns, include range
             elif sql_type in ["INTEGER", "FLOAT", "DATE", "TIME"]:
-                rng = ndf[column].min(), ndf[column].max()
+                rng = self._df[column].min(), self._df[column].max()
                 if rng[0] is None and rng[1] is None:
                     column_info.append(" Range: NULL to NULL")
                 else:
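Reviewer note: the two `get_schema` hunks above show only part of the method, so here is a simplified, dependency-free sketch of the same idea (type label, categorical values for low-cardinality text, min/max range for numerics). It uses plain Python lists instead of narwhals columns. `describe_columns` is a hypothetical name, and the non-null `Range: X to Y` format and the way lines are appended to `schema` are assumptions, since those lines are not visible in this diff.

```python
def describe_columns(table_name, columns, *, categorical_threshold):
    """Build a schema description from a dict of column name -> list of values."""
    schema = [f"Table: {table_name}", "Columns:"]

    def is_int(v):
        return isinstance(v, int) and not isinstance(v, bool)

    def is_number(v):
        return isinstance(v, (int, float)) and not isinstance(v, bool)

    for name, values in columns.items():
        present = [v for v in values if v is not None]  # like drop_nulls()

        # Crude type inference from Python values (stand-in for dtype checks).
        if present and all(is_int(v) for v in present):
            sql_type = "INTEGER"
        elif present and all(is_number(v) for v in present):
            sql_type = "FLOAT"
        else:
            sql_type = "TEXT"

        column_info = [f"- {name} ({sql_type})"]

        if sql_type == "TEXT":
            # Order-preserving de-duplication (stand-in for unique()).
            unique_values = list(dict.fromkeys(present))
            if len(unique_values) <= categorical_threshold:
                categories_str = ", ".join(f"'{c}'" for c in unique_values)
                column_info.append(f"  Categorical values: {categories_str}")
        else:
            # Assumed format for the non-null case.
            column_info.append(f"  Range: {min(present)} to {max(present)}")

        schema.extend(column_info)

    return "\n".join(schema)


schema_text = describe_columns(
    "titanic",
    {
        "pclass": [1, 2, 3, None],
        "fare": [7.25, 71.28],
        "sex": ["male", "female", "male"],
        "name": ["a", "b", "c"],
    },
    categorical_threshold=2,
)
print(schema_text)
```

With a threshold of 2, `sex` is listed with its categories while `name` (three distinct values) gets only its type line.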
@@ -248,10 +235,12 @@ def get_schema(self, *, categorical_threshold: int) -> str:

         return "\n".join(schema)

-    def execute_query(self, query: str) -> pd.DataFrame:
+    def execute_query(self, query: str) -> nw.DataFrame:
         """
         Execute query using DuckDB.

+        Uses polars if available, otherwise falls back to pandas.
+
         Parameters
         ----------
         query
@@ -260,7 +249,7 @@ def execute_query(self, query: str) -> pd.DataFrame:
         Returns
         -------
         :
-            Query results as pandas DataFrame
+            Query results as narwhals DataFrame

         Raises
         ------
@@ -269,11 +258,11 @@ def execute_query(self, query: str) -> pd.DataFrame:

         """
         check_query(query)
-        return self._conn.execute(query).df()
+        return duckdb_result_to_nw(self._conn.execute(query))

     def test_query(
         self, query: str, *, require_all_columns: bool = False
-    ) -> pd.DataFrame:
+    ) -> nw.DataFrame:
         """
         Test query by fetching only one row.

@@ -298,7 +287,7 @@ def test_query(

         """
         check_query(query)
-        result = self._conn.execute(f"{query} LIMIT 1").df()
+        result = duckdb_result_to_nw(self._conn.execute(f"{query} LIMIT 1"))

         if require_all_columns:
             result_columns = set(result.columns)
@@ -316,18 +305,17 @@ def test_query(

         return result

-    def get_data(self) -> pd.DataFrame:
+    def get_data(self) -> nw.DataFrame:
         """
         Return the unfiltered data as a DataFrame.

         Returns
         -------
         :
-            The complete dataset as a pandas DataFrame
+            The complete dataset as a narwhals DataFrame

         """
-        # TODO(@gadenbuie): This should just return `self._df` and not a pandas DataFrame
-        return self._df.lazy().collect().to_pandas()
+        return self._df

     def cleanup(self) -> None:
         """
@@ -519,10 +507,12 @@ def get_schema(self, *, categorical_threshold: int) -> str: # noqa: PLR0912

         return "\n".join(schema)

-    def execute_query(self, query: str) -> pd.DataFrame:
+    def execute_query(self, query: str) -> nw.DataFrame:
         """
         Execute SQL query and return results as DataFrame.

+        Uses polars if available, otherwise falls back to pandas.
+
         Parameters
         ----------
         query
@@ -531,7 +521,7 @@ def execute_query(self, query: str) -> pd.DataFrame:
         Returns
         -------
         :
-            Query results as pandas DataFrame
+            Query results as narwhals DataFrame

         Raises
         ------
@@ -541,11 +531,11 @@ def execute_query(self, query: str) -> pd.DataFrame:
         """
         check_query(query)
         with self._get_connection() as conn:
-            return pd.read_sql_query(text(query), conn)
+            return read_sql(text(query), conn)

     def test_query(
         self, query: str, *, require_all_columns: bool = False
-    ) -> pd.DataFrame:
+    ) -> nw.DataFrame:
         """
         Test query by fetching only one row.

@@ -571,16 +561,16 @@ def test_query(
         """
         check_query(query)
         with self._get_connection() as conn:
-            # Use pandas read_sql_query with limit to get at most one row
+            # Use read_sql with limit to get at most one row
             limit_query = f"SELECT * FROM ({query}) AS subquery LIMIT 1"
             try:
-                df = pd.read_sql_query(text(limit_query), conn)
+                result = read_sql(text(limit_query), conn)
             except Exception:
                 # If LIMIT syntax doesn't work, fall back to regular read and take first row
-                df = pd.read_sql_query(text(query), conn).head(1)
+                result = read_sql(text(query), conn).head(1)

             if require_all_columns:
-                result_columns = set(df.columns)
+                result_columns = set(result.columns)
                 original_columns_set = set(self._colnames)
                 missing_columns = original_columns_set - result_columns

@@ -595,16 +585,16 @@ def test_query(
                         f"Original columns: {original_list}"
                     )

-            return df
+            return result

-    def get_data(self) -> pd.DataFrame:
+    def get_data(self) -> nw.DataFrame:
         """
         Return the unfiltered data as a DataFrame.

         Returns
         -------
         :
-            The complete dataset as a pandas DataFrame
+            The complete dataset as a narwhals DataFrame

         """
         return self.execute_query(f"SELECT * FROM {self.table_name}")
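Reviewer note: the SQLAlchemy-backed `test_query` wraps the user's query in a `LIMIT 1` subquery, falls back to running the plain query and keeping the first row, and optionally checks that no original column was dropped. The sketch below reproduces that pattern with the standard-library `sqlite3` module instead of SQLAlchemy and `read_sql`. `probe_query` is a hypothetical name and the error message wording is an assumption, since most of the real message is elided from the diff.

```python
import sqlite3


def probe_query(conn, query, original_columns, *, require_all_columns=False):
    """Run `query` but fetch at most one row; optionally verify its columns."""
    limit_query = f"SELECT * FROM ({query}) AS subquery LIMIT 1"
    try:
        cursor = conn.execute(limit_query)
        rows = cursor.fetchall()
    except sqlite3.Error:
        # If the wrapped form is rejected, run the query as-is and take one row.
        cursor = conn.execute(query)
        rows = cursor.fetchmany(1)

    # The first item of each description entry is the column name.
    result_columns = [d[0] for d in cursor.description]

    if require_all_columns:
        missing = set(original_columns) - set(result_columns)
        if missing:
            # Hypothetical wording; the real message is not shown in this diff.
            raise ValueError(
                f"Query result is missing columns: {', '.join(sorted(missing))}. "
                f"Original columns: {', '.join(original_columns)}"
            )

    return result_columns, rows


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE titanic (pclass INTEGER, survived INTEGER)")
conn.executemany("INSERT INTO titanic VALUES (?, ?)", [(1, 1), (3, 0)])

columns, rows = probe_query(
    conn,
    "SELECT * FROM titanic WHERE survived = 1",
    ["pclass", "survived"],
    require_all_columns=True,
)
print(columns, rows)
```

Fetching a single row keeps validation cheap: the check only needs the result's column names, not its data.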
