Merged
56 commits
973a433
First attempt at genericizing data source
jcheng5 Apr 4, 2025
8de0ac7
Unify prompts by adding chevron Python dependency
jcheng5 Apr 4, 2025
53c7df3
Make prompt aware of what engine is being used
jcheng5 Apr 18, 2025
a2122f2
Replace SQLite support with SQLAlchemy support
jcheng5 Apr 18, 2025
a218fb9
Don't fail when given table name's case differs from SQLAlchemy Inspe…
jcheng5 Apr 23, 2025
dc0814e
Forgot import
jcheng5 May 1, 2025
9d95d1d
Have server() return proper class with typed methods, instead of dict
jcheng5 Jun 2, 2025
aeb87dd
Auto-create sqlite database for example
jcheng5 Jun 2, 2025
c38b567
Have init() take data frame or sqlalchemy engine directly
jcheng5 Jun 2, 2025
e7972e8
Merge remote-tracking branch 'origin/main' into generic-datasource-im…
jcheng5 Jun 3, 2025
57922b3
Use GPT-4.1 by default, not GPT-4, yuck
jcheng5 Jun 3, 2025
84d30ad
Merge remote-tracking branch 'origin/generic-datasource' into generic…
jcheng5 Jun 3, 2025
a08764b
Update README
jcheng5 Jun 3, 2025
374bdfb
this should significantly speed up schema generation
npelikan Jun 6, 2025
e294b1b
another speedup
npelikan Jun 6, 2025
b179ea6
ruff formatting
npelikan Jun 6, 2025
2cbe199
updating so formatting checks pass
npelikan Jun 6, 2025
8f59aa7
adding a generic r datasource
npelikan Jun 7, 2025
2ececf5
critical change: should return a lazy table rather than executing by …
npelikan Jun 7, 2025
f4ca445
edits to test suite and devtools::check() passing
npelikan Jun 7, 2025
c9b03da
Merge pull request #1 from posit-dev/main
npelikan Jun 7, 2025
48503f0
example update
npelikan Jun 7, 2025
4809615
error message for a footgun
npelikan Jun 9, 2025
a1ae3b6
Merge branch 'main' into r-generic-datasource
npelikan Jun 12, 2025
24ef182
Merge pull request #4 from npelikan/r-generic-datasource
npelikan Jun 12, 2025
3b289c7
update to use s3 classes to simplify the code
npelikan Jun 19, 2025
7052d6e
Merge pull request #5 from npelikan/r-generic-datasource
npelikan Jun 19, 2025
146777a
README update
npelikan Jun 19, 2025
9911965
added injection of SQL dialect into prompt. Also cleaned up test naming
npelikan Jun 19, 2025
8d05d7f
more simplification
npelikan Jun 19, 2025
b18b570
Merge branch 'main' into main
npelikan Jun 25, 2025
41c9e1e
merge fix
npelikan Jun 25, 2025
e347110
small dep edit
npelikan Jun 26, 2025
753c5af
Code review
jcheng5 Jun 26, 2025
1ee065b
more tests, and code review edits
npelikan Jun 26, 2025
5492b0f
testing changes
npelikan Jun 27, 2025
1ff4fe5
more test passing
npelikan Jun 27, 2025
eb9104c
cleaning up gitignores
npelikan Jun 27, 2025
09231fa
updating python datasource to prevent collisions
npelikan Jun 27, 2025
9e53ca3
Merge remote-tracking branch 'posit-dev/main'
npelikan Jul 1, 2025
150e550
fix for github actions
npelikan Jul 1, 2025
c589444
adding tests to python github action (as we have some tests now!)
npelikan Jul 1, 2025
98b2f29
edits for gha
npelikan Jul 1, 2025
3fd17e4
makefile edit
npelikan Jul 1, 2025
e6731be
air format
npelikan Jul 8, 2025
d45820f
code cleanup, better tests, and dropping `glue` dependency
npelikan Jul 9, 2025
3f55974
Fix error in qc.df() when no query is active
jcheng5 Jul 16, 2025
395e116
Adding dplyr::sql() identifier to get_lazy_query() to fix failing tests.
npelikan Jul 17, 2025
d86888d
adding more tests to cover the empty execute_data query use case and …
npelikan Jul 17, 2025
765250e
description edit to pass routine test
npelikan Jul 17, 2025
6432fa1
edit to remove `tbl` output per discussion on #28
npelikan Jul 28, 2025
de0a31e
better data source nested identifier handling
npelikan Jul 29, 2025
b6eeb4a
fixing a missing quote identifier
npelikan Jul 29, 2025
32a65fc
doc cleanup
npelikan Jul 29, 2025
1325ed1
a bit more helpful error message
npelikan Jul 29, 2025
0d01d82
even more helpful erroring
npelikan Jul 29, 2025
4 changes: 2 additions & 2 deletions .github/workflows/py-test.yml
@@ -37,8 +37,8 @@ jobs:
       - name: 📦 Install the project
         run: uv sync --python ${{matrix.config.python-version }} --all-extras --all-groups

-      # - name: 🧪 Check tests
-      # run: make py-check-tests
+      - name: 🧪 Check tests
+        run: make py-check-tests

       - name: 📝 Check types
         run: make py-check-types
5 changes: 5 additions & 0 deletions .gitignore
@@ -250,8 +250,13 @@ po/*~

# RStudio Connect folder
rsconnect/
python-package/CLAUDE.md

uv.lock
_dev

# R ignores
/.quarto/
.Rprofile
renv/
renv.lock
11 changes: 5 additions & 6 deletions Makefile
@@ -123,12 +123,11 @@ py-check-tox: ## [py] Run python 3.9 - 3.12 checks with tox
 	@echo "🔄 Running tests and type checking with tox for Python 3.9--3.12"
 	uv run tox run-parallel

-# .PHONY: py-check-tests
-# py-check-tests: ## [py] Run python tests
-# @echo ""
-# @echo "🧪 Running tests with pytest"
-# uv run playwright install
-# uv run pytest
+.PHONY: py-check-tests
+py-check-tests: ## [py] Run python tests
+	@echo ""
+	@echo "🧪 Running tests with pytest"
+	uv run pytest

 .PHONY: py-check-types
 py-check-types: ## [py] Run python type checks
4 changes: 2 additions & 2 deletions README.md
@@ -36,11 +36,11 @@ querychat does not have direct access to the raw data; it can _only_ read or fil
 - **Transparency:** querychat always displays the SQL to the user, so it can be vetted instead of blindly trusted.
 - **Reproducibility:** The SQL query can be easily copied and reused.

-Currently, querychat uses DuckDB for its SQL engine. It's extremely fast and has a surprising number of statistical functions.
+Currently, querychat uses DuckDB for its SQL engine when working with data frames. For database sources, it uses the native SQL dialect of the connected database.

 ## Language-specific Documentation

 For detailed information on how to use querychat in your preferred language, see the language-specific READMEs:

 - [R Documentation](pkg-r/README.md)
-- [Python Documentation](pkg-py/README.md)
+- [Python Documentation](pkg-py/README.md)
2 changes: 1 addition & 1 deletion pkg-py/examples/app.py
@@ -49,4 +49,4 @@ def data_table():


 # Create Shiny app
-app = App(app_ui, server)
+app = App(app_ui, server)
14 changes: 11 additions & 3 deletions pkg-py/src/querychat/__init__.py
@@ -1,5 +1,13 @@
-from querychat.querychat import init, sidebar, system_prompt
-from querychat.querychat import mod_server as server
-from querychat.querychat import mod_ui as ui
+from querychat.querychat import (
+    init,
+    sidebar,
+    system_prompt,
+)
+from querychat.querychat import (
+    mod_server as server,
+)
+from querychat.querychat import (
+    mod_ui as ui,
+)

 __all__ = ["init", "server", "sidebar", "system_prompt", "ui"]
134 changes: 98 additions & 36 deletions pkg-py/src/querychat/datasource.py
@@ -178,7 +178,7 @@ def __init__(self, engine: Engine, table_name: str):
         if not inspector.has_table(table_name):
             raise ValueError(f"Table '{table_name}' not found in database")

-    def get_schema(self, *, categorical_threshold: int) -> str:
+    def get_schema(self, *, categorical_threshold: int) -> str:  # noqa: PLR0912
         """
         Generate schema information from database table.

@@ -191,12 +191,15 @@ def get_schema(self, *, categorical_threshold: int) -> str:

         schema = [f"Table: {self._table_name}", "Columns:"]

+        # Build a single query to get all column statistics
+        select_parts = []
+        numeric_columns = []
+        text_columns = []
+
         for col in columns:
-            # Get SQL type name
-            sql_type = self._get_sql_type_name(col["type"])
-            column_info = [f"- {col['name']} ({sql_type})"]
+            col_name = col["name"]

-            # For numeric columns, try to get range
+            # Check if column is numeric
             if isinstance(
                 col["type"],
                 (
@@ -208,44 +211,103 @@ def get_schema(self, *, categorical_threshold: int) -> str:
                     sqltypes.DateTime,
                     sqltypes.BigInteger,
                     sqltypes.SmallInteger,
-                    # sqltypes.Interval,
                 ),
             ):
-                try:
-                    query = text(
-                        f"SELECT MIN({col['name']}), MAX({col['name']}) FROM {self._table_name}",
-                    )
-                    with self._get_connection() as conn:
-                        result = conn.execute(query).fetchone()
-                        if result and result[0] is not None and result[1] is not None:
-                            column_info.append(f" Range: {result[0]} to {result[1]}")
-                except Exception:  # noqa: S110
-                    pass  # Silently skip range info if query fails
-
-            # For string/text columns, check if categorical
+                numeric_columns.append(col_name)
+                select_parts.extend(
+                    [
+                        f"MIN({col_name}) as {col_name}__min",
+                        f"MAX({col_name}) as {col_name}__max",
+                    ],
+                )
+
+            # Check if column is text/string
             elif isinstance(
                 col["type"],
                 (sqltypes.String, sqltypes.Text, sqltypes.Enum),
             ):
-                try:
-                    count_query = text(
-                        f"SELECT COUNT(DISTINCT {col['name']}) FROM {self._table_name}",
-                    )
+                text_columns.append(col_name)
+                select_parts.append(
+                    f"COUNT(DISTINCT {col_name}) as {col_name}__distinct_count",
+                )
+
+        # Execute single query to get all statistics
+        column_stats = {}
+        if select_parts:
+            try:
+                stats_query = text(
+                    f"SELECT {', '.join(select_parts)} FROM {self._table_name}",
+                )
+                with self._get_connection() as conn:
+                    result = conn.execute(stats_query).fetchone()
+                    if result:
+                        # Convert result to dict for easier access
+                        column_stats = dict(zip(result._fields, result))
+            except Exception:  # noqa: S110
+                pass  # Fall back to no statistics if query fails
+
+        # Get categorical values for text columns that are below threshold
+        categorical_values = {}
+        text_cols_to_query = []
+        for col_name in text_columns:
+            distinct_count_key = f"{col_name}__distinct_count"
+            if (
+                distinct_count_key in column_stats
+                and column_stats[distinct_count_key]
+                and column_stats[distinct_count_key] <= categorical_threshold
+            ):
+                text_cols_to_query.append(col_name)
+
+        # Get categorical values in a single query if needed
+        if text_cols_to_query:
+            try:
+                # Build UNION query for all categorical columns
+                union_parts = [
+                    f"SELECT '{col_name}' as column_name, {col_name} as value "
+                    f"FROM {self._table_name} WHERE {col_name} IS NOT NULL "
+                    f"GROUP BY {col_name}"
+                    for col_name in text_cols_to_query
+                ]
+
+                if union_parts:
+                    categorical_query = text(" UNION ALL ".join(union_parts))
                     with self._get_connection() as conn:
-                        distinct_count = conn.execute(count_query).scalar()
-                        if distinct_count and distinct_count <= categorical_threshold:
-                            values_query = text(
-                                f"SELECT DISTINCT {col['name']} FROM {self._table_name} "
-                                f"WHERE {col['name']} IS NOT NULL",
-                            )
-                            values = [
-                                str(row[0])
-                                for row in conn.execute(values_query).fetchall()
-                            ]
-                            values_str = ", ".join([f"'{v}'" for v in values])
-                            column_info.append(f" Categorical values: {values_str}")
-                except Exception:  # noqa: S110
-                    pass  # Silently skip categorical info if query fails
+                        results = conn.execute(categorical_query).fetchall()
+                        for row in results:
+                            col_name, value = row
+                            if col_name not in categorical_values:
+                                categorical_values[col_name] = []
+                            categorical_values[col_name].append(str(value))
+            except Exception:  # noqa: S110
+                pass  # Skip categorical values if query fails
+
+        # Build schema description using collected statistics
+        for col in columns:
+            col_name = col["name"]
+            sql_type = self._get_sql_type_name(col["type"])
+            column_info = [f"- {col_name} ({sql_type})"]
+
+            # Add range info for numeric columns
+            if col_name in numeric_columns:
+                min_key = f"{col_name}__min"
+                max_key = f"{col_name}__max"
+                if (
+                    min_key in column_stats
+                    and max_key in column_stats
+                    and column_stats[min_key] is not None
+                    and column_stats[max_key] is not None
+                ):
+                    column_info.append(
+                        f" Range: {column_stats[min_key]} to {column_stats[max_key]}",
+                    )
+
+            # Add categorical values for text columns
+            elif col_name in categorical_values:
+                values = categorical_values[col_name]
+                # Remove duplicates and sort
+                unique_values = sorted(set(values))
+                values_str = ", ".join([f"'{v}'" for v in unique_values])
+                column_info.append(f" Categorical values: {values_str}")

             schema.extend(column_info)

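The `get_schema` change replaces one query per column with two batched queries: a single `SELECT` carrying every `MIN`/`MAX`/`COUNT(DISTINCT ...)` aggregate, then a single `UNION ALL` for the values of low-cardinality text columns. The sketch below illustrates that pattern in isolation. It is not the package's code: it uses the standard-library `sqlite3` module instead of SQLAlchemy, and the function names, the `cars` table, and the double-quoted identifiers are assumptions made for the example.

```python
import sqlite3


def column_stats(conn, table, numeric_cols, text_cols):
    """Fetch MIN/MAX and COUNT(DISTINCT ...) for many columns in one query."""
    select_parts = []
    for col in numeric_cols:
        select_parts.append(f'MIN("{col}") AS "{col}__min"')
        select_parts.append(f'MAX("{col}") AS "{col}__max"')
    for col in text_cols:
        select_parts.append(f'COUNT(DISTINCT "{col}") AS "{col}__distinct_count"')
    if not select_parts:
        return {}
    cur = conn.execute(f'SELECT {", ".join(select_parts)} FROM "{table}"')
    row = cur.fetchone()
    # cursor.description holds the column aliases in SELECT order
    names = [desc[0] for desc in cur.description]
    return dict(zip(names, row))


def categorical_values(conn, table, cols):
    """Fetch distinct non-NULL values for several columns with one UNION ALL."""
    if not cols:
        return {}
    union_parts = []
    for col in cols:
        label = f"'{col}'"    # string literal naming the source column
        ident = f'"{col}"'    # quoted identifier (illustrative, not in the diff)
        union_parts.append(
            f"SELECT {label} AS column_name, {ident} AS value "
            f'FROM "{table}" WHERE {ident} IS NOT NULL GROUP BY {ident}'
        )
    values = {}
    for column_name, value in conn.execute(" UNION ALL ".join(union_parts)):
        values.setdefault(column_name, []).append(str(value))
    return {name: sorted(vals) for name, vals in values.items()}


# Hypothetical sample data for the demonstration
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (mpg REAL, cyl INTEGER, make TEXT)")
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [(21.0, 6, "Mazda"), (22.8, 4, "Datsun"), (18.7, 8, "Mazda")],
)

stats = column_stats(conn, "cars", ["mpg", "cyl"], ["make"])
threshold = 20  # plays the role of categorical_threshold
categorical = [c for c in ["make"] if stats[f"{c}__distinct_count"] <= threshold]
values = categorical_values(conn, "cars", categorical)
print(stats)
print(values)
```

Whatever the table width, this costs at most two round trips, which is presumably where the speedup mentioned in the commit messages comes from. The `__min`/`__max`/`__distinct_count` alias suffixes mirror the diff; a column whose own name ends in one of them could collide with a generated alias.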
19 changes: 6 additions & 13 deletions pkg-py/src/querychat/querychat.py
@@ -118,19 +118,12 @@ def __getitem__(self, key: str) -> Any:
         backwards compatibility only; new code should use the attributes
         directly instead.
         """
-        if key == "chat":  # noqa: SIM116
-            return self.chat
-        elif key == "sql":
-            return self.sql
-        elif key == "title":
-            return self.title
-        elif key == "df":
-            return self.df
-
-        raise KeyError(
-            f"`QueryChat` does not have a key `'{key}'`. "
-            "Use the attributes `chat`, `sql`, `title`, or `df` instead.",
-        )
+        return {
+            "chat": self.chat,
+            "sql": self.sql,
+            "title": self.title,
+            "df": self.df,
+        }.get(key)


 def system_prompt(
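One consequence of the `__getitem__` rewrite that may be worth checking: `dict.get` returns `None` for an unknown key, whereas the removed `if`/`elif` chain raised `KeyError`. The sketch below contrasts the two behaviours. The `Handle` class and its placeholder attribute values are hypothetical stand-ins, not the package's `QueryChat` class, and `chained` is a condensed equivalent of the removed branches rather than a copy of them.

```python
from typing import Any


class Handle:
    """Hypothetical stand-in; attribute values are placeholders."""

    def __init__(self) -> None:
        self.chat = "chat"
        self.sql = "SELECT 1"
        self.title = "title"
        self.df = "df"

    def _as_dict(self) -> dict:
        return {
            "chat": self.chat,
            "sql": self.sql,
            "title": self.title,
            "df": self.df,
        }

    def chained(self, key: str) -> Any:
        # Condensed equivalent of the removed branches: unknown keys raise
        if key in ("chat", "sql", "title", "df"):
            return getattr(self, key)
        raise KeyError(f"no key {key!r}")

    def lookup(self, key: str) -> Any:
        # Same shape as the added code: unknown keys yield None
        return self._as_dict().get(key)

    def strict_lookup(self, key: str) -> Any:
        # Variant that keeps the dict but still raises on unknown keys
        return self._as_dict()[key]


h = Handle()
print(h.lookup("sql"))
print(h.lookup("missing"))
try:
    h.chained("missing")
except KeyError as err:
    print("KeyError:", err)
```

If callers rely on `KeyError` to detect misspelled keys, plain indexing as in `strict_lookup` keeps the compact dictionary form while preserving that behaviour.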
Empty file added pkg-py/tests/__init__.py
Empty file.