Skip to content

Conversation

@npelikan
Copy link
Contributor

@npelikan npelikan commented Jun 7, 2025

This is a WIP but seems to basically work.

One question -- I retained the basic functionality for local data.frames (that is, df -> querychat -> df), where remote data sources instead return a dbplyr lazy tbl(), meant for chaining. Is this too confusing of a behavior split? Should local data.frames also return a tbl(), just now connected to duckdb?

A few immediate TODOs:

  • add more documentation
  • generate some examples
  • create shinytests
  • validate this works on more than just sqlite

jcheng5 and others added 23 commits April 3, 2025 22:47
...instead of requiring explicit DataSource subclass creation
…provements

Plus some improvements:
- Cleaner .md file reading code in example apps
- Use GPT-4.1 by default, not GPT-4 😬
- Make sqlalchemy required
fix: No longer need to manually calls session$ns() with shinychat (#1
@npelikan npelikan marked this pull request as draft June 10, 2025 02:24
@npelikan npelikan marked this pull request as ready for review June 10, 2025 02:25
@schloerke schloerke marked this pull request as draft June 10, 2025 13:42
@schloerke schloerke changed the title DRAFT: R generic datasources feat(r): Generic datasources Jun 10, 2025
@npelikan
Copy link
Contributor Author

Alright, I think this is ready to merge @jcheng5

npelikan and others added 11 commits July 1, 2025 17:39
Previously, the examples/app-database.R would shown an error on
startup because the initial query was "", which was then sent
as a SQL query to RSQLite. The get_lazy_data code path accounted
for the "" query, so we decided to make the eager code path just
call the lazy code path, then collect().

Also fixed a formatting issue with the table.
It seems like dbplyr tables-as-queries can be a bit... temperamental. This should fix that by explicitly declaring sql always.
@jcheng5
Copy link
Collaborator

jcheng5 commented Jul 17, 2025

Shooot, there is one problem left. We're now using dplyr::tbl(conn, dplyr::sql(query)) to actually perform the SQL queries. The problem is that dplyr assumes that second argument is a table name, not a whole SQL query. So little things that are valid for SQL queries, break; for example, a trailing semicolon, or ending on a SQL comment. You end up seeing pretty confusing error messages with unrecognizable SQL like:

SELECT * FROM (SELECT * FROM iris --test) AS `q01` WHERE (0 = 1)

I'll ask Hadley what the right way to do this is (if there is one).

@jcheng5
Copy link
Collaborator

jcheng5 commented Jul 21, 2025

After thinking about this further, I don't think having get_lazy_data return a dplyr::tbl is the right way to go. Instead, I think it's DBI::DBIResult that we want.

Besides the challenge above of dplyr::tbl not really being designed to take a whole SQL query as the second argument, dplyr::collect() inhales the entire dataset into memory in one step (correct me if I'm wrong) so doesn't help in cases where the data is extremely large. (Arguably not a great case for querychat anyway.) In contrast, DBI::DBIResult lets you fetch a chunk at a time. The main downside is not being able to easily apply more dplyr verbs--probably use the eager df() for that.

@npelikan I know you're off this week, and I'm off next week, but if you agree with my assessment and have Daniel review your code, you guys can merge in my absence.

@npelikan
Copy link
Contributor Author

npelikan commented Jul 27, 2025

@jcheng5 I generally agree -- but my takeaway is a slightly different approach. In general, my goal with returning a dplyr::tbl() has been to allow for dplyr method-chaining and lazy execution(example), which IMO gives users a very happy path towards using querychat on very large datasets, with easy implementation of pushing down compute to the database (imo an essential best practice).

We've already had great luck with this with Querychat Python+Ibis on a VERY large dataset (example). I've had some time to dig into the dbplyr code and examples that twigged me onto this potential functionality (mentioned in docs here), and I think this is a gap in dbplyr functionality. So my suggested path forward here is to:

  1. For purposes of getting this R capability into the wild, update this PR so that the R functionality exactly matches the (current) Python functionality. querychat::querychat_server will return df (shiny reactive that returns an eagerly-executed data.frame) and sql (shiny reactive that returns the current sql query).
  2. File an issue with dbplyr to allow method chaining from any SELECT statement (goal would be parity with Ibis). dbplyr should be able to defend against any LLM-generated oddities in the SQL such as inline comments or trailing semicolons. Given how effective LLMs are at generating SQL, IMO this is a no-brainer addition, as it'd allow LLM outputs to be interacted with using a user-friendly dbplyr API.
  3. Create an issue for future work in this repo on the python package, to create a to_ibis() method on querychat.QueryChat. This should be easily accomplished with ibis right now.
  4. Create an issue for future work in this repo on the R package, to create a tbl output on querychat::querychat_server(). This would be contingent on item 2 above.

@npelikan
Copy link
Contributor Author

npelikan commented Jul 27, 2025

Another potential solution to Item 2 could be to implement the defensive programming ourselves, and have it feed sensible errors back to the LLM tools when necessary. I did some reprex-ing in preparation for an issue submission and found that the issue seems to be only around trailing comments on single-line queries and trailing semicolons. It seems that we could solve this with a few simple gsubs -- my reprex is below. I still think, just in the interest of getting this out there, we move forward with just an eager df+sql implementation and keep working on dplyr-chaining support.

library(dplyr)
library(RSQLite)
library(dplyr)
library(DBI)


test_df <- data.frame(
    id = 1:5,
    value = c(10, 20, 30, 40, 50),
    stringsAsFactors = FALSE
  )

temp_db <- tempfile(fileext = ".db")
conn <- dbConnect(RSQLite::SQLite(), temp_db)
dbWriteTable(conn, "test_table", test_df, overwrite = TRUE)

# Test with a simple select statement. Works.
simple_select <- tbl(conn, sql("SELECT * FROM test_table WHERE value > 20"))


# Multiline select statement. Works.
multiline_select <- sql("
SELECT *
FROM test_table
WHERE value > 20
")
multiline_select_tbl <- tbl(conn, multiline_select)


# Multiline select statement with trailing inline comment. Works.
multiline_comment <- sql("
SELECT *
FROM test_table
WHERE value > 20 -- this is a filter")
multiline_comment_tbl <- tbl(conn, multiline_comment)


# Trailing semicolon statement. Fails.
trailing_semicolon <- sql("
SELECT *
FROM test_table
WHERE value > 20;")
trailing_semicolon_tbl <- tbl(conn, trailing_semicolon)
# Error in `db_query_fields.DBIConnection()`:
# ! Can't query fields.
# ℹ Using SQL: SELECT * FROM (

# SELECT * FROM test_table WHERE value > 20 ; ) AS `q01` WHERE (0 =
#   1)
# Caused by error:
# ! near ";": syntax error


# single-line select with trailing inline comment
singleline_comment <- sql("SELECT * from test_table WHERE value > 20 --this is a filter")
singleline_comment_tbl <- tbl(conn, singleline_comment)
#  Error in `db_query_fields.DBIConnection()`:
# ! Can't query fields.
# ℹ Using SQL: SELECT * FROM (SELECT * from test_table WHERE value >
#   20 --this is a filter) AS `q02` WHERE (0 = 1)
# Caused by error:
# ! near "WHERE": syntax error


# single-line select statement with midline inline comment. Expected to fail, fails
singleline_midline_comment <- sql("SELECT * from test_table -- breakme WHERE value > 20")
singleline_midline_comment_tbl <- tbl(conn, singleline_midline_comment)
# Error in `db_query_fields.DBIConnection()`:
# ! Can't query fields.
# ℹ Using SQL: SELECT * FROM (SELECT * from test_table -- breakme
#   WHERE value > 20) AS `q03` WHERE (0 = 1)
# Caused by error:
# ! incomplete input

@chendaniely chendaniely merged commit 043667a into posit-dev:main Jul 30, 2025
16 checks passed
@gadenbuie
Copy link
Contributor

gadenbuie commented Jul 30, 2025

I think returning a tbl() is a great idea and would love to see that realized. Just an idea, but maybe you could wrap the LLM-generated SQL statement into a view, e.g.

CREATE OR REPLACE VIEW querychat_results_{table} AS
-- LLM-generated SQL statement
SELECT * FROM {table} WHERE llm_conditions;

that you execute with DBI::dbExecute(), but then you return a tbl pointing at the view, e.g. tbl(con, "querychat_results_{table}")

@npelikan
Copy link
Contributor Author

@gadenbuie great idea! The one issue I see with that though is that the database user under which querychat is accessing the data would need CREATE VIEW permissions on the database. Especially when considering the growing use of viewer-based credentials (via Connect), I don't think we should consider that a given.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants