add support for DISTINCT ON#1620
Conversation
|
This seems like a reasonable approach to me. Are you still interested in working on this? If so, I can give you a detailed review. If not, I'm happy to finish it off. |
Both options are fine with me. Whatever is more convenient to you. |
|
Are you interested in doing more PRs in the future? If so, I'm happy to invest the time in review 😀 |
|
In that case I would be happy about a review. :) |
hadley
left a comment
There was a problem hiding this comment.
Partial review; I'll complete after rebasing so it's styled with air.
| stopifnot(is_lazy_sql_part(order_by)) | ||
| check_number_whole(limit, allow_infinite = TRUE, allow_null = TRUE) | ||
| check_bool(distinct) | ||
| # distinct = FALSE -> no distinct |
There was a problem hiding this comment.
This is a great comment that helps me understand the intent of the rest of the code 😄
But it does illustrates a problem with lazy_select_query() — it isn't properly documented and thus it's hard to tell what the types of the inputs are. You don't need to fix it here, but it might be a useful follow up PR.
R/lazy-select-query.R
Outdated
| } else { | ||
| select <- new_lazy_select(select) | ||
| if (!is.logical(distinct)) { | ||
| distinct <- new_lazy_select(distinct) |
There was a problem hiding this comment.
Why do this here and not in sql_build.lazy_select_query?
There was a problem hiding this comment.
I moved it to sql_build.lazy_select_query() now. Originally I did this here because it seemed more consistent with the select variable.
R/lazy-select-query.R
Outdated
| #' @export | ||
| print.lazy_select_query <- function(x, ...) { | ||
| cat_line("<SQL SELECT", if (x$distinct) " DISTINCT", ">") | ||
| cat_line("<SQL SELECT", if (!isFALSE(x$distinct)) " DISTINCT", |
There was a problem hiding this comment.
Do you want to show the variables here?
There was a problem hiding this comment.
I added a new line that prints the DISTINCT ON variables.
R/sql-clause.R
Outdated
|
|
||
| sql_distinct <- | ||
| if (isTRUE(distinct)) { | ||
| " DISTINCT" |
There was a problem hiding this comment.
Our usual style would be to put sql_distinct <- in each branch or extract out a helper function. I'd probably lean towards a helper function since you could inline that in clause below.
There was a problem hiding this comment.
Like I have done now? Or should the helper function be inside sql_clause_select()?
hadley
left a comment
There was a problem hiding this comment.
The code seems solid to me overall — just lots of small style questions. Thanks again for working on this!
R/verb-distinct.R
Outdated
| if (is_identity(syms(distinct), names(distinct), colnames(.data))) { | ||
| TRUE | ||
| } else { | ||
| syms(distinct) |
There was a problem hiding this comment.
Do you think that's worth checking? Seems like it would be pretty rare.
There was a problem hiding this comment.
Yeah, you are right. I removed the extra check.
Thanks for the review! The ordering issue is still open (from the MR description):
Do you have thoughts on that? |
|
Hmmmm, the A quick conversation with claude suggest that |
This PR is a first draft to address duckdb/duckdb-r#384, and add support for the usage of "DISTINCT ON" when using
distinct(..., .keep_all = TRUE)which is a SQL variant supported by PostgreSQL, and DuckDB, see e.g. https://duckdb.org/docs/stable/sql/query_syntax/select.html#distinct-on-clause. Currently,.keep_all = TRUEis implemented using window functions. UsingDISTINCT ONinstead promises a performance boost of that operation.I came to the conclusion that this cannot be addressed within an external dbplyr backend, but it requires a minor modification of the
lazy_select_querydata structure itself:Currently, a
lazy_select_querysupports adistinctstate, which can be eitherTRUEorFALSE(corresponding to a normalSELECTvs. aSELECT DISTINCT.The basic idea of this PR is to add a third state to the
distinctattribute which represents a list of columns that belong to theSELECT DISTICT ON (...)clause.Dbplyr backends can opt in to make use of
DISTINCT ONvia implementing a method of the new genericsupports_distinct_on()that returnsTRUE.Open issues
DISTINCT ONusesORDER BYto specify an ordering. That is also a reason whyORDER BYis allowed in subqueries in PostgreSQL and DuckDB. As far as I can see, dbplyr currently forbids theORDER BYstatement in subqueries. I did not investigate yet, if that can be modified easily, or even can be changed at all. In any case, from the user-perspectivewindow_order()would probably still be the right verb to specify the order.sql_clause()is not used correctly.distinctattribute that holds eitherTRUE,FALSEor a column list representation leads to a lot of required case distinction checks in the code, which are rather unpleasent to read and complicate the code. There is probably a way to model this in a more streamlined fashion.I would appreciate feedback to the open issues as well as the already existing code. It might not be the right approach after all.