Commit dec61c2

Merge pull request #125 from TidierOrg/cte-gen
Add `@pivot_wider` + CTE generation improvements
2 parents cdb6109 + 1ebedcd commit dec61c2

File tree

13 files changed (+382, -77 lines)

NEWS.md

Lines changed: 8 additions & 0 deletions

```diff
@@ -1,4 +1,12 @@
 # TidierDB.jl updates
+## v0.8.3 - 2025-04-11
+- adds `@drop_missing`
+- adds `@pivot_wider`
+- `db_table` or `dt` accept paths to .sas7bdat, .xpt, .sav, .zsav, .por, .dta files with DuckDB
+- Improvements to CTE generation
+- adds kwarg `overwrite = false` to `copy_to`, so that by default copying does not replace existing tables with the same name
+- separates `@summary` into its own macro for collecting summary statistics (max, min, q1, q2, q3, avg, std, count, unique) from a table or file
+
 ## v0.8.0 - 2025-03-24
 - adds `@transmute`
 - adds `@separate` and `@unite`
```

Project.toml

Lines changed: 1 addition & 1 deletion

```diff
@@ -1,7 +1,7 @@
 name = "TidierDB"
 uuid = "86993f9b-bbba-4084-97c5-ee15961ad48b"
 authors = ["Daniel Rizk <rizk.daniel.12@gmail.com> and contributors"]
-version = "0.8.2"
+version = "0.8.3"

 [deps]
 Arrow = "69666777-d1a9-59fb-9406-91d4454c9d45"
```

docs/examples/UserGuide/ex_joining.jl

Lines changed: 2 additions & 11 deletions

```diff
@@ -26,21 +26,11 @@
 # ## Examples
 # Examples below will cover how to join tables with different schemas in different databases,
 # and how to write queries on tables and then join them together, and how to do this by leveraging views.
-# <!--
+
 using TidierDB
 db = connect(duckdb())
 mtcars = dt(db, "https://gist.githubusercontent.com/seankross/a412dfbd88b3db70b74b/raw/5f23f993cd87c283ce766e7ac6b329ee7cc2e1d1/mtcars.csv")
-# -->

-# ## Setup
-# ```julia
-# using TidierDB
-# db = connect(duckdb(), "md:")
-#
-# mtcars = dt(db, "my_db.mtcars")
-# mt2 = dt(db, "ducks_db.mt2")
-# ```
-#
 # ## Wrangle tables and self join
 query = @chain mtcars begin
     @group_by cyl
@@ -70,6 +60,7 @@ end
 # To connect to a table in a different schema, prefix it with a dot. For example, "schema_name.table_name".
 # In this query, we are also filtering out cars that contain "M" in the name from the `mt2` table before joining.
 # ```julia
+# mt2 = dt(db, "ducks_db.mt2")
 # other_db = @chain dt(db, "ducks_db.mt2") @filter(!str_detect(car, "M"))
 # @chain mtcars begin
 #     @left_join(t(other_db), model == car)
```
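For readers outside Julia, the filter-then-join pattern in the commented block above can be sketched with Python's stdlib `sqlite3`; the table contents here are invented for illustration, and only the shape of the query (filter the right-hand table in a CTE, then LEFT JOIN) mirrors the TidierDB chain:

```python
import sqlite3

# In-memory stand-ins for the two tables joined in the doc example.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE mtcars (model TEXT, cyl INTEGER)")
con.execute("CREATE TABLE mt2 (car TEXT, hp INTEGER)")
con.executemany("INSERT INTO mtcars VALUES (?, ?)",
                [("Mazda RX4", 6), ("Valiant", 6)])
con.executemany("INSERT INTO mt2 VALUES (?, ?)",
                [("Mazda RX4", 110), ("Valiant", 105)])

# Filter the right-hand table first (drop cars containing "M"),
# then LEFT JOIN on model == car, as the TidierDB chain does.
rows = con.execute("""
    WITH other_db AS (SELECT * FROM mt2 WHERE car NOT LIKE '%M%')
    SELECT m.model, o.hp
    FROM mtcars AS m
    LEFT JOIN other_db AS o ON m.model = o.car
    ORDER BY m.model
""").fetchall()
```

Rows filtered out of `other_db` surface as `NULL` on the left-join side, just as in the TidierDB example.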

docs/examples/UserGuide/file_reading.jl

Lines changed: 3 additions & 1 deletion

```diff
@@ -9,9 +9,11 @@
 # - S3 buckets
 # - iceberg and delta - require additional args `delta` or `iceberg` to be set to `true`
 # - Google Sheets (first run `connect(db, :gsheets)`)
+# - .sas7bdat, .xpt, .sav, .zsav, .por, .dta : `dt(db, "any/file/path/to.sav")`
 #

-# `dt` allso supports directly using any DuckDB file reading function. This allows for easily reading in compressed files
+# `dt` also supports directly using any DuckDB file reading function. This enables easily reading in compressed files.
+
 # When reading in a compressed path, adding an `alias` is recommended.
 # - `dt(db, "read_csv('/Volumes/Untitled/phd_*_genlab.txt', ignore_errors=true)", alias = "genlab")`
```
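Per the `db_table` change elsewhere in this commit, the statistical-file paths listed above are routed through DuckDB's community `read_stat` extension by dispatching on file extension. A Python sketch of that dispatch, with hypothetical helper names (only the extension list and the `read_stat('…')` wrapping come from the commit):

```python
# Extensions routed to the read_stat community extension.
STAT_EXTENSIONS = (".sas7bdat", ".xpt", ".sav", ".zsav", ".por", ".dta")

def to_table_function(path: str) -> str:
    # Statistical-software files are wrapped in DuckDB's read_stat
    # table function; any other path passes through unchanged.
    if any(path.endswith(ext) for ext in STAT_EXTENSIONS):
        return f"read_stat('{path}')"
    return path
```

The wrapped string is then handed to DuckDB like any other table-producing function call.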

docs/examples/UserGuide/misc_tips.jl

Lines changed: 3 additions & 3 deletions

```diff
@@ -13,8 +13,8 @@ dfv = dt(db, df, "dfv");

 # ## DuckDB's SUMMARIZE
 # DuckDB has a feature to summarize tables that gives information about the table, such as mean, std, q25, q75, etc.
-# To use this feature with TidierDB, simply call an empty `@summarize`.
-@chain dfv @summarize() @collect
+# To use this feature with TidierDB, simply call `@summary` on any table or file before querying it.
+@chain dfv @summary() @collect

 # ## show_query/collect
 # If you find yourself frequently showing a query while collecting, you can define the following function
@@ -25,7 +25,7 @@ sqc(qry) = @chain qry begin

 # Call this function at the end of a chain, similar to the `@show_query` or `@collect` macros
 # _printed query is not seen here as it prints to the REPL_
-@chain dfv @summarize() sqc()
+@chain dfv @summary() sqc()

 # ## Color Printing
 # Queries print with some code words in color to the REPL. To turn off this feature, run one of the following.
```
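For intuition, the per-column statistics the new `@summary` macro collects (max, min, quartiles, avg, std, count, unique) can be sketched in plain Python. This is an illustration of the concept only, not TidierDB's or DuckDB's implementation:

```python
import statistics

def summary_stats(values):
    # Per-column statistics in the spirit of DuckDB's SUMMARIZE:
    # statistics.quantiles with n=4 yields the three quartiles.
    q1, q2, q3 = statistics.quantiles(values, n=4)
    return {
        "min": min(values), "max": max(values),
        "q1": q1, "q2": q2, "q3": q3,
        "avg": statistics.mean(values),
        "std": statistics.stdev(values),
        "count": len(values),
        "unique": len(set(values)),
    }
```

In TidierDB these statistics are computed inside the database rather than in client memory.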

docs/mkdocs.yml

Lines changed: 1 addition & 1 deletion

```diff
@@ -118,7 +118,7 @@ nav:
   - "Home": "index.md"
   - "Key Differences from TidierData.jl" : "examples/generated/UserGuide/key_differences.md"
   - "Getting Started" : "examples/generated/UserGuide/getting_started.md"
-  - "File Reading" : "examples/generated/UserGuide/file_reading.md"
+  - "File Reading/Writing" : "examples/generated/UserGuide/file_reading.md"
   - "Joining Tables" : "examples/generated/UserGuide/ex_joining.md"
   - "Aggregate and Window Functions" : "examples/generated/UserGuide/agg_window.md"
   - "Flexible Syntax and UDFs" : "examples/generated/UserGuide/udfs_ex.md"
```

src/TidierDB.jl

Lines changed: 31 additions & 9 deletions

```diff
@@ -19,7 +19,7 @@ using Crayons
 @distinct, @left_join, @right_join, @inner_join, @count, @slice_max, @union,
 @slice_min, @slice_sample, @rename, @relocate, @union_all, @setdiff, @intersect,
 @semi_join, @full_join, @transmute, @anti_join, @head, @unnest_wider, @unnest_longer,
-@separate, @unite, @drop_missing
+@separate, @unite, @drop_missing, @pivot_wider, @summary

 export db_table, set_sql_mode, connect, from_query, update_con,
 clickhouse, duckdb, sqlite, mysql, mssql, postgres, athena, snowflake, gbq,
@@ -72,6 +72,7 @@ include("relocate.jl")
 include("union_intersect_setdiff.jl")
 include("unnest.jl")
 include("sep_unite.jl")
+include("pivots.jl")


 # Unified expr_to_sql function to use right mode
@@ -173,7 +174,11 @@ function db_table(db, table, athena_params::Any=nothing; iceberg::Bool=false, de
         # println(table_name2)
         alias == "" ? alias = "gsheet" : alias = alias
         metadata = get_table_metadata(db, table_name2, alias = alias)
-    elseif startswith(table_name, "read")
+    elseif any(endswith(table_name, ext) for ext in [".sas7bdat", ".xpt", ".sav", ".zsav", ".por", ".dta"])
+        DuckDB.query(db, "install read_stat from community; load read_stat")
+        table_name2 = "read_stat('$table_name')"
+        metadata = get_table_metadata(db, table_name2)
+    elseif startswith(table_name, "read")
         table_name2 = "$table_name"
         alias = alias == "" ? "data" : alias
         # println(table_name2)
@@ -291,24 +296,41 @@ function db_table(db, table::Vector{String}, athena_params::Any=nothing)
 end

 function db_table(db, table::DataFrame, alias::String)
+    if any(any(lowercase(string(name)) == word for word in sql_words) for name in names(table))
+        found_words = [word for word in sql_words if any(lowercase(string(name)) == word for name in names(table))]
+        @warn "Column names containing SQL keywords detected: $(join(found_words, ", ")).
+        These may cause issues as they are reserved SQL keywords.
+        Consider renaming the columns before scanning to DuckDB."
+    end
+    # COV_EXCL_STOP
     DuckDB.register_data_frame(db, table, alias)
     metadata = get_table_metadata(db, alias)
     return SQLQuery(from = alias, metadata=metadata, db=db)
 end
 const dt = db_table # COV_EXCL_LINE
-# COV_EXCL_STOP
+
+
+sql_words = ["group", "select", "from", "where", "having", "order", "by", "join", "union", "case", "when", "then", "else", "end", "limit", "right", "left"] # COV_EXCL_LINE

 """
 $docstring_copy_to
 """
-function copy_to(conn, df_or_path::Union{DataFrame, AbstractString}, name::String)
+function copy_to(conn, df_or_path::Union{DataFrame, AbstractString}, name::String; overwrite::Bool=false)
     # Check if the input is a DataFrame
+    rep = overwrite ? "OR REPLACE" : ""
     if isa(df_or_path, DataFrame)
         if current_sql_mode[] == duckdb()
             name_view = name * "view"
             DuckDB.register_data_frame(conn, df_or_path, name_view)
-            DBInterface.execute(conn, "CREATE OR REPLACE TABLE $name AS SELECT * FROM $name_view")
+            DBInterface.execute(conn, "CREATE $rep TABLE $name AS SELECT * FROM $name_view")
             DBInterface.execute(conn, "DROP VIEW $name_view ")
+            # Check for SQL keywords in column names and warn if found
+            if any(any(lowercase(string(name)) == word for word in sql_words) for name in names(df_or_path))
+                found_words = [word for word in sql_words if any(lowercase(string(name)) == word for name in names(df_or_path))]
+                @warn "Column names containing SQL keywords detected: $(join(found_words, ", ")).
+                These may cause issues as they are reserved SQL keywords.
+                Consider renaming the columns before copying to DuckDB."
+            end
         end
     # COV_EXCL_START
     elseif isa(df_or_path, AbstractString)
@@ -323,24 +345,24 @@ function copy_to(conn, df_or_path::Union{DataFrame, AbstractString}, name::Strin
         end
         if occursin(r"\.csv$", df_or_path)
             # Construct and execute a SQL command for loading a CSV file
-            sql_command = "CREATE TABLE $name AS SELECT * FROM '$df_or_path';"
+            sql_command = "CREATE $rep TABLE $name AS SELECT * FROM '$df_or_path';"
             DuckDB.execute(conn, sql_command)
         elseif occursin(r"\.parquet$", df_or_path)
             # Construct and execute a SQL command for loading a Parquet file
-            sql_command = "CREATE TABLE $name AS SELECT * FROM '$df_or_path';"
+            sql_command = "CREATE $rep TABLE $name AS SELECT * FROM '$df_or_path';"
             DuckDB.execute(conn, sql_command)
         elseif occursin(r"\.arrow$", df_or_path)
             # Register the Arrow table directly with DuckDB
             arrow_table = Arrow.Table(df_or_path)
             DuckDB.register_table(conn, arrow_table, name)
         elseif occursin(r"\.json$", df_or_path)
             # Load the JSON extension, then read the JSON file into a table
-            sql_command = "CREATE TABLE $name AS SELECT * FROM read_json('$df_or_path');"
+            sql_command = "CREATE $rep TABLE $name AS SELECT * FROM read_json('$df_or_path');"
             DuckDB.execute(conn, "INSTALL json;")
             DuckDB.execute(conn, "LOAD json;")
             DuckDB.execute(conn, sql_command)
         elseif startswith(df_or_path, "read")
-            DuckDB.execute(conn, "CREATE TABLE $name AS SELECT * FROM $df_or_path;")
+            DuckDB.execute(conn, "CREATE $rep TABLE $name AS SELECT * FROM $df_or_path;")
         else
             error("Unsupported file type for: $df_or_path")
         end
```
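Two of the changes above are easy to isolate: the reserved-word check on DataFrame column names, and the `overwrite` kwarg that toggles `OR REPLACE` in the generated DDL. A Python sketch of both (the helper names are mine, not TidierDB's; the keyword list and SQL shape come from the diff):

```python
# Reserved words checked by the diff's sql_words list.
SQL_WORDS = {"group", "select", "from", "where", "having", "order", "by",
             "join", "union", "case", "when", "then", "else", "end",
             "limit", "right", "left"}

def keyword_columns(columns):
    # Column names that collide (case-insensitively) with reserved words;
    # db_table warns about these before registering a DataFrame.
    return [c for c in columns if c.lower() in SQL_WORDS]

def create_table_sql(name, source, overwrite=False):
    # overwrite=False is the new default: "OR REPLACE" is only injected
    # when explicitly requested, so existing tables are left alone.
    rep = "OR REPLACE " if overwrite else ""
    return f"CREATE {rep}TABLE {name} AS SELECT * FROM {source}"
```

With `overwrite=false`, DuckDB would error on a name collision rather than silently replacing the table, which is the behavior change the NEWS entry describes.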

src/TidierDB_macros.jl

Lines changed: 10 additions & 45 deletions

```diff
@@ -10,7 +10,7 @@ macro select(sqlquery, exprs...)
         sq = $(esc(sqlquery))
         sq = sq.post_first ? (t($(esc(sqlquery)))) : sq
         sq.post_first = false;
-        build_cte!(sq)
+        if sq.select != "" build_cte!(sq); sq.select == ""; end
         let columns = parse_tidy_db(exprs_str, sq.metadata)
             columns_str = join(["SELECT ", join([string(column) for column in columns], ", ")])
             sq.select = columns_str
@@ -51,7 +51,6 @@ macro filter(sqlquery, conditions...)
         sq.post_first = false;

         if isa(sq, SQLQuery)
-            # Early handling for non-aggregated context
             if !sq.is_aggregated
                 if sq.post_join
                     combined_conditions = String[]
@@ -61,6 +60,7 @@ macro filter(sqlquery, conditions...)
                         push!(combined_conditions, condition_str)
                     end
                     combined_condition_str = join(combined_conditions, " AND ")
+
                     sq.where = " WHERE " * combined_condition_str
                     # sq.post_join = false
                 else
@@ -72,23 +72,18 @@
                         push!(combined_conditions, condition_str)
                     end
                     combined_condition_str = join(combined_conditions, " AND ")
-                    new_cte = CTE(name=cte_name, select="*", from=(isempty(sq.ctes) ? sq.from : last(sq.ctes).name), where=combined_condition_str)
-                    up_cte_name(sq, cte_name)
-
-                    push!(sq.ctes, new_cte)
-                    sq.from = cte_name
-                    sq.cte_count += 1
-                    # matching_indices = findall(sq.metadata.name .== 2)
-                    # sq.metadata.current_selxn[matching_indices] .= 1
+
+                    sq.where = combined_condition_str
+                    # println(sq.from)
+                    build_cte!(sq)
+                    sq.select = " * "
                 end
             else
                 aggregated_columns = Set{String}()

-                # Check SELECT clause of the main query and all CTEs for aggregation functions
                 if !isempty(sq.select)
                     for part in split(sq.select, ", ")
                         if occursin(" AS ", part)
-                            # Extract the alias used after 'AS' which represents an aggregated column
                             aggregated_column = strip(split(part, " AS ")[2])
                             push!(aggregated_columns, aggregated_column)
                         end
@@ -118,20 +113,9 @@
                     end
                 end
                 if !isempty(non_aggregated_conditions)
                     combined_conditions = join(non_aggregated_conditions, " AND ")
-                    cte_name = "cte_" * string(sq.cte_count + 1)
-                    new_cte = CTE(name=cte_name, select=sq.select, from=(isempty(sq.ctes) ? sq.from : last(sq.ctes).name), groupBy = sq.groupBy, having=sq.having)
-                    up_cte_name(sq, cte_name)
-
-                    push!(sq.ctes, new_cte)
+                    build_cte!(sq)
                     sq.select = "*"
-                    sq.groupBy = ""
-                    sq.having = ""
-
                     sq.where = "WHERE " * join(non_aggregated_conditions, " AND ")
-                    sq.from = cte_name
-                    sq.cte_count += 1
-                    # matching_indices = findall(sq.metadata.name .== 2)
-                    # sq.metadata.current_selxn[matching_indices] .= 1
                 end
             end
@@ -146,13 +130,9 @@ end


 function _colref_to_string(col)
-    # If it's already a bare Symbol, just convert to string
     if isa(col, Symbol)
         return string(col)
-    # If it's an expression using the dot operator, e.g. `sales.id`
     elseif isa(col, Expr) && col.head === :.
-        # col.args[1] = the "parent" (could be another dotted expr)
-        # col.args[2] = the field name (usually a Symbol)
         parent_str = _colref_to_string(col.args[1])
         field_str = string(col.args[2].value)
         return parent_str * "." * field_str
@@ -212,16 +192,6 @@ macro group_by(sqlquery, columns...)

         sq.groupBy = group_clause

-        # if isempty(sq.select) || sq.select == "SELECT "
-        #     sq.select = "SELECT " * join(group_columns, ", ")
-        # else
-        #     for col in group_columns
-        #         if !contains(sq.select, col)
-        #             sq.select = sq.select * ", " * col
-        #         end
-        #     end
-        # end
-
         current_group_columns = group_columns
         summarized_columns = split(sq.select, ", ")[2:end] # Exclude the initial SELECT
         all_columns = unique(vcat(current_group_columns, summarized_columns))
@@ -257,15 +227,9 @@
         cte_select = !isempty(distinct_cols_str) ? " DISTINCT " * distinct_cols_str : " DISTINCT *"
         cte_select *= " FROM " * sq.from

-        # Create the CTE instance
         cte = CTE(name=cte_name, select=cte_select)
-        # Add the CTE to the SQLQuery's CTEs vector
         push!(sq.ctes, cte)
-
-        # Adjust the main query to select from the newly created CTE
         sq.from = cte_name
-
-        # Reset sq.select to ensure the final SELECT * operates correctly
         sq.select = "*"
         sq

@@ -278,7 +242,7 @@ $docstring_count
 macro count(sqlquery, group_by_columns...)
     # Set default sort expression to true.
     sort_expr = :(true)
-    # Check if the last argument is a keyword assignment for sort.
+
     if length(group_by_columns) > 0 &&
        isa(group_by_columns[end], Expr) &&
        group_by_columns[end].head == :(=) &&
@@ -426,6 +390,7 @@ macro show_query(sqlquery)
         formatted_query = replace(formatted_query, " OUTER JOIN " => "\n\tOUTER JOIN ")
         formatted_query = replace(formatted_query, " ASOF " => "\n\tASOF ")
         formatted_query = replace(formatted_query, " LIMIT " => "\n\tLIMIT ")
+        formatted_query = replace(formatted_query, " ANY_VALUE" => "\n\tANY_VALUE")

         pattern = r"\b(cte_\w+|WITH|FROM|SELECT|AS|LEFT|JOIN|RIGHT|OUTER|UNION|INNER|ASOF|GROUP\s+BY|CASE|WHEN|THEN|ELSE|END|WHERE|HAVING|ORDER\s+BY|PARTITION|ASC|DESC|INNER)\b"
         # COV_EXCL_START
```
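The recurring refactor in this file replaces hand-rolled CTE bookkeeping with a shared `build_cte!` step that snapshots the query's pending clauses into a numbered CTE and re-points `FROM` at it. A conceptual Python sketch of that consolidation (a simplification I wrote for illustration, not the Julia helper's actual code):

```python
class Query:
    """Toy model of the query state that build_cte! consolidates."""

    def __init__(self, from_):
        self.from_, self.select, self.where = from_, "", ""
        self.ctes, self.cte_count = [], 0

    def build_cte(self):
        # Snapshot the pending clauses into a numbered CTE, then point
        # the outer query at it with a clean slate for later verbs.
        self.cte_count += 1
        name = f"cte_{self.cte_count}"
        where = f" WHERE {self.where}" if self.where else ""
        self.ctes.append(
            f"{name} AS (SELECT {self.select or '*'} FROM {self.from_}{where})"
        )
        self.from_, self.select, self.where = name, "", ""

    def sql(self):
        with_ = f"WITH {', '.join(self.ctes)} " if self.ctes else ""
        return f"{with_}SELECT {self.select or '*'} FROM {self.from_}"
```

Centralizing this in one helper is what lets `@select`, `@filter`, and friends shed their duplicated `cte_name`/`push!`/`cte_count` blocks in the diff above.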
