Commit b46cc6f

Merge pull request #92 from TidierOrg/parse-blocks
2 parents 45e5013 + 55f4757 commit b46cc6f

18 files changed, +226 -22 lines changed

NEWS.md

Lines changed: 4 additions & 0 deletions
@@ -1,5 +1,9 @@
 # TidierData.jl updates
 
+## v0.15.0 - 2024-02-25
+- Add support for `begin-end` blocks for all macros accepting multiple expressions
+- Bug fix to add support for expressions inside of `@group_by()`, as in `@group_by(b = a + 1)`
+
 ## v0.14.7 - 2024-02-16
 - Bug fix to allow `PackageName.function()` within macros to be used without escaping

Project.toml

Lines changed: 1 addition & 1 deletion
@@ -1,7 +1,7 @@
 name = "TidierData"
 uuid = "fe2206b3-d496-4ee9-a338-6a095c4ece80"
 authors = ["Karandeep Singh"]
-version = "0.14.7"
+version = "0.15.0"
 
 [deps]
 Chain = "8be319e6-bccf-4806-a6f7-6fae938471bc"

README.md

Lines changed: 1 addition & 11 deletions
@@ -13,7 +13,7 @@ TidierData.jl is a 100% Julia implementation of the dplyr and tidyr R packages.
 extensive meta-programming capabilities, TidierData.jl is an R user’s love
 letter to data analysis in Julia.
 
-`TidierData.jl` has three goals, which differentiate it from other data analysis
+`TidierData.jl` has two goals, which differentiate it from other data analysis
 meta-packages in Julia:
 
 1. **Stick as closely to dplyr and tidyr syntax as possible:** Whereas other
@@ -30,16 +30,6 @@ meta-packages in Julia:
 automatically vectorized. Read the documentation page on "Autovectorization"
 to read about how this works, and how to override the defaults.
 
-3. **Make scalars and tuples mostly interchangeable:** In Julia, the function
-`across(a, mean)` is dispatched differently than `across((a, b), mean)`.
-The first argument in the first instance above is treated as a scalar,
-whereas the second instance is treated as a tuple. This can be very confusing
-to R users because `1 == c(1)` is `TRUE` in R, whereas in Julia `1 == (1,)`
-evaluates to `false`. The design philosophy in `TidierData.jl` is that the user
-should feel free to provide a scalar or a tuple as they see fit anytime
-multiple values are considered valid for a given argument, such as in
-`across()`, and `TidierData.jl` will figure out how to dispatch it.
-
 ## Installation
 
 For the stable version:

docs/examples/UserGuide/piping.jl

Lines changed: 131 additions & 0 deletions
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,131 @@
1+
# The easiest way to use TidierData.jl for complex data transformation operations is to connect them together using pipes. Julia comes with the built-in `|>` pipe operator, but TidierData.jl also includes and re-exports the `@chain` macro from the Chain.jl package. On this page, we will show you how to use both approaches.
2+
3+
# First, let's load a dataset.
4+
5+
using TidierData
6+
using RDatasets
7+
8+
movies = dataset("ggplot2", "movies");
9+
10+
# ## Julia's built-in `|>` pipe
11+
12+
# If we wanted to figure out the number of rows in the `movies` data frame, one way to do this is to apply the `nrow()` function to movies. The most straightforward way is to write it like this:
13+
14+
nrow(movies)
15+
16+
# Another perfectly valid way to write this expression is by piping `movies` into `nrow` using the `|>` pipe operator.
17+
18+
movies |> nrow
19+
20+
# Why might we want to do this? Well, whereas the first expression would naturally be read as "Calculate the number of rows of movies," the second expression reads as "Start with movies, then calculate the number of rows." For a simple expression, these are easy enough to reason about. However, as we start to pipe more and more functions in a single expression, the piped version becomes much easier to reason about.
21+
22+
# One quick note about Julia's built-in pipe: writing `movies |> nrow()` would *not* be considered valid. This is because Julia's built-in pipe always expects a function and *not* a function call. Writing `nrow` by itself is *naming* the function, whereas writing `nrow()` is *calling* the function. This quickly becomes an issue once we want to supply arguments to the function we are calling.
23+
24+
# Consider another approach to calculating the number of rows:
25+
26+
size(movies, 1)
27+
28+
# In this case, the `size()` function returns a tuple of `(rows, columns)`, and if you supply an optional second argument specifying the index of the tuple, it returns only that dimension. In this case, we called `size()` with a second argument of `1`, indicating that we only wanted the function to return the number of rows.
29+
30+
# How would we write this using Julia's built-in pipe?
31+
32+
movies |>
33+
x -> size(x, 1)
34+
35+
# You might have wanted to write `movies |> size(1)`, but because `size(1)` would represent a function *call*, we have to wrap the function call within an anonymous function, which is easily accomplished using the `x -> func(x, arg1, arg2)` syntax, where `func()` refers to any function and `arg1` and `arg2` refer to any additional arguments that are needed.
36+
37+
# Another way we could have accomplished this is to calculate `size`, which returns a tuple of `(rows, columns)`, and then to use an anonymous function to grab the first value. Since we are calculating `size` without any arguments, we can simply write `size` within the pipe. However, to grab the first value using the `x[1]` syntax, we have to define an anonymous function. Putting it all together, we get this approach to piping:
38+
39+
movies |>
40+
size |>
41+
x -> x[1]
42+
+# ## Using the `@chain` macro
+
+# The `@chain` macro comes from the Chain.jl package and is included and re-exported by TidierData.jl. Let's do this same series of exercises using `@chain`.
+
+# Let's calculate the number of rows using `@chain`.
+
+@chain movies nrow
+
+# One of the reasons we prefer the use of `@chain` in TidierData.jl is that it is so concise. There is no need for any operator. Another interesting thing is that `@chain` doesn't care whether you use a function *name* or a function *call*. Both approaches work. As a result, writing `nrow()` instead of `nrow` is equally valid using `@chain`.
+
+@chain movies nrow()
+
+# There are two options for writing out multi-row chains. The preferred approach is as follows, where the starting item is listed, followed by a `begin-end` block.
+
+@chain movies begin
+    nrow
+end
+
+# `@chain` also comes with a built-in placeholder, which is `_`. To calculate the `size` and extract the first value, we can use this approach:
+
+@chain movies begin
+    size
+    _[1]
+end
+
+# You don't have to list the data frame before the `begin-end` block. This is equally valid:
+
+@chain begin
+    movies
+    size
+    _[1]
+end
+
+# The only time this approach is preferred is when instead of simply naming the data frame, you are using a function to read in the data frame from a file or database. Because this function call may include the path of the file, which could be quite long, it's easier to write this on its own line within the `begin-end` block.
+
+# While the documentation for TidierData.jl follows the convention of placing piped functions on separate lines of code using `begin-end` blocks, this is purely convention for ease of readability. You could rewrite the code above without the `begin-end` block as follows:
+
+@chain movies size _[1]
+
+# For simple transformations, this approach is both concise and readable.
+
+# ## Using `@chain` with TidierData.jl
+
+# Returning to our convention of multi-line pipes, let's grab the first five movies that were released since 2000 and had a rating of at least 9 out of 10. Here is one way that we could write this:
+
+@chain movies begin
+    @filter(Year >= 2000 && Rating >= 9)
+    @slice(1:5)
+end
+
+# Note: we generally prefer using `&&` in Julia because it is a "short-circuit" operator. If the first condition evaluates to `false`, then the second condition is not even evaluated, which makes it faster (because it stops early).
+
+# In the case of `@filter`, multiple conditions can be written out as separate expressions.
+
+@chain movies begin
+    @filter(Year >= 2000, Rating >= 9)
+    @slice(1:5)
+end
+
+# Another way to write this expression is to take advantage of the fact that Julia macros can be called without parentheses. In this case, we will add back the `&&` for the sake of readability.
+
+@chain movies begin
+    @filter Year >= 2000 && Rating >= 9
+    @slice 1:5
+end
+
+# Lastly, TidierData.jl also supports multi-line expressions within each of the macros that accept multiple expressions. So you could also write this as follows:
+
+@chain movies begin
+    @filter begin
+        Year >= 2000
+        Rating >= 9
+    end
+    @slice 1:5
+end
+
+# What's nice about this approach is that if you want to remove some criteria, you can easily comment out the relevant parts. For example, if you're willing to consider older movies, just comment out the `Year >= 2000`.
+
+@chain movies begin
+    @filter begin
+        ## Year >= 2000
+        Rating >= 9
+    end
+    @slice 1:5
+end
+
+# ## Which approach to use?
+
+# The purpose of this page was to show you that both Julia's native pipes and the `@chain` macro are perfectly valid and capable. We prefer the use of `@chain` because it is a bit more flexible and concise, with a syntax that makes it easy to comment out individual operations. We have adopted a similar `begin-end` block functionality within TidierData.jl itself, so that you can spread arguments out over multiple lines if you prefer. In the end, the choice is up to you!
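
The behaviour of the built-in pipe described in piping.jl can be checked without TidierData.jl or RDatasets. The sketch below is not part of the commit: it substitutes a small matrix `m` (an assumption, standing in for `movies`) and uses only Base functions.

```julia
# Not part of the commit: a stand-in matrix replaces the `movies` data frame.
m = zeros(3, 2)

# Naming functions works with the built-in pipe.
n_named = m |> size |> first        # 3

# Supplying an extra argument requires an anonymous function.
n_anon = m |> x -> size(x, 1)       # 3

# Writing a call instead of a name evaluates `size()` first, which fails
# because `size` has no zero-argument method.
call_failed = try
    m |> size()
    false
catch err
    err isa MethodError
end
```

Here `call_failed` ends up `true`, which is the failure mode the page warns about when it says `movies |> nrow()` is not valid.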

docs/mkdocs.yml

Lines changed: 1 addition & 0 deletions
@@ -127,6 +127,7 @@ nav:
 - "@arrange" : "examples/generated/UserGuide/arrange.md"
 - "@distinct" : "examples/generated/UserGuide/distinct.md"
 - "across" : "examples/generated/UserGuide/across.md"
+- "Piping" : "examples/generated/UserGuide/piping.md"
 - "Conditionals": "examples/generated/UserGuide/conditionals.md"
 - "Joins" : "examples/generated/UserGuide/joins.md"
 - "Binding" : "examples/generated/UserGuide/binding.md"

docs/src/index.md

Lines changed: 1 addition & 6 deletions
@@ -8,7 +8,7 @@ Powered by the DataFrames.jl package and Julia’s
 extensive meta-programming capabilities, TidierData.jl is an R user’s love
 letter to data analysis in Julia.
 
-`TidierData.jl` has three goals, which differentiate it from other data analysis
+`TidierData.jl` has two goals, which differentiate it from other data analysis
 meta-packages in Julia:
 
 ```@raw html
@@ -45,11 +45,6 @@ meta-packages in Julia:
 Broadcasting trips up many R users switching to Julia because R users are used to most functions being vectorized. `TidierData.jl` currently uses a lookup table to decide which functions *not* to vectorize; all other functions are automatically vectorized. Read the documentation page on "Autovectorization" to read about how this works, and how to override the defaults. An example of where this issue commonly causes errors is when centering a variable. To create a new column `a` that centers the column `b`, `TidierData.jl` lets you simply write `a = b - mean(b)` exactly as you would in R. This works because `TidierData.jl` knows to *not* vectorize `mean()` while also recognizing that `-` *should* be vectorized such that this expression is rewritten in `DataFrames.jl` as `:b => (b -> b .- mean(b)) => :a`. For any user-defined function that you want to "mark" as being non-vectorized, you can prefix it with a `~`. For example, a function `new_mean()`, if it had the same functionality as `mean()` *would* normally get vectorized by `TidierData.jl` unless you write it as `~new_mean()`.
 ```
 
-```@raw html
-??? tip "Make scalars and tuples mostly interchangeable."
-In Julia, the function `across(a, mean)` is dispatched differently than `across((a, b), mean)`. The first argument in the first instance above is treated as a scalar, whereas the second instance is treated as a tuple. This can be very confusing to R users because `1 == c(1)` is `TRUE` in R, whereas in Julia `1 == (1,)` evaluates to `false`. The design philosophy in `TidierData.jl` is that the user should feel free to provide a scalar or a tuple as they see fit anytime multiple values are considered valid for a given argument, such as in `across()`, and `TidierData.jl` will figure out how to dispatch it.
-```
-
 ## Installation
 
 For the stable version:

src/TidierData.jl

Lines changed: 11 additions & 1 deletion
@@ -73,6 +73,7 @@ end
 $docstring_select
 """
 macro select(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -131,6 +132,7 @@ end
 $docstring_transmute
 """
 macro transmute(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -189,6 +191,7 @@ end
 $docstring_rename
 """
 macro rename(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -247,6 +250,7 @@ end
 $docstring_mutate
 """
 macro mutate(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -305,6 +309,7 @@ end
 $docstring_summarize
 """
 macro summarize(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs; from_summarize = true)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -376,6 +381,7 @@ end
 $docstring_filter
 """
 macro filter(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -434,6 +440,7 @@ end
 $docstring_group_by
 """
 macro group_by(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -444,7 +451,7 @@ macro group_by(df, exprs...)
     grouping_exprs = parse_group_by.(exprs)
 
     df_expr = quote
-        local any_expressions = all(typeof.($tidy_exprs) .!= QuoteNode)
+        local any_expressions = any(typeof.($tidy_exprs) .!= QuoteNode)
 
         if $any_found_n || $any_found_row_number || any_expressions
             if $(esc(df)) isa GroupedDataFrame
@@ -494,6 +501,7 @@ end
 $docstring_arrange
 """
 macro arrange(df, exprs...)
+    exprs = parse_blocks(exprs...)
     arrange_exprs = parse_desc.(exprs)
     df_expr = quote
         if $(esc(df)) isa GroupedDataFrame
@@ -518,6 +526,7 @@ end
 $docstring_distinct
 """
 macro distinct(df, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
@@ -620,6 +629,7 @@ end
 $docstring_rename_with
 """
 macro rename_with(df, fn, exprs...)
+    exprs = parse_blocks(exprs...)
     interpolated_exprs = parse_interpolation.(exprs)
 
     tidy_exprs = [i[1] for i in interpolated_exprs]
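
The one-word change in `@group_by` (`all` to `any`) matters when bare column names and expressions are mixed in one call. The sketch below is not from the commit. It assumes, for illustration only, that bare columns reach this check as `QuoteNode`s and that expressions such as `b = a + 1` reach it as `Expr`s, which is what the `typeof.(...) .!= QuoteNode` test suggests. Other parts of the fix may live in files not shown here.

```julia
# Hypothetical stand-in for `tidy_exprs` after parsing `@group_by(a, b = a + 1)`.
tidy_exprs = [QuoteNode(:a), :(b = a + 1)]

# true wherever an element is something other than a bare column.
is_expression = typeof.(tidy_exprs) .!= QuoteNode   # [false, true]

old_check = all(is_expression)   # false: the mixed case is treated as "no expressions"
new_check = any(is_expression)   # true: one expression is enough to take the expression branch
```

Under these assumptions the old check would only fire when every argument was an expression, so a call mixing a column with `b = a + 1` would skip the branch that handles expressions.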

src/binding.jl

Lines changed: 2 additions & 0 deletions
@@ -2,6 +2,7 @@
 $docstring_bind_rows
 """
 macro bind_rows(df, exprs...)
+    exprs = parse_blocks(exprs...)
     tidy_exprs = parse_bind_args.(exprs)
     locate_id = findfirst(i -> i[2], tidy_exprs)
     if locate_id isa Nothing
@@ -23,6 +24,7 @@ end
 $docstring_bind_cols
 """
 macro bind_cols(df, exprs...)
+    exprs = parse_blocks(exprs...)
     tidy_exprs = parse_bind_args.(exprs)
     df_vec = [i[1] for i in tidy_exprs]

src/compound_verbs.jl

Lines changed: 2 additions & 0 deletions
@@ -6,6 +6,7 @@
 $docstring_tally
 """
 macro tally(df, exprs...)
+    exprs = parse_blocks(exprs...)
     wt, sort = parse_tally_args(exprs...)
 
     wt_quoted = QuoteNode(wt)
@@ -51,6 +52,7 @@ end
 $docstring_count
 """
 macro count(df, exprs...)
+    exprs = parse_blocks(exprs...)
     col_names, wt, sort = parse_count_args(exprs...)
 
     col_names_quoted = QuoteNode(col_names)
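
Each macro touched in the source files above now begins with `exprs = parse_blocks(exprs...)`, but the definition of `parse_blocks` is not shown in this part of the diff. The sketch below is a guess at the general idea, not the package's code: `flatten_block_sketch` is a hypothetical name, and the rule it applies (a single `begin-end` argument is unpacked into its statements) is an assumption based on the NEWS entry.

```julia
# Hypothetical helper, NOT the package's `parse_blocks`.
# If the only argument is a `begin ... end` block, return its statements
# as a tuple; otherwise return the arguments unchanged.
function flatten_block_sketch(exprs...)
    if length(exprs) == 1 && exprs[1] isa Expr && exprs[1].head == :block
        # Drop the LineNumberNodes that Julia's parser inserts into blocks.
        return Tuple(filter(x -> !(x isa LineNumberNode), exprs[1].args))
    end
    return exprs
end

# What a macro would receive from `@filter begin ... end`.
block = quote
    Year >= 2000
    Rating >= 9
end

from_block = flatten_block_sketch(block)
# What a macro would receive from `@filter(Year >= 2000, Rating >= 9)`.
from_args = flatten_block_sketch(:(Year >= 2000), :(Rating >= 9))
```

With a step like this at the top of a macro, the rest of the body can treat `@filter begin ... end` and `@filter(..., ...)` identically, which would explain why the commit adds the same single line to every macro.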

src/docstrings.jl

Lines changed: 33 additions & 3 deletions
@@ -424,7 +424,24 @@ rows as `df`.
 julia> df = DataFrame(a = 'a':'e', b = 1:5, c = 11:15);
 
 julia> @chain df begin
-       @mutate(d = b + c, b_minus_mean_b = b - mean(b))
+       @mutate(d = b + c,
+               b_minus_mean_b = b - mean(b))
+       end
+5×5 DataFrame
+ Row │ a     b      c      d      b_minus_mean_b
+     │ Char  Int64  Int64  Int64  Float64
+─────┼───────────────────────────────────────────
+   1 │ a         1     11     12            -2.0
+   2 │ b         2     12     14            -1.0
+   3 │ c         3     13     16             0.0
+   4 │ d         4     14     18             1.0
+   5 │ e         5     15     20             2.0
+
+julia> @chain df begin
+       @mutate begin
+           d = b + c
+           b_minus_mean_b = b - mean(b)
+       end
        end
 5×5 DataFrame
  Row │ a     b      c      d      b_minus_mean_b
@@ -511,14 +528,27 @@ Create a new DataFrame with one row that aggregating all observations from the i
 julia> df = DataFrame(a = 'a':'e', b = 1:5, c = 11:15);
 
 julia> @chain df begin
-       @summarize(mean_b = mean(b), median_b = median(b))
+       @summarize(mean_b = mean(b),
+                  median_b = median(b))
        end
 1×2 DataFrame
  Row │ mean_b   median_b
      │ Float64  Float64
 ─────┼───────────────────
    1 │     3.0       3.0
-
+
+julia> @chain df begin
+       @summarize begin
+           mean_b = mean(b)
+           median_b = median(b)
+       end
+       end
+1×2 DataFrame
+ Row │ mean_b   median_b
+     │ Float64  Float64
+─────┼───────────────────
+   1 │     3.0       3.0
+
 julia> @chain df begin
        @summarise(mean_b = mean(b), median_b = median(b))
        end
