Commit 376aa17

improve on the article
1 parent c3aa14f commit 376aa17

6 files changed

+232
-45
lines changed

content/post/2025-03-19-r-basic-advanceds-variables-and-names-in-dplyr/embracing.svg

Lines changed: 115 additions & 0 deletions

content/post/2025-03-19-r-basic-advanceds-variables-and-names-in-dplyr/index.Rmd

Lines changed: 61 additions & 23 deletions
@@ -11,6 +11,7 @@ output:
1111
toc: true
1212
images:
1313
- selection-ambiguity.png
14+
- embracing.png
1415
---
1516

1617
## Intro
@@ -120,76 +121,113 @@ my_subset_with_symbols <- function(data, my_var_as_symbol) {
120121
121122
my_subset_with_symbols(iris, Petal.Length)
122123
123-
my_subset_with_symbols(iris, Petal.Length, Sepal.Width)
124+
# We still need to wrap column names in a vector if we provide more than one of them for a single parameter
125+
# (or we could accept `...` in the function instead, but that is a separate design question)
126+
my_subset_with_symbols(iris, c(Petal.Length, Sepal.Width))
124127
```
125128

126129
This tells dplyr that `my_var_as_symbol` has to be passed on exactly as the user provided it. We can think of embracing as a cut-and-paste operation. We tell dplyr: "Take what the user provided in place of `my_var_as_symbol` in the function call and plug it directly into `select()`, without creating any intermediate variables." The call to `my_subset_with_symbols()` is basically replaced with what lies inside of it.
127130

131+
![Diagram showing how embracing works.](embracing.png)
132+
128133
## Problem 3: Dynamic columns in purrr formulas in `across`
129134

130-
While the above solutions work seamlessly with functions like `dplyr::select()`, challenges arise when operations grow complex. Suppose we wish to craft a function, `do_magic`, that takes data, a special `column`, and several `others` columns. This function should add the special column to all others.
135+
While the above solutions work seamlessly with functions like `dplyr::select()`, challenges arise when operations grow more complex. Suppose we wish to craft a function, `do_magic`, that takes `data`, a `special` column, and several `other` columns. This function should add the `special` column to each of the `other` columns. For now, make no assumptions about the form in which the `special` and `other` parameters are provided.
136+
137+
The naive way of doing it would be to construct a `dplyr::mutate()` call that operates on each of the provided columns one by one:
131138

132-
Leveraging `dplyr::mutate(dplyr::across())` can achieve this. Its syntax is:
133139

134140
```{r eval=FALSE}
135-
mutate(across(columns_to_mutate, function_to_apply))
141+
# only for illustration purposes, won't actually work:
142+
data %>%
143+
mutate(
144+
other[[1]] = other[[1]] + special,
145+
other[[2]] = other[[2]] + special,
146+
...
147+
other[[N]] = other[[N]] + special
148+
)
136149
```
137150

138-
For custom, unnamed functions, the *purrr formula syntax* (`~ expression` with `.x`) is beneficial. In our case (without enclosing it in a function yet) could look like:
139151

140-
```{r, eval=FALSE}
141-
iris %>%
142-
mutate(across(all_of(c("Sepal.Length", "Sepal.Width")), ~ .x - Petal.Length))
152+
As you might already know, the code above will not work, either inside or outside a function: you cannot index a character vector or a symbol on the left-hand side of an argument assignment in a `dplyr::mutate()` call. We need another tool: `dplyr::across()`. Its syntax is:
153+
154+
```{r eval=FALSE}
155+
data %>% mutate(across(columns_to_mutate, function_to_apply))
156+
```
157+
158+
For custom, unnamed functions, the *function shorthand syntax* `\(x)` (available since R 4.1) is convenient. The idea from the example above could be rewritten as:
159+
160+
```{r eval=FALSE}
161+
# still won't work, but we are getting somewhere:
162+
data %>%
163+
mutate(
164+
across(other, \(x) x + special)
165+
)
143166
```
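
For comparison, a version with concrete column names does run outside a function. This is a minimal sketch on the built-in `iris` data; the column choice and the name `result` are assumptions made only for this illustration.

```r
library(dplyr)

# Add Petal.Length to two other columns in one go.
# The \(x) shorthand requires R >= 4.1.
result <- iris %>%
  mutate(across(c(Sepal.Length, Sepal.Width), \(x) x + Petal.Length))

head(result, 3)
```

Here `Petal.Length` is typed literally, so dplyr can find it in the data. The difficulty discussed in this section only appears once the column names arrive as function parameters.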
144167

145-
Elegant, isn't it? Now, let's proceed by encapsulating this logic within a function where column names are passed as strings:
168+
Now it is time to encapsulate this in a function and think about how to pass those column names as parameters. Armed with the knowledge from the previous section of this article, we might try embracing first:
146169

147170
```{r}
148-
do_magic <- function(data, special, others) {
171+
do_magic <- function(data, special, other) {
149172
data %>%
150-
mutate(across(all_of(others), ~ .x - all_of(special)))
173+
mutate(across({{ other }}, \(x) x + {{ special }}))
174+
}
175+
176+
do_magic(iris, Petal.Length, c(Sepal.Length, Petal.Width))
177+
```
178+
179+
Hooray! It works just fine! However, at this point it is worth asking another question: what if we want to pass those parameters as strings? Again, we can go back to the earlier example and use helper functions to turn the strings into actual selections:
180+
181+
```{r}
182+
do_magic <- function(data, special, other) {
183+
data %>%
184+
mutate(across(all_of(other), \(x) x - all_of(special)))
151185
}
152186
153187
# won't work:
154-
# do_magic(iris, special = "Petal.Length", others = c("Sepal.Length", "Sepal.Width"))
188+
# do_magic(iris, special = "Petal.Length", other = c("Sepal.Length", "Sepal.Width"))
155189
```
156190

157-
Surprisingly, it fails! When used within the context of `across`, dplyr seems unable to utilize the tidyselect rules (the ones that make `all_of()` possible). But we're not defeated; let's try embracing:
191+
Surprisingly, it fails! The reason is simple: the function we pass into `across()` (here, the lambda that calls `all_of(special)`) cannot make use of the selection helper, because that helper is not expected there. Tidyselect rules (the ones that make `all_of()` and its friends possible) are not automagical; the function designer has to invoke them explicitly. `dplyr::select()` knows that it may receive such expressions, but inside an arbitrary function they are not evaluated as selections.
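
To see the failure without stopping the whole script, the call can be wrapped in `tryCatch()`. This is only a sketch: `do_magic_strings` and `outcome` are hypothetical names introduced here, and the exact error or warning text is an assumption that depends on the installed dplyr and tidyselect versions.

```r
library(dplyr)

# Hypothetical copy of the string-based attempt, renamed for this sketch.
do_magic_strings <- function(data, special, other) {
  data %>%
    mutate(across(all_of(other), \(x) x - all_of(special)))
}

# Capture the error condition instead of aborting.
outcome <- tryCatch(
  do_magic_strings(iris, "Petal.Length", c("Sepal.Length", "Sepal.Width")),
  error = function(e) e
)

inherits(outcome, "error")
conditionMessage(outcome)
```

The first `all_of(other)` is fine because it sits in the column-selection argument of `across()`. The second one sits inside an ordinary function body, where no selection is taking place.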
192+
193+
So, what to do now? We can try a mixed approach with embracing:
158194

159195
```{r}
160-
do_magic_but_better <- function(data, special, others) {
196+
do_magic_but_better <- function(data, special, other) {
161197
data %>%
162-
mutate(across(all_of(others), ~ .x - {{special}}))
198+
mutate(across(all_of(other), ~ .x - {{special}}))
163199
}
164200
165-
do_magic_but_better(iris, special = Petal.Length, others = c("Sepal.Length", "Sepal.Width"))
201+
do_magic_but_better(iris, special = Petal.Length, other = c("Sepal.Length", "Sepal.Width"))
166202
```
167203

168-
By adopting this approach, it's imperative to provide special as a symbol. Also, this does not look fine: one parameter is provided as symbol, another one is as character vector... **We should always aim at being consistent**. Either all column-like parameters should be symbols or all should be character strings. There are pros and cons to both ways. Let's say that we want to stick to strings only. How can we do it?
204+
This works. How come embracing inside an anonymous function works while the `all_of()` helper does not? They rely on very different mechanisms, and a detailed explanation is beyond the scope of this article. To simplify: embracing is a more general approach that replaces one chunk of code with another provided as a parameter.
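
A small sketch can make the "replace one chunk of code with another" idea more tangible. The rlang documentation describes `{{ x }}` as a combination of `enquo()` and `!!`, so looking at what `enquo()` captures shows what gets pasted in. The helper `captured_expr` is a hypothetical name used only for this illustration.

```r
library(rlang)

# Hypothetical helper: returns the expression the caller typed for `col`,
# without evaluating it.
captured_expr <- function(col) {
  quo_get_expr(enquo(col))
}

captured_expr(Petal.Length)
captured_expr(Sepal.Length + 1)
```

Whatever the caller typed is kept as code rather than as a value, which is why it can be dropped into any place where dplyr evaluates expressions against the data, including the body of a function passed to `across()`.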
205+
206+
The one issue with the above approach is that it does not look right: one parameter is provided as a symbol, the other as a character vector... **We should always aim to be consistent**. Either all column-like parameters should be symbols or all should be character strings. There are pros and cons to both. Let's say that we want to stick to strings only. How can we do it?
169207

170208
#### Tip: when `all_of()` does not work, use `.data`
171209

172210
There's a workaround for this conundrum:
173211

174212
```{r}
175-
do_magic_but_in_other_way <- function(data, special, others) {
213+
do_magic_but_in_other_way <- function(data, special, other) {
176214
data %>%
177-
mutate(across(all_of(others), ~ .x - .data[[special]]))
215+
mutate(across(all_of(other), ~ .x - .data[[special]]))
178216
}
179217
180-
do_magic_but_in_other_way(iris, special = "Petal.Length", others = c("Sepal.Length", "Sepal.Width"))
218+
do_magic_but_in_other_way(iris, special = "Petal.Length", other = c("Sepal.Length", "Sepal.Width"))
181219
```
182220

183-
When you need to reference the underlying data within the context of functions, the `.data` pronoun comes to the rescue. As demonstrated, it operates similarly to directly accessing the data.
221+
When you need to reference the underlying data within dplyr functions, the `.data` pronoun comes to the rescue. It is also available inside the function that is evaluated within the `across()` helper. As demonstrated, it behaves much like accessing the data directly, so we can use the regular base extraction operator `[[`.
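
As a quick sanity check of that claim, the sketch below compares a column looked up through `.data[[...]]` with the same column written out by name. The variable names `by_pronoun` and `by_name` and the new column `diff` are assumptions made for this example.

```r
library(dplyr)

special <- "Petal.Length"

# Column looked up by a string stored in a variable.
by_pronoun <- iris %>% mutate(diff = Sepal.Length - .data[[special]])

# Same computation with the column name typed directly.
by_name <- iris %>% mutate(diff = Sepal.Length - Petal.Length)

identical(by_pronoun, by_name)
```

Because `.data[[special]]` takes an ordinary character value, the same pattern works whether the string comes from a function parameter, a config file, or user input.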
184222

185223
## Summary & Next Steps
186224

187-
Throughout this post, we ventured deep into some of the intricacies of dplyr. We've unraveled how the package strives to make our code both semantic and syntactic, all while simplifying complex operations. The power of symbols and the utility of functions like `all_of()` and `.data` demonstrate just how dynamic and adaptable dplyr can be, especially when working with variable column names. While we've covered much ground, the world of dplyr is vast and constantly evolving. We are aware that all this *embracing* and *tidyselect* rules might be intimidating, but we will continue to explore more facets of the tidyverse in future posts of "basic advanceds", aiming to empower you with advanced techniques that enhance your data analysis journey.
225+
Throughout this post, we ventured deep into some of the intricacies of dplyr. We've unraveled how the package strives to make our code both semantic and syntactic, all while simplifying complex operations. The power of symbols and the utility of functions and pronouns like `all_of()` and `.data` demonstrate just how dynamic and adaptable dplyr can be, especially when working with variable column names. While we've covered much ground, the world of dplyr is vast and constantly evolving. We are aware that all these *embracing* and *tidyselect* rules might be intimidating, but we will continue to explore more facets of the tidyverse in future posts of "basic advanceds", aiming to empower you with advanced techniques that enhance your data analysis journey.
188226

189227
If you've found this post enlightening and wish to delve deeper, or if you have any questions or insights, we'd love to hear from you! You can contact us directly via [X](https://twitter.com/Rturtletopia). Alternatively, for those who prefer a more open-source avenue, feel free to open an issue on our [GitHub](https://github.com/turtletopia/turtletopia.github.io/issues) repository. Your feedback and insights not only help us improve, but they also contribute to the broader data science community.
190228

191229
Until next time, keep exploring, learning, and sharing!
192230

193231
## Dive Deeper: Resources for the Curious Minds:
194232

195-
For those wishing to delve further or who may have lingering questions: [dplyr official programming guide](https://dplyr.tidyverse.org/articles/programming.html)
233+
For those wishing to delve further or who may have lingering questions, a great resource is the [official dplyr programming guide](https://dplyr.tidyverse.org/articles/programming.html). If this is still not enough for you, we recommend a few chapters of the [Advanced R book](https://adv-r.hadley.nz/metaprogramming.html) that focus on metaprogramming and the underlying tools used to build the tidyverse.