Commit bc143a0

rework content of the article
1 parent c7a81ec commit bc143a0

4 files changed

+1494
-227
lines changed
Lines changed: 139 additions & 126 deletions
@@ -1,126 +1,139 @@
1-
---
2-
title: 'R Basic Advanceds: Variables and Names in dplyr'
3-
author: Dominik Rafacz
4-
date: '2023-01-30'
5-
slug: r-basic-advanceds-variables-and-names-in-dplyr
6-
categories: ['Tutorial']
7-
tags: ['r', 'tutorial', 'dplyr', 'environments', 'rlang']
8-
---
9-
10-
# Intro
11-
12-
Hello everyone! We've had quite a long hiatus caused by various life things (including graduating from college, changing jobs and lawsuits), but we're back and want to bring the blog back to life. And since the basics of advanced methods is what interests me the most, today I'm going to introduce you to a post about something that sooner or later every dplyr user will encounter.
13-
14-
```{r message=FALSE, warning=FALSE, include=FALSE}
15-
library(dplyr)
16-
iris <- iris %>% slice(1:5)
17-
```
18-
19-
# Problem 1: symbols treated as names vs variables containing strings with names
20-
21-
dplyr verbs are very convenient, because we can use *symbols* instead of constantly accessing the data with *strings with names*. E.g. compare the code for selecting columns in data frame in base R and in dplyr:
22-
23-
```{r eval=FALSE}
24-
# base
25-
iris[, c("Sepal.Length", "Sepal.Width")]
26-
27-
# dplyr
28-
iris %>%
29-
select(Sepal.Length, Sepal.Width)
30-
```
31-
32-
We can see the clear difference between the base style and the dplyr style:
33-
34-
* `"Sepal.Length", "Sepal.Width"` -- strings with names, they are quoted by " or '
35-
* `Sepal.Length, Sepal.Width` -- symbols, they are not quoted (or quoted using \` character, if they contain spaces)
36-
37-
In the second case *symbols* are used to access columns in data.frame (it is not important how that works in details, we will take care of that in another post). It is crucial to know this difference and understand it, as it might help us not to fall into the traps which I will discuss in the rest of the post.
38-
39-
dplyr introduced this syntax, because it is simpler. Less characters to type is always nice, especially when we are using column names really often, which usually is the case.
40-
41-
Sometimes, however, we do not know names of columns to select in advance. E.g., we have an external variable containing names of columns to select, like this:
42-
43-
```{r}
44-
my_variables <- c("Sepal.Length", "Sepal.Width")
45-
```
46-
47-
We can provide it to `select` directly:
48-
49-
```{r warning=TRUE}
50-
iris %>%
51-
select(my_variables)
52-
```
53-
54-
but it throws a warning. tidyverse boasts detailed messages and it is worth always taking them into consideration. This is not recommended, as it is ambiguous. We can easily imagine situation when there is a column in data called "my_variables". What should happen, if we have both such a column and such an external variable? Which one would be selected? The answer does not matter for us, as we want to stick to best practices.
55-
56-
A solution suggested by authors of dplyr is to use `dplyr::all_of()`, which explicitly transforms a vector of names (or a single name) into symbols. In this way there is no ambiguity -- dplyr knows that it should use columns named by vector `my_variables`, not use `my_variables` as a name of column.
57-
58-
```{r warning=TRUE}
59-
iris %>%
60-
select(all_of(my_variables))
61-
```
62-
63-
## Problem 2: passing columns as arguments to custom functions
64-
65-
This difference between passing a variable name vs or symbol as a name can be especially tricky when building a function which calls dplyr verbs inside of it and is parametrized by column names. Let's see an example:
66-
67-
```{r}
68-
my_subset <- function(data, my_var) {
69-
data %>%
70-
select(my_var)
71-
}
72-
```
73-
74-
75-
This might cause a lot of issues. Should we provide a string as a name (`my_subset(iris, "Sepal.Length")`) or a symbol (`my_subset(iris, Sepal.Length)`)? To answer this question, we should first be clear about our intent (it would be nice to write a few words of documentation -- for other users or for ourselves in the future). Both approaches are possible, but it is better to stick to one.
76-
77-
If we decide that we want to use names as a string (it is a common case, e.g. when building shiny app and columns are selected in inputs), then we should use previously shown trick with `dplyr::all_of()`:
78-
79-
80-
```{r, eval=FALSE}
81-
my_subset_with_strings <- function(data, my_var_as_string) {
82-
data %>%
83-
select(all_of(my_var_as_string))
84-
}
85-
86-
my_subset_with_strings(iris, c("Sepal.Length", "Sepal.Width"))
87-
```
88-
89-
If we want to use symbols, just like directly in dplyr functions (mostly when those columns to use are predefined, in our internal functions or analyses), we have to *embrace* the variable:
90-
91-
```{r, eval=FALSE}
92-
my_subset_with_symbols <- function(data, my_var_as_symbol) {
93-
data %>%
94-
select({{ my_var_as_symbol }})
95-
}
96-
97-
my_subset_with_symbols(iris, Petal.Length)
98-
99-
my_subset_with_symbols(iris, Petal.Length, Sepal.Width)
100-
101-
iris %>%
102-
select(Petal.Length)
103-
104-
iris %>%
105-
select(Petal.Length, Sepal.Width)
106-
107-
108-
my_var_as_symbol = Petal.Length
109-
iris %>%
110-
select(my_var_as_symbol)
111-
112-
```
113-
114-
In this way we let dplyr know that `my_var_as_symbol` has to be passed directly as user provided it. We can think of embracing as of cut-paste operation. We tell dplyr: "Take what user provided in place of `my_var_as_symbol` in function call and plug it directly into `select`, without creating any intermediate variables.". Call to `my_subset_with_symbols()` is basically replaced with what lies inside of it. You can see comparison in figure TODO.
115-
116-
# Problem 3: dynamic columns in purrr formulas in `across`
117-
118-
Solutions above work fine when we provide those column names to `dplyr::select()`, `dplyr::filter()` or `dplyr::group_by()` directly. But sometimes we need a function that does something more. Let's say we want to have a function `do_magic`, which takes as an input some data, name of column `special` and names of other columns `others`. This function subtracts column `special` from all columns `others`.
119-
120-
We can try to do it with `dplyr::mutate(dplyr::across())`. It has a syntax `mutate(across(columns_to_mutate, function_to_apply))`. If we want to provide a custom unnamed function, we can use *purrr formula syntax*: `~ expression with .x` where `.x` is column. This (without enclosing it in a function yet) could look like:
121-
122-
```{r, eval=FALSE}
123-
iris %>%
124-
mutate(across(all_of(c("Sepal.Length", "Sepal.Width")), ~ .x - Petal.Length))
125-
126-
```
1+
---
2+
title: 'R Basic Advanceds: Variables and Names in dplyr'
3+
author: Dominik Rafacz
4+
date: '2023-01-30'
5+
slug: r-basic-advanceds-variables-and-names-in-dplyr
6+
categories: ['Tutorial']
7+
tags: ['r', 'tutorial', 'dplyr', 'environments', 'rlang']
8+
---
9+
10+
# Intro
11+
12+
Hello everyone! After an extended hiatus for various reasons (from graduating college to navigating job changes and legal challenges), we're back and eager to breathe new life into this blog. Given my deep interest in the fundamentals of advanced methods, today we're delving into an essential topic every dplyr user will eventually face.
13+
14+
dplyr is meticulously designed with the primary goal of making code workflows read as naturally and close to plain language as possible. This design philosophy manifests in two critical dimensions: *semantic* and *syntactic*.
15+
16+
Semantically, the emphasis is on **employing words with intuitive and easily understood meanings**. For instance, dplyr and its friends adhere to a robust naming convention where function names typically take on verb forms, elucidating the action they perform.
17+
18+
Syntactically, the **arrangement and combination of these descriptive words is paramount**. Arguably, this is even more critical to the user experience. One of the most evident manifestations of this syntactical approach is the tidyverse's hallmark feature: **the pipe operator**. But we are not going to tackle this today. I will look into caveats of another essential and intuitive syntactic feature: the **use of symbols instead of strings to refer to variables within datasets**. This offers a more natural-feeling mode of interaction but, as I have found out over many years of using R, this feature can lead to some problems.
19+
20+
21+
```{r message=FALSE, warning=FALSE, include=FALSE}
22+
library(dplyr)
23+
iris <- iris %>% slice(1:5)
24+
```
25+
26+
# Problem 1: Symbols vs. strings with names
27+
28+
Let's compare how we select columns in a data frame using base R versus dplyr:
29+
30+
```{r eval=FALSE}
31+
# base
32+
iris[, c("Sepal.Length", "Sepal.Width")]
33+
34+
# dplyr
35+
iris %>%
36+
  select(Sepal.Length, Sepal.Width)
37+
```
38+
39+
Notice the difference:
40+
41+
* In base R, we use `"Sepal.Length", "Sepal.Width"`, which are **strings** enclosed in quotes (single and double quotes are both valid).
42+
* With dplyr, we have `Sepal.Length, Sepal.Width`, unquoted **symbols**.
43+
44+
In the second case *symbols* are used to access columns in a data frame, just like we use symbols to access any variable or function that we store in our top-level environments.
45+
It is vital to grasp this distinction to sidestep the potential pitfalls that I will discuss in the rest of the post.
46+
47+
So, what are symbols, actually? We use them as names of objects, and that is the core of their identity. This is why it feels natural to use them to access not only top-level variables, but also variables in data. There is more to the nature of symbols, but we will come back to that later.
48+
49+
Notice that dplyr is smart enough to let you select variables by strings as well:
50+
51+
```{r}
52+
iris %>%
53+
  select("Sepal.Length", "Sepal.Width")
54+
```
55+
56+
This is, however, inadvisable, as this is exactly what tidyverse designers wanted to avoid.
57+
58+
Now, consider a scenario where we have an external variable storing column names:
59+
60+
```{r}
61+
my_variables <- c("Sepal.Length", "Sepal.Width")
62+
```
63+
64+
It might seem intuitive to supply it directly to `select()`:
65+
66+
```{r warning=TRUE}
67+
iris %>%
68+
  select(my_variables)
69+
```
70+
71+
This generates a warning. Given how informative the tidyverse's messages are, it's wise to pay heed. Supplying the variable directly is ambiguous: imagine having a column named `my_variables`. Which one should be selected if we have both the column and the external variable?
72+
73+
74+
![Diagram showing the dilemma that dplyr is faced with when we torment it with ambiguous selections.](/images/selection-ambiguity.png)
75+
To ensure clarity, the dplyr authors suggest using `dplyr::all_of()`, which explicitly tells dplyr to treat the contents of the vector as column names, resolving any ambiguity.
76+
77+
```{r warning=TRUE}
78+
iris %>%
79+
  select(all_of(my_variables))
80+
```
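
To make the ambiguity concrete, here is a minimal sketch. The data frame `df` and its columns are hypothetical, invented only for this illustration; it deliberately contains a column called `my_variables` alongside an external vector of the same name.

```{r}
library(dplyr)

# Hypothetical data frame that happens to contain a column named `my_variables`
df <- data.frame(a = 1:3, b = 4:6, my_variables = 7:9)

# External vector with the same name as that column
my_variables <- c("a", "b")

# A bare name is matched against the columns first,
# so this picks the column `my_variables`
names(select(df, my_variables))

# all_of() states the intent explicitly: use the names stored in the vector
names(select(df, all_of(my_variables)))
```

The first call should return the single column `my_variables`, while the second should return `a` and `b`. With `all_of()` the result no longer depends on which columns happen to exist in the data.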
81+
82+
# Problem 2: Passing column names as arguments to custom functions
83+
84+
The distinction between passing a string with a name and passing a symbol becomes trickier when we write functions that use dplyr verbs internally. Consider:
85+
86+
```{r}
87+
my_subset <- function(data, my_var) {
88+
  data %>%
89+
    select(my_var)
90+
}
91+
```
92+
93+
This might cause a lot of issues. Should we provide a string as a name (`my_subset(iris, "Sepal.Length")`) or a symbol (`my_subset(iris, Sepal.Length)`)? To answer this question, **we should first be clear about our intent** (it would be nice to write a few words of documentation -- for other users or for ourselves in the future). **Both approaches are possible and valid**. It is important to **choose one and remain consistent** across all functions that we write.
94+
95+
For instances where column names are passed as strings (common in Shiny apps when columns are selected by some input), one could utilize the previously discussed `dplyr::all_of()`:
96+
97+
98+
```{r, eval=FALSE}
99+
my_subset_with_strings <- function(data, my_var_as_string) {
100+
  data %>%
101+
    select(all_of(my_var_as_string))
102+
}
103+
104+
my_subset_with_strings(iris, c("Sepal.Length", "Sepal.Width"))
105+
```
106+
107+
If we want to use symbols, just as we do when calling dplyr functions directly (mostly when the columns are predefined, as in our internal functions or analyses), we have to *embrace* the argument:
108+
109+
```{r, eval=FALSE}
110+
my_subset_with_symbols <- function(data, my_var_as_symbol) {
111+
  data %>%
112+
    select({{ my_var_as_symbol }})
113+
}
114+
115+
my_subset_with_symbols(iris, Petal.Length)
116+
117+
# several columns must be wrapped in c(), because the function takes a single column argument
my_subset_with_symbols(iris, c(Petal.Length, Sepal.Width))
118+
```
119+
120+
In this way we let dplyr know that `my_var_as_symbol` has to be passed on exactly as the user provided it. We can think of embracing as a cut-and-paste operation. We tell dplyr: "Take what the user provided in place of `my_var_as_symbol` in the function call and plug it directly into `select()`, without creating any intermediate variables." The call to `my_subset_with_symbols()` is basically replaced with what lies inside of it.
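
The cut-and-paste intuition can be checked with a small sketch. The helper `select_cols()` and the object `small` are hypothetical names used only here; the idea is that calling the embracing function should give the same result as typing the selection directly into `select()`.

```{r}
library(dplyr)

# Hypothetical helper, written only for this illustration
select_cols <- function(data, cols) {
  data %>%
    select({{ cols }})
}

small <- head(iris, 5)

# A single symbol: the function call vs. the direct call
via_function <- select_cols(small, Petal.Length)
direct <- small %>% select(Petal.Length)
identical(via_function, direct)

# Several columns wrapped in c(): again compared with the direct call
via_function_two <- select_cols(small, c(Petal.Length, Sepal.Width))
direct_two <- small %>% select(Petal.Length, Sepal.Width)
identical(via_function_two, direct_two)
```

Both comparisons are expected to agree, since whatever the user typed for `cols` is passed on to `select()` unchanged.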
121+
122+
# Problem 3: Dynamic columns in purrr formulas in `across`
123+
124+
While the above solutions work seamlessly with functions like `dplyr::select()`, challenges arise when operations grow more complex. Suppose we wish to craft a function, `do_magic`, that takes data, the name of a `special` column, and the names of several `others` columns. This function should subtract the `special` column from each of the `others`.
125+
126+
Leveraging `dplyr::mutate(dplyr::across())` can achieve this. Its syntax is:
127+
128+
```{r eval=FALSE}
129+
mutate(across(columns_to_mutate, function_to_apply))
130+
```
131+
132+
For custom, unnamed functions, the *purrr formula syntax* (`~ expression` with `.x`, where `.x` stands for the current column) is convenient. In our case (without wrapping it in a function yet), it could look like:
133+
134+
```{r, eval=FALSE}
135+
iris %>%
136+
  mutate(across(all_of(c("Sepal.Length", "Sepal.Width")), ~ .x - Petal.Length))
137+
```
138+
139+
However, unlike in most languages, in R **symbols can be treated as objects themselves**. This is what allows dplyr to perform such simplifications. The details are irrelevant for now.
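
Returning to the `across()` call above, here is one possible sketch of how it could be wrapped in `do_magic()` when the column names arrive as strings. It is not necessarily the approach this article goes on to take: it assumes that `special` is a single string and `others` a character vector, and it relies on the `.data` pronoun to look up the special column by name inside the formula.

```{r}
library(dplyr)

# A sketch under the assumptions above: `special` is one column name (string),
# `others` is a character vector of column names
do_magic <- function(data, special, others) {
  data %>%
    mutate(across(all_of(others), ~ .x - .data[[special]]))
}

small <- head(iris, 5)
result <- do_magic(small, "Petal.Length", c("Sepal.Length", "Sepal.Width"))
result
```

Using `all_of()` for `others` and `.data[[special]]` for the special column keeps both lookups unambiguous, in the same spirit as the string-based functions from Problem 2.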