content/blog/duckplyr-1-0-0/index.Rmd
Lines changed: 86 additions & 9 deletions
@@ -49,13 +49,15 @@ In this article, we'll introduce you to the basic concepts behind duckplyr, show
49
49
## A drop-in replacement for dplyr
50
50
51
51
The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_.
52
-
You can simply drop duckplyr into your pipeline by loading it, then computations will be efficiently carried out by DuckDB.
52
+
You can simply _drop_ duckplyr into your pipeline by loading it; computations will then be efficiently carried out by DuckDB.
53
53
54
54
```{r}
55
55
library(conflicted)
56
56
library(duckplyr)
57
57
conflict_prefer("filter", "dplyr", quiet = TRUE)
58
58
library("babynames")
59
+
60
+
59
61
out <- babynames |>
60
62
filter(n > 1000) |>
61
63
summarize(
@@ -69,7 +71,7 @@ class(out)
69
71
70
72
The very tagline of duckplyr, being a drop-in replacement for dplyr that uses DuckDB for speed, creates a tension:
71
73
72
-
- When using dplyr, we are not used to explicitly collect results, we just access them: the data.frames are eager by default.
74
+
- When using dplyr, we are not used to explicitly collecting results; we simply access them: the data.frames are "eager" by default.
73
75
Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
74
76
Therefore, _duckplyr needs eagerness_!
75
77
@@ -83,35 +85,80 @@ As a consequence, duckplyr is lazy on the inside for all DuckDB operations but e
83
85
If the duckplyr data.frame is accessed by...
84
86
85
87
- duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
86
-
- not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
88
+
- not duckplyr (say, ggplot2, or `nrow()`), then a special callback is executed, allowing materialization of the data frame.
87
89
88
90
Therefore, duckplyr can be both *lazy* (within itself) and *not lazy* (for the outside world).
89
91
90
92
Now, the default materialization can be problematic if dealing with large data: what if the materialization eats up all memory?
91
93
Therefore, the duckplyr package has a safeguard called `prudence` with three levels.
92
94
93
95
- `"lavish"`: automatically materialize _regardless of size_,
94
-
- `"stingy"` (like the famous duck Uncle Scrooge): _never_ automatically materialize,
96
+
97
+
```{r}
98
+
out <- babynames |>
99
+
duckdb_tibble(prudence = "lavish") |> # default value of prudence :-)
100
+
filter(n > 1000) |>
101
+
summarize(
102
+
.by = c(sex, year),
103
+
babies_n = sum(n)
104
+
) |>
105
+
filter(sex == "F")
106
+
107
+
class(out)
108
+
nrow(out)
109
+
```
110
+
111
+
- `"stingy"`: _never_ automatically materialize,
112
+
113
+
```{r, error = TRUE}
114
+
stingy <- babynames |>
115
+
duckdb_tibble(prudence = "stingy") |> # like the famous duck Uncle Scrooge :-)
116
+
filter(n > 1500) |>
117
+
summarize(
118
+
.by = c(sex, year),
119
+
babies_n = sum(n)
120
+
) |>
121
+
filter(sex == "F")
122
+
123
+
class(stingy)
124
+
nrow(stingy)
125
+
```
126
+
95
127
- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.
96
128
97
-
By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumedly large) are _thrifty_.
129
+
```{r, error = TRUE}
130
+
out <- babynames |>
131
+
duckdb_tibble(prudence = "thrifty") |> # materializes only up to 1 million cells
132
+
filter(n > 1000) |>
133
+
summarize(
134
+
.by = c(sex, year),
135
+
babies_n = sum(n)
136
+
) |>
137
+
filter(sex == "F")
138
+
139
+
class(out)
140
+
nrow(out)
141
+
```
142
+
143
+
By default, duckplyr data frames are _lavish_, but duckplyr data frames created from Parquet data (presumably large) are _thrifty_.
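
To make the difference between automatic and explicit materialization concrete, here is a minimal sketch. It assumes that `as_duckdb_tibble()` accepts a `prudence` argument and that `collect()` is the way to materialize a stingy frame on purpose; the `scores` data and its column names are made up for illustration.

```r
library(duckplyr)

# A tiny stingy duck frame; the data and column names are hypothetical
scores <- as_duckdb_tibble(
  data.frame(team = c("a", "a", "b"), points = c(1, 2, 3)),
  prudence = "stingy"
)

# dplyr verbs stay lazy, so this works even on a stingy frame
totals <- scores |>
  dplyr::summarize(.by = team, total = sum(points))

# nrow() would need the materialized result, which a stingy frame refuses;
# we catch the expected error instead of stopping the script
nrow_attempt <- tryCatch(
  nrow(totals),
  error = function(e) "materialization refused"
)

# An explicit collect() materializes the result on request
collected <- dplyr::collect(totals)
nrow(collected)
```

The point of the sketch is that a stingy frame never surprises you with a large result in memory: you decide when to call `collect()`.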
98
144
99
145
## How to use duckplyr
100
146
147
+
To _replace_ dplyr with duckplyr, you can either
101
148
102
-
First, data is inputted using either conversion (from data in memory) or ingestion (from data in files) functions.
103
-
Alternatively, calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created.
149
+
- Load duckplyr and then keep your pipeline as is. Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created.
104
150
105
-
```{r load}
151
+
```r
106
152
library(conflicted)
107
153
library(duckplyr)
108
154
conflict_prefer("filter", "dplyr", quiet=TRUE)
109
155
```
110
156
157
+
- Create individual "duck frames", which lets you control their automatic materialization. To do so, you can use _conversion functions_ like `duckdb_tibble()` or `as_duckdb_tibble()`, or _ingestion functions_ like `read_csv_duckdb()`.
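
As a rough illustration of the second option, the sketch below creates one duck frame by conversion and one by ingestion. The data, the temporary CSV file, and the chosen `prudence` values are assumptions made for this example only.

```r
library(duckplyr)

# Conversion: turn an in-memory data frame into a duck frame
in_memory <- data.frame(id = 1:3, value = c(10, 20, 30))
converted <- as_duckdb_tibble(in_memory, prudence = "lavish")

# Ingestion: let DuckDB read a file directly
# (a temporary CSV file is written first so the sketch is self-contained)
csv_path <- tempfile(fileext = ".csv")
write.csv(in_memory, csv_path, row.names = FALSE)
ingested <- read_csv_duckdb(csv_path, prudence = "thrifty")

# Both objects are duck frames
inherits(converted, "duckplyr_df")
inherits(ingested, "duckplyr_df")
```

Either way, the resulting object can be passed to the usual dplyr verbs.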
158
+
111
159
Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
112
160
The duckplyr package performs the computation using DuckDB.
113
161
114
-
115
162
```{r}
116
163
library("babynames")
117
164
out <- babynames |>
@@ -151,6 +198,36 @@ mtcars |>
151
198
Current limitations are documented in a vignette.
152
199
You can change the verbosity of fallbacks; refer to [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html).
153
200
201
+
### For large data
202
+
203
+
For large data, duckplyr is a worthy alternative to dtplyr and dbplyr.
204
+
205
+
With large datasets, you want:
206
+
207
+
- input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
208
+
- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without requiring any syntax other than dplyr's.
209
+
- output that does not fill up all the memory. Therefore you can make use of these features:
210
+
  - the `prudence` parameter, to disable automatic materialization completely or to allow it only up to a certain output size.
211
+
  - computation to files using `compute_parquet()` or `compute_csv()`.
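
The three wishes above can be sketched as one small pipeline. This is an illustration under assumptions, not a recommended setup: the file paths are temporary files, the data are made up, and a tiny Parquet file is written first only so that the example is self-contained.

```r
library(duckplyr)

# Hypothetical file locations, for this sketch only
input_path <- tempfile(fileext = ".parquet")
output_path <- tempfile(fileext = ".parquet")

# Create a small Parquet file to stand in for a large dataset
duckdb_tibble(category = c("a", "a", "b"), value = c(1, 2, 3)) |>
  compute_parquet(input_path)

# Efficient input: ingest from Parquet, never materializing automatically
sales <- read_parquet_duckdb(input_path, prudence = "stingy")

# Efficient computation with plain dplyr syntax,
# with the result written to a file instead of R memory
result <- sales |>
  dplyr::filter(value > 1) |>
  dplyr::summarize(.by = category, total = sum(value)) |>
  compute_parquet(output_path)

file.exists(output_path)
```

With real data, `input_path` would point to existing Parquet files and only the final, small result would ever be collected.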
212
+
213
+
A drawback of analyzing large data with duckplyr is that fallbacks cannot compensate for duckplyr's limitations, because falling back to dplyr requires loading the data into memory.
214
+
Therefore, if your pipeline encounters fallbacks, you might want to work around them by converting the duck frame into a table with `compute()`, then running SQL code through the experimental `read_sql_duckdb()` function.
215
+
216
+
```{r}
217
+
data <-
218
+
duckdb_tibble(a = 2) |>
219
+
mutate(b = 3)
220
+
221
+
computed_data <-
222
+
data |>
223
+
compute(name = "computed_data")
224
+
225
+
sql_data <-
226
+
read_sql_duckdb("SELECT *, a * b AS c FROM computed_data")
227
+
228
+
sql_data
229
+
```
230
+
154
231
## Help us improve duckplyr!
155
232
156
233
Our goals for future development of duckplyr include: