
Commit 97739a8 ("oops")
1 parent 5b2b472

2 files changed: +410, −9 lines changed

content/blog/duckplyr-1-0-0/index.Rmd

Lines changed: 86 additions & 9 deletions
@@ -49,13 +49,15 @@ In this article, we'll introduce you to the basic concepts behind duckplyr, show
## A drop-in replacement for dplyr

The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_.
You can simply _drop_ duckplyr into your pipeline by loading it; computations are then carried out efficiently by DuckDB.

```{r}
library(conflicted)
library(duckplyr)
conflict_prefer("filter", "dplyr", quiet = TRUE)
library("babynames")

out <- babynames |>
  filter(n > 1000) |>
  summarize(
@@ -69,7 +71,7 @@ class(out)

The very tagline of duckplyr, being a drop-in replacement for dplyr that uses DuckDB for speed, creates a tension:

- When using dplyr, we are not used to explicitly collecting results, we simply access them: the data.frames are "eager" by default.
Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
Therefore, _duckplyr needs eagerness_!

@@ -83,35 +85,80 @@ As a consequence, duckplyr is lazy on the inside for all DuckDB operations but e
If the duckplyr data.frame is accessed by...

- duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()`, for instance).
- not duckplyr (say, ggplot2, or `nrow()`), then a special callback is executed, allowing materialization of the data frame.

Therefore, duckplyr can be both *lazy* (within itself) and *not lazy* (for the outside world).

Now, the default materialization can be problematic when dealing with large data: what if the materialization eats up all the memory?
Therefore, the duckplyr package has a safeguard called `prudence`, with three levels.

- `"lavish"`: automatically materialize _regardless of size_,

```{r}
out <- babynames |>
  as_duckdb_tibble(prudence = "lavish") |> # default value of prudence :-)
  filter(n > 1000) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")

class(out)
nrow(out)
```

- `"stingy"`: _never_ automatically materialize,

```{r, error = TRUE}
stingy <- babynames |>
  as_duckdb_tibble(prudence = "stingy") |> # like the famous duck Uncle Scrooge :-)
  filter(n > 1500) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")

class(stingy)
nrow(stingy)
```

- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.

```{r}
out <- babynames |>
  as_duckdb_tibble(prudence = "thrifty") |> # materializes results of up to 1 million cells
  filter(n > 1000) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")

class(out)
nrow(out)
```

By default, duckplyr data frames are _lavish_, but duckplyr data frames created from Parquet data (presumably large) are _thrifty_.

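Here is a minimal sketch of the Parquet case, using a temporary file and toy data of our own, with `compute_parquet()` and `read_parquet_duckdb()`, both presented later in this post:

```r
# Write a toy duck frame to a Parquet file, then ingest it back:
# duck frames ingested from Parquet default to thrifty prudence.
path <- tempfile(fileext = ".parquet")
duckdb_tibble(a = 1:3) |>
  compute_parquet(path)

from_parquet <- read_parquet_duckdb(path)
class(from_parquet)
```
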
## How to use duckplyr

To _replace_ dplyr with duckplyr, you can either

- load duckplyr and then keep your pipeline as is. Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session, no matter how the data.frames are created.

```r
library(conflicted)
library(duckplyr)
conflict_prefer("filter", "dplyr", quiet = TRUE)
```

- create individual "duck frames", which allows you to control their automatic materialization parameters. To do so, you can use _conversion functions_ like `duckdb_tibble()` or `as_duckdb_tibble()`, or _ingestion functions_ like `read_csv_duckdb()` (see the sketch after this list).

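Here is a minimal sketch of this second route, reusing the babynames data from above (the `prudence` argument is described in the previous section; we assume `as_duckdb_tibble()` accepts it like `duckdb_tibble()` does):

```r
# Convert an existing data frame into a duck frame,
# controlling automatic materialization explicitly...
duck_babynames <- babynames |>
  as_duckdb_tibble(prudence = "thrifty")

# ...or build a duck frame directly from vectors.
duck_small <- duckdb_tibble(x = 1:3, y = c("a", "b", "c"))

class(duck_babynames)
```
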
Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
The duckplyr package performs the computation using DuckDB.

```{r}
library("babynames")
out <- babynames |>
@@ -151,6 +198,36 @@ mtcars |>
Current limitations are documented in a vignette.
You can change the verbosity of fallbacks; refer to [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html).

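For instance, a quick way to inspect fallbacks in your session is the situation report function linked above:

```r
# Print a situation report about dplyr fallbacks
# in the current session.
duckplyr::fallback_sitrep()
```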

### For large data

For large data, duckplyr is a worthy alternative to dtplyr and dbplyr.

With large datasets, you want:

- input data in an efficient format, like Parquet files. Therefore, you might input data using `read_parquet_duckdb()`.
- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without your having to use a syntax other than dplyr's.
- the output not to clutter all the memory. Therefore, you can make use of these features (see the sketch after this list):
  - the `prudence` parameter, to disable automatic materialization entirely, or beyond a certain output size.
  - computation to files using `compute_parquet()` or `compute_csv()`.

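As a sketch of the last point (the temporary path and toy data are ours; we assume `compute_csv()` takes the duck frame and a target path):

```r
# Stream a pipeline's result to a CSV file instead of memory;
# the returned duck frame is backed by that file.
csv_path <- tempfile(fileext = ".csv")
result <- duckdb_tibble(a = 1:3) |>
  mutate(b = a * 2) |>
  compute_csv(csv_path)
result
```
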
A drawback of analyzing large data with duckplyr is that its limitations won't be compensated by fallbacks, since fallbacks to dplyr necessitate putting data into memory.
Therefore, if your pipeline encounters fallbacks, you might want to work around them by converting the duck frame into a table through `compute()`, then running SQL code through the experimental `read_sql_duckdb()` function.

```{r}
data <-
  duckdb_tibble(a = 2) |>
  mutate(b = 3)

computed_data <-
  data |>
  compute(name = "computed_data")

sql_data <-
  read_sql_duckdb("SELECT *, a * b AS c FROM computed_data")

sql_data
```

## Help us improve duckplyr!

Our goals for future development of duckplyr include:
