content/blog/duckplyr-1-0-0/index.Rmd
Lines changed: 52 additions & 8 deletions
Original file line number
Diff line number
Diff line change
@@ -44,12 +44,60 @@ You can install it from CRAN with:
44
44
install.packages("duckplyr")
45
45
```
46
46
47
-
In this article, we'll introduce you to the basic usage of duckplyr, show how it can help you handle large data, and explain how you can help improve the package.
48
-
47
+
In this article, we'll introduce you to the basic concepts behind duckplyr, show how it can help you handle both normal-sized and large data, and explain how you can help improve the package.
49
48
50
49
## A drop-in replacement for dplyr
51
50
52
-
The duckplyr package is a drop-in replacement for dplyr that uses DuckDB for speed.
51
+
The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_.
52
+
You can simply drop duckplyr into your pipeline by loading it; computations are then carried out efficiently by DuckDB.
53
+
54
+
```{r}
55
+
library(conflicted)
56
+
library(duckplyr)
57
+
conflict_prefer("filter", "dplyr", quiet = TRUE)
58
+
library("babynames")
59
+
out <- babynames |>
60
+
  filter(n > 1000) |>
61
+
  summarize(
62
+
    .by = c(sex, year),
63
+
    babies_n = sum(n)
64
+
  ) |>
65
+
  filter(sex == "F")
66
+
class(out)
67
+
68
+
```
69
+
70
+
The very tagline of duckplyr, being a drop-in replacement for dplyr that uses DuckDB for speed, creates a tension:
71
+
72
+
- When using dplyr, we are not used to explicitly collecting results; we just access them, because data frames are eager by default.
73
+
Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
74
+
Therefore, _duckplyr needs eagerness_!
75
+
76
+
- The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table.
77
+
Therefore, _duckplyr needs laziness_!
78
+
79
+
As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports *deferred evaluation*.
80
+
81
+
> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." Hannes Mühleisen.
82
+
83
+
If the duckplyr data.frame is accessed by...
84
+
85
+
- duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
86
+
- not duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
87
+
88
+
Therefore, duckplyr can be both *lazy* (within itself) and *not lazy* (for the outside world).
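
A minimal sketch of this lazy-inside, eager-outside behavior, assuming duckplyr 1.0.0 is installed. The toy data and the names `df`, `agg`, and `totals` are illustrative and do not come from the post.

```r
library(duckplyr)

# Toy data (hypothetical); duckdb_tibble() creates a duckplyr frame,
# which is lavish by default.
df <- duckdb_tibble(g = c("a", "a", "b"), n = c(1, 2, 4))

# Handled by duckplyr: the pipeline is expected to stay lazy inside DuckDB.
agg <- df |>
  dplyr::summarise(total = sum(n), .by = g)

# The result still presents itself as an ordinary data frame.
is.data.frame(agg)
inherits(agg, "duckplyr_df")

# Access from outside duckplyr (here `$`) triggers materialization,
# with no explicit collect() step.
totals <- agg$total
sort(totals)
```

Nothing in the calling code distinguishes `agg` from a regular tibble, which is what makes the drop-in claim plausible.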
89
+
90
+
Now, the default materialization can be problematic when dealing with large data: what if the materialization eats up all memory?
91
+
Therefore, the duckplyr package has a safeguard called `prudence` with three levels.
92
+
93
+
- `"lavish"`: automatically materialize _regardless of size_,
94
+
- `"stingy"` (like the famous duck Uncle Scrooge): _never_ automatically materialize,
95
+
- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.
96
+
97
+
By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumably large) are _thrifty_.
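
The following sketch shows how the `"stingy"` level might behave in practice, assuming duckplyr 1.0.0 and its `duckdb_tibble()` constructor with a `.prudence` argument. The data and the names `stingy`, `doubled`, `res`, and `out` are hypothetical.

```r
library(duckplyr)

# Small toy frame (hypothetical); "stingy" forbids automatic materialization.
stingy <- duckdb_tibble(x = 1:3, y = c(10, 20, 30), .prudence = "stingy")

# dplyr verbs handled by duckplyr stay lazy, so this step is allowed.
doubled <- stingy |>
  dplyr::mutate(z = y * 2)

# Access from outside duckplyr would require materialization, which a
# stingy frame is expected to refuse; try() captures the error.
res <- try(nrow(doubled), silent = TRUE)
inherits(res, "try-error")

# An explicit collect() materializes the result on request.
out <- dplyr::collect(doubled)
out$z
```

With `.prudence = "thrifty"` the same access would be expected to succeed here, since this frame is far below the size limit of 1 million cells.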
98
+
99
+
## How to use duckplyr
100
+
53
101
54
102
First, data is inputted using either conversion (from data in memory) or ingestion (from data in files) functions.
55
103
Alternatively, calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created.