Commit 5b2b472

restructure, better to explain some things first?!
1 parent 96861b7 commit 5b2b472

1 file changed

+52
-8
lines changed

content/blog/duckplyr-1-0-0/index.Rmd

Lines changed: 52 additions & 8 deletions
@@ -44,12 +44,60 @@ You can install it from CRAN with:
4444
install.packages("duckplyr")
4545
```
4646

47-
In this article, we'll introduce you to the basic usage of duckplyr, show how it can help you handle large data, and explain how you can help improve the package.
48-
47+
In this article, we'll introduce you to the basic concepts behind duckplyr, show how it can help you handle normal-sized as well as large data, and explain how you can help improve the package.
4948

5049
## A drop-in replacement for dplyr
5150

52-
The duckplyr package is a drop-in replacement for dplyr that uses DuckDB for speed.
51+
The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_.
52+
You can simply drop duckplyr into your pipeline by loading it; computations will then be carried out efficiently by DuckDB.
53+
54+
```{r}
55+
library(conflicted)
56+
library(duckplyr)
57+
conflict_prefer("filter", "dplyr", quiet = TRUE)
58+
library("babynames")
59+
out <- babynames |>
60+
  filter(n > 1000) |>
61+
  summarize(
62+
    .by = c(sex, year),
63+
    babies_n = sum(n)
64+
  ) |>
65+
  filter(sex == "F")
66+
class(out)
67+
68+
```
69+
70+
The very tagline of duckplyr, being a drop-in replacement for dplyr that uses DuckDB for speed, creates a tension:
71+
72+
- When using dplyr, we are not used to explicitly collecting results; we just access them, because data frames are eager by default.
73+
Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
74+
Therefore, _duckplyr needs eagerness_!
75+
76+
- The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table.
77+
Therefore, _duckplyr needs laziness_!
78+
79+
As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports *deferred evaluation*.
80+
81+
> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." Hannes Mühleisen.
82+
83+
If the duckplyr data.frame is accessed by...
84+
85+
- duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
86+
- anything other than duckplyr (say, ggplot2), then a special callback is executed, which materializes the data frame.
87+
88+
Therefore, duckplyr can be both *lazy* (within itself) and *not lazy* (for the outside world).
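
The lazy-inside, eager-outside behaviour described above can be illustrated with a minimal sketch. It assumes duckplyr 1.0.0 or later is installed and uses the built-in `mtcars` data; the variable names are purely illustrative and not part of the original post.

```r
library(duckplyr)

# Convert an in-memory data frame; this keeps the default ("lavish") prudence.
cars <- as_duckdb_tibble(mtcars)

# dplyr verbs are handed to DuckDB; nothing has to be materialized yet.
by_cyl <- cars |>
  summarise(.by = cyl, mean_mpg = mean(mpg)) |>
  arrange(cyl)

# The result still presents itself as a regular data frame...
is_duckplyr <- inherits(by_cyl, "duckplyr_df")
is_data_frame <- is.data.frame(by_cyl)

# ...and code that knows nothing about duckplyr (base R here) can read it.
# This access is the point at which the data is materialized.
cyl_values <- by_cyl$cyl
```

From the caller's point of view there is no `collect()` step: the column access on the last line behaves as it would for any data frame.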
89+
90+
Now, the default materialization can be problematic when dealing with large data: what if the materialization eats up all memory?
91+
Therefore, the duckplyr package has a safeguard called `prudence` with three levels.
92+
93+
- `"lavish"`: automatically materialize _regardless of size_,
94+
- `"stingy"` (like the famous duck Uncle Scrooge): _never_ automatically materialize,
95+
- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.
96+
97+
By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumably large) are _thrifty_.
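
As a rough sketch of how the prudence levels behave in practice, the example below converts the built-in `mtcars` data with two different settings. It assumes duckplyr 1.0.0 or later, where `as_duckdb_tibble()` accepts a `prudence` argument; the variable names and the `"materialization refused"` placeholder string are illustrative only.

```r
library(duckplyr)

# The same small data frame, converted with two different prudence levels.
lavish <- as_duckdb_tibble(mtcars, prudence = "lavish")
stingy <- as_duckdb_tibble(mtcars, prudence = "stingy")

# Lavish frames materialize transparently when a column is accessed.
mean_mpg <- mean(lavish$mpg)

# Stingy frames refuse automatic materialization, so the same kind of
# access raises an error, which is caught here.
stingy_result <- tryCatch(
  length(stingy$mpg),
  error = function(e) "materialization refused"
)

# An explicit collect() materializes regardless of the prudence level.
collected <- stingy |>
  filter(cyl == 4) |>
  collect()
```

A stingy frame therefore stays usable inside a duckplyr pipeline; only the implicit hand-over to other code is blocked until `collect()` is called.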
98+
99+
## How to use duckplyr
100+
53101

54102
First, data is read in using either conversion functions (for data in memory) or ingestion functions (for data in files).
55103
Alternatively, calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created.
@@ -61,7 +109,7 @@ conflict_prefer("filter", "dplyr", quiet = TRUE)
61109
```
62110

63111
Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
64-
The duckplyr package performs the computation using DuckDB, or, if a specific operation is not supported, fallbacks to dplyr.
112+
The duckplyr package performs the computation using DuckDB.
65113

66114

67115
```{r}
@@ -103,10 +151,6 @@ mtcars |>
103151
Current limitations are documented in a vignette.
104152
You can change the verbosity of fallbacks; refer to [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html).
105153

106-
107-
108-
## A handy tool for large data
109-
110154
## Help us improve duckplyr!
111155

112156
Our goals for future development of duckplyr include:
