restructure, better to explain some things first?!

maelle · maelle · commit 5b2b472c0fc5 · 2025-02-11T11:37:26.000+01:00
diff --git a/content/blog/duckplyr-1-0-0/index.Rmd b/content/blog/duckplyr-1-0-0/index.Rmd
@@ -44,12 +44,60 @@ You can install it from CRAN with:
 install.packages("duckplyr")
 ```
 
-In this article, we'll introduce you to the basic usage of duckplyr, show how it can help you handle large data, and explain how you can help improve the package.
-
+In this article, we'll introduce you to the basic concepts behind duckplyr, show how it can help you handle normal-sized as well as large data, and explain how you can help improve the package.
 
 ## A drop-in replacement for dplyr
 
-The duckplyr package is a drop-in replacement for dplyr that uses DuckDB for speed.
+The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_.
+You can simply drop duckplyr into your pipeline by loading it; computations will then be carried out efficiently by DuckDB.
+
+```{r}
+library(conflicted)
+library(duckplyr)
+conflict_prefer("filter", "dplyr", quiet = TRUE)
+library("babynames")
+out <- babynames |>
+  filter(n > 1000) |>
+  summarize(
+    .by = c(sex, year),
+    babies_n = sum(n)
+  ) |>
+  filter(sex == "F")
+class(out)
+
+```
+
+The very tagline of duckplyr, being a drop-in replacement for dplyr that uses DuckDB for speed, creates a tension:
+
+-   When using dplyr, we are not used to explicitly collecting results; we just access them, because data.frames are eager by default.
+    Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
+    Therefore, _duckplyr needs eagerness_!
+
+-   The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, as dtplyr does with data.table.
+    _Therefore, duckplyr needs laziness_!
+
+As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports *deferred evaluation*.
+
+> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." Hannes Mühleisen.
+
+If the duckplyr data.frame is accessed by...
+
+-   duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
+-   anything other than duckplyr (say, ggplot2), then a special callback is executed, allowing materialization of the data frame.
+
+Therefore, duckplyr can be both *lazy* (within itself) and *not lazy* (for the outside world).
+
+Now, the default materialization can be problematic when dealing with large data: what if the materialization eats up all memory?
+To guard against this, the duckplyr package has a safeguard called `prudence` with three levels.
+
+- `"lavish"`: automatically materialize _regardless of size_,
+- `"stingy"` (like the famous duck Uncle Scrooge): _never_ automatically materialize,
+- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.
+
+By default, duckplyr frames are _lavish_, but duckplyr frames created from Parquet data (presumably large) are _thrifty_.
+
+## How to use duckplyr
+
 
 First, data is inputted using either conversion (from data in memory) or ingestion (from data in files) functions.
 Alternatively, calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created.
@@ -61,7 +109,7 @@ conflict_prefer("filter", "dplyr", quiet = TRUE)
 ```
 
 Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
-The duckplyr package performs the computation using DuckDB, or, if a specific operation is not supported, fallbacks to dplyr.
+The duckplyr package performs the computation using DuckDB.
 
 
 ```{r}
@@ -103,10 +151,6 @@ mtcars |>
 Current limitations are documented in a vignette.
 You can change the verbosity of fallbacks, refer to [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html).
 
-
-
-## A handy tool for large data
-
 ## Help us improve duckplyr!
 
 Our goals for future development of duckplyr include: