tidyverse · hadley · Jun 19, 2025 · Feb 11, 2025 · Feb 11, 2025 · Feb 11, 2025
diff --git a/content/blog/duckplyr-1-0-0/index.Rmd b/content/blog/duckplyr-1-0-0/index.Rmd
@@ -0,0 +1,249 @@
+---
+output: hugodown::hugo_document
+
+slug: duckplyr-1-0-0
+title: duckplyr fully joins the tidyverse!
+date: 2025-02-11
+author: Kirill Müller and Maëlle Salmon
+description: >
+    duckplyr 1.0.0 is on CRAN and part of the tidyverse! duckplyr is a drop-in
+    replacement for dplyr, powered by DuckDB for speed.
+
+photo:
+  url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/
+  author: Kiril Gruev
+
+# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other"
+categories: [package] 
+tags:
+  - duckplyr
+  - dplyr
+  - tidyverse
+---
+
+<!--
+TODO:
+* [x] Look over / edit the post's title in the yaml
+* [x] Edit (or delete) the description; note this appears in the Twitter card
+* [x] Pick category and tags (see existing with `hugodown::tidy_show_meta()`)
+* [x] Find photo & update yaml metadata
+* [ ] Create `thumbnail-sq.jpg`; height and width should be equal
+* [ ] Create `thumbnail-wd.jpg`; width should be >5x height
+* [ ] `hugodown::use_tidy_thumbnails()`
+* [x] Add intro sentence, e.g. the standard tagline for the package
+* [x] `usethis::use_tidy_thanks()`
+-->
+
+We're very chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.0.0. 
+duckplyr is a drop-in replacement for dplyr, powered by [DuckDB](https://duckdb.org/) for speed.
+It joins the ranks of dplyr backends alongside [dtplyr](https://dtplyr.tidyverse.org) and [dbplyr](https://dbplyr.tidyverse.org).
+
+You can install it from CRAN with:
+
+```{r, eval = FALSE}
+install.packages("duckplyr")
+```
+
+In this article, we'll introduce you to the basic concepts behind duckplyr, show how it can help you handle both normal-sized and large data, and explain how you can help improve the package.
+
+## A drop-in replacement for dplyr
+
+The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_.
+You can simply _drop_ duckplyr into your pipeline by loading it; computations are then carried out efficiently by DuckDB.
+
+```{r}
+library(conflicted)
+library(duckplyr)
+conflict_prefer("filter", "dplyr", quiet = TRUE)
+library("babynames")
+
+
+out <- babynames |>
+  filter(n > 1000) |>
+  summarize(
+    .by = c(sex, year),
+    babies_n = sum(n)
+  ) |>
+  filter(sex == "F")
+class(out)
+
+```
+
+The very tagline of duckplyr, being a drop-in replacement for dplyr that uses DuckDB for speed, creates a tension:
+
+-   When using dplyr, we are not used to explicitly collecting results; we simply access them: data.frames are "eager" by default.
+    Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
+    The collection of results, called materialization, has to be automatic by default.
+    Therefore, _duckplyr needs eagerness_!
+
+-   The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table.
+    _Therefore, duckplyr needs laziness_!
+
+As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports *deferred evaluation*.
+
+> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." Hannes Mühleisen.
+
+If the duckplyr data.frame is accessed by...
+
+-   duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
+-   not duckplyr (say, ggplot2, or `nrow()`), then a special callback is executed, allowing materialization of the data frame.
+
+Therefore, duckplyr can be both *lazy* (within itself) and *not lazy* (for the outside world).
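+
+To make this concrete, here is a minimal sketch on toy data of our own choosing (the column names and values are purely illustrative).
+We assume a default, lavish duck frame: `explain()` should print the DuckDB plan for the pending pipeline, whereas `nrow()` is a call from outside duckplyr and should trigger materialization.
+
+```{r}
+library(duckplyr)
+
+# Toy data, chosen only for illustration
+duck <- duckdb_tibble(x = 1:5) |>
+  mutate(y = x * 2) |>
+  filter(y > 4)
+
+# Inspect the pending DuckDB plan rather than the data
+explain(duck)
+
+# Access from outside duckplyr: the result is materialized on demand
+nrow(duck)
+```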
+
+Now, the default automatic materialization can be problematic if dealing with large data: what if the materialization eats up all memory?
+Therefore, the duckplyr package has a safeguard called `prudence` with three levels.
+
+- `"lavish"`: automatically materialize _regardless of size_,
+
+```{r}
+out <- babynames |>
+  duckdb_tibble(prudence = "lavish") |> # default value of prudence :-)
+  filter(n > 1000) |>
+  summarize(
+    .by = c(sex, year),
+    babies_n = sum(n)
+  ) |>
+  filter(sex == "F")
+
+class(out)
+nrow(out)
+```
+
+- `"stingy"`: _never_ automatically materialize,
+
+```{r, error = TRUE}
+stingy <- babynames |>
+  duckdb_tibble(prudence = "stingy") |> # like the famous duck Uncle Scrooge :-)
+  filter(n > 1500) |>
+  summarize(
+    .by = c(sex, year),
+    babies_n = sum(n)
+  ) |>
+  filter(sex == "F")
+
+class(stingy)
+nrow(stingy)
+```
+
+- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.
+
+```{r, error = TRUE}
+thrifty <- babynames |>
+  duckdb_tibble(prudence = "thrifty") |>
+  filter(n > 1000) |>
+  summarize(
+    .by = c(sex, year),
+    babies_n = sum(n)
+  ) |>
+  filter(sex == "F")
+
+class(thrifty)
+nrow(thrifty)
+```
+
+By default, duckplyr data frames are _lavish_, but duckplyr data frames created from Parquet data (presumably large) are _thrifty_.
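+
+A stingy duck frame is not a dead end: automatic materialization is refused, but results can still be requested explicitly.
+The sketch below uses a small data frame of our own and assumes that `as_duckdb_tibble()` accepts a `prudence` argument and that `collect()` returns an ordinary tibble.
+
+```{r}
+library(duckplyr)
+
+# Toy data, chosen only for illustration
+stingy_toy <- as_duckdb_tibble(data.frame(x = 1:3), prudence = "stingy") |>
+  mutate(y = x + 1)
+
+# Automatic materialization is refused for stingy frames...
+res <- try(nrow(stingy_toy), silent = TRUE)
+inherits(res, "try-error")
+
+# ...but an explicit collect() brings the result into memory
+materialized <- collect(stingy_toy)
+nrow(materialized)
+```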
+
+## How to use duckplyr
+
+To _replace_ dplyr with duckplyr, you can either
+
+- Load duckplyr and then keep your pipeline as is. Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created.
+
+```{r}
+library(conflicted)
+library(duckplyr)
+conflict_prefer("filter", "dplyr", quiet = TRUE)
+```
+
+- Create individual "duck frames" which allows you to control their automatic materialization parameters. To do so, you can use _conversion functions_ like `duckdb_tibble()` or `as_duckdb_tibble()`, or _ingestion functions_ like `read_csv_duckdb()`.
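+
+As a rough sketch of this second option, the chunk below creates one duck frame by conversion and one by ingestion.
+The data frame and the temporary CSV file are stand-ins of our own, and we assume that `as_duckdb_tibble()` takes a `prudence` argument and that `read_csv_duckdb()` takes a file path as its first argument.
+
+```{r}
+library(duckplyr)
+
+# Toy data, chosen only for illustration
+df <- data.frame(id = 1:3, value = c(10, 20, 30))
+
+# Conversion: wrap an existing data frame and pick its prudence
+duck_from_df <- as_duckdb_tibble(df, prudence = "stingy")
+
+# Ingestion: let DuckDB read a CSV file directly
+csv_path <- tempfile(fileext = ".csv")
+write.csv(df, csv_path, row.names = FALSE)
+duck_from_csv <- read_csv_duckdb(csv_path)
+
+class(duck_from_df)
+class(duck_from_csv)
+```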
+
+Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
+The duckplyr package performs the computation using DuckDB.
+
+```{r}
+library("babynames")
+out <- babynames |>
+  filter(n > 1000) |>
+  summarize(
+    .by = c(sex, year),
+    babies_n = sum(n)
+  ) |>
+  filter(sex == "F")
+```
+
+The result can finally be materialized to memory, or computed temporarily, or computed to a file.
+
+```{r}
+# to memory
+out
+
+# to a file
+csv_file <- withr::local_tempfile()
+file.size(csv_file)
+compute_csv(out, csv_file)
+file.size(csv_file)
+```
+
+When duckplyr itself does not support specific functionality, it falls back to dplyr.
+For instance, row names are not supported yet:
+
+```{r}
+mtcars |>
+  summarize(
+    .by = cyl,
+    disp = mean(disp, na.rm = TRUE),
+    sd = sd(disp, na.rm = TRUE)
+  )
+```
+
+Current limitations are documented in [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html).
+You can change the verbosity of fallbacks; see [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html).
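+
+For instance, the following sketch asks duckplyr to report fallbacks as they happen and then prints the current settings.
+The environment variable name `DUCKPLYR_FALLBACK_INFO` is our reading of the fallback documentation, so please check that page for the authoritative list of options.
+
+```{r}
+library(duckplyr)
+
+# Assumption: this environment variable turns on messages for each fallback
+Sys.setenv(DUCKPLYR_FALLBACK_INFO = TRUE)
+
+# Report the current fallback settings
+fallback_sitrep()
+```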
+
+### For large data
+
+For large data, duckplyr is a worthy alternative to dtplyr and dbplyr.
+
+With large datasets, you want:
+
+- input data in an efficient format, like Parquet files. Therefore you might input data using `read_parquet_duckdb()`.
+- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without requiring any syntax other than dplyr's.
+- output that does not fill up all the memory. Therefore you can make use of these features:
+    - the `prudence` parameter, to disable automatic materialization completely or to allow it only up to a certain output size.
+    - computation to files using  `compute_parquet()` or `compute_csv()`.
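+
+Put together, such a pipeline might look like the sketch below.
+The tiny Parquet file written to a temporary path is only a stand-in for a genuinely large input, and the column names are hypothetical.
+
+```{r}
+library(duckplyr)
+
+# Stand-in for a large Parquet file (toy data, for illustration only)
+in_path <- tempfile(fileext = ".parquet")
+out_path <- tempfile(fileext = ".parquet")
+as_duckdb_tibble(data.frame(g = c("a", "a", "b"), v = c(1, 2, 3))) |>
+  compute_parquet(in_path)
+
+# Read, aggregate, and write back to Parquet
+result <- read_parquet_duckdb(in_path) |>
+  summarize(.by = g, total = sum(v)) |>
+  compute_parquet(out_path)
+
+file.exists(out_path)
+```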
+
+A drawback of analyzing large data with duckplyr is that the limitations of duckplyr won't be compensated by fallbacks since fallbacks to dplyr necessitate putting data into memory.
+Therefore, if your pipeline encounters fallbacks, you might want to work around them by converting the duck frame into a table through `compute()` and then running SQL code through the experimental `read_sql_duckdb()` function.
+
+```{r}
+data <-
+  duckdb_tibble(a = 2) |>
+  mutate(b = 3)
+
+computed_data <-
+  data |>
+  compute(name = "computed_data")
+
+sql_data <-
+  read_sql_duckdb("SELECT *, a * b AS c FROM computed_data")
+
+sql_data
+```
+
+## Help us improve duckplyr!
+
+Our goals for future development of duckplyr include:
+
+- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality;
+- Making it easier to contribute code to duckplyr.
+
+You can already help though, in three main ways:
+
+- Please report any issues, especially regarding unknown incompatibilities. See [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html).
+- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html).
+- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See [`vignette("telemetry")`](https://duckplyr.tidyverse.org/articles/telemetry.html) and the [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html) function.
+
+## Acknowledgements
+
+A big thanks to all 54 folks who filed issues, created PRs and generally helped to improve duckplyr!
+
+[&#x0040;adamschwing](https://github.com/adamschwing), [&#x0040;andreranza](https://github.com/andreranza), [&#x0040;apalacio9502](https://github.com/apalacio9502), [&#x0040;apsteinmetz](https://github.com/apsteinmetz), [&#x0040;barracuda156](https://github.com/barracuda156), [&#x0040;beniaminogreen](https://github.com/beniaminogreen), [&#x0040;bob-rietveld](https://github.com/bob-rietveld), [&#x0040;brichards920](https://github.com/brichards920), [&#x0040;cboettig](https://github.com/cboettig), [&#x0040;davidjayjackson](https://github.com/davidjayjackson), [&#x0040;DavisVaughan](https://github.com/DavisVaughan), [&#x0040;Ed2uiz](https://github.com/Ed2uiz), [&#x0040;eitsupi](https://github.com/eitsupi), [&#x0040;era127](https://github.com/era127), [&#x0040;etiennebacher](https://github.com/etiennebacher), [&#x0040;eutwt](https://github.com/eutwt), [&#x0040;fmichonneau](https://github.com/fmichonneau), [&#x0040;github-actions[bot]](https://github.com/github-actions[bot]), [&#x0040;hadley](https://github.com/hadley), [&#x0040;hannes](https://github.com/hannes), [&#x0040;hawkfish](https://github.com/hawkfish), [&#x0040;IndrajeetPatil](https://github.com/IndrajeetPatil), [&#x0040;JanSulavik](https://github.com/JanSulavik), [&#x0040;JavOrraca](https://github.com/JavOrraca), [&#x0040;jeroen](https://github.com/jeroen), [&#x0040;jhk0530](https://github.com/jhk0530), [&#x0040;joakimlinde](https://github.com/joakimlinde), [&#x0040;JosiahParry](https://github.com/JosiahParry), [&#x0040;krlmlr](https://github.com/krlmlr), [&#x0040;larry77](https://github.com/larry77), [&#x0040;lnkuiper](https://github.com/lnkuiper), [&#x0040;lorenzwalthert](https://github.com/lorenzwalthert), [&#x0040;luisDVA](https://github.com/luisDVA), [&#x0040;maelle](https://github.com/maelle), [&#x0040;math-mcshane](https://github.com/math-mcshane), [&#x0040;meersel](https://github.com/meersel), [&#x0040;multimeric](https://github.com/multimeric), [&#x0040;mytarmail](https://github.com/mytarmail), 
[&#x0040;nicki-dese](https://github.com/nicki-dese), [&#x0040;PMassicotte](https://github.com/PMassicotte), [&#x0040;prasundutta87](https://github.com/prasundutta87), [&#x0040;rafapereirabr](https://github.com/rafapereirabr), [&#x0040;Robinlovelace](https://github.com/Robinlovelace), [&#x0040;romainfrancois](https://github.com/romainfrancois), [&#x0040;sparrow925](https://github.com/sparrow925), [&#x0040;stefanlinner](https://github.com/stefanlinner), [&#x0040;thomasp85](https://github.com/thomasp85), [&#x0040;TimTaylor](https://github.com/TimTaylor), [&#x0040;Tmonster](https://github.com/Tmonster), [&#x0040;toppyy](https://github.com/toppyy), [&#x0040;wibeasley](https://github.com/wibeasley), [&#x0040;yjunechoe](https://github.com/yjunechoe), [&#x0040;ywhcuhk](https://github.com/ywhcuhk), and [&#x0040;zhjx19](https://github.com/zhjx19).