Merged
Changes from 9 commits
Commits
Show all changes
101 commits
96861b7
duckplyr 1.0.0
maelle Feb 11, 2025
5b2b472
restructure, better to explain some things first?!
maelle Feb 11, 2025
97739a8
oops
maelle Feb 11, 2025
1d80f56
evaluate...
maelle Feb 11, 2025
11e21c1
format
maelle Feb 11, 2025
eb09f2b
link
maelle Feb 11, 2025
44b7c66
tweak
maelle Feb 11, 2025
20b6a6e
fix mistake
maelle Feb 11, 2025
4ccfa70
link issue
maelle Feb 11, 2025
692b390
thanks @krlmlr
maelle Feb 11, 2025
0df0d53
explicitly mention data size
maelle Feb 11, 2025
4251569
fully compatible
maelle Feb 11, 2025
e30f959
add ref for DuckDB
maelle Feb 11, 2025
1cec2cb
compare to other backends
maelle Feb 11, 2025
b276db4
rm prudence @krlmlr
maelle Feb 11, 2025
7c92ddb
fallbacks
maelle Feb 11, 2025
d8aca67
link to docs
maelle Feb 11, 2025
480551b
more details here
maelle Feb 11, 2025
6adf074
links
maelle Feb 11, 2025
c955826
isn't this a goal too @krlmlr
maelle Feb 11, 2025
1c5153f
link
maelle Feb 11, 2025
78ee84b
more promises :sweat_smile:
maelle Feb 11, 2025
1b55a30
add thumbnail-wd
maelle Feb 11, 2025
ae87276
add thumbnail-sq
maelle Feb 11, 2025
ae8c8e6
check thumbnail things
maelle Feb 11, 2025
7661ad3
lol
maelle Feb 11, 2025
48eba60
phrasing
maelle Feb 11, 2025
8b6d0bd
Space at EOL
krlmlr Feb 13, 2025
e25fb07
Sentence
krlmlr Feb 13, 2025
a0b9b39
FIXME
krlmlr Feb 13, 2025
b90e8af
Shorten
krlmlr Feb 13, 2025
5ac6f6e
Verbose link
krlmlr Feb 13, 2025
14ec2f6
Not dying on this particular hill here
krlmlr Feb 13, 2025
b9a277a
Tweak query, let's see
krlmlr Feb 13, 2025
5762c0a
Prune
krlmlr Feb 13, 2025
6aaf953
This works
krlmlr Feb 13, 2025
f847736
Tweak narrative
krlmlr Feb 13, 2025
c78073f
Choose pivoting as an important op not yet supported
krlmlr Feb 13, 2025
1f898c0
Link style
krlmlr Feb 13, 2025
5a1f22c
aeolus
krlmlr Feb 13, 2025
2b4b421
Help
krlmlr Feb 13, 2025
d97b031
Exclude maintainers
krlmlr Feb 13, 2025
4be5ea9
Thanks
krlmlr Feb 13, 2025
f344b9f
Link
krlmlr Feb 13, 2025
a13315a
Restore narrative
krlmlr Feb 13, 2025
ad9825f
Add vignette link
krlmlr Feb 13, 2025
a734638
FIXME
krlmlr Feb 13, 2025
fc8122d
Date
krlmlr Feb 13, 2025
3211710
Why bother
krlmlr Feb 13, 2025
eea955a
Level
krlmlr Feb 13, 2025
4a20ca3
Move
krlmlr Feb 13, 2025
f5e4a38
Detail
krlmlr Feb 13, 2025
20dff03
TBC
krlmlr Feb 13, 2025
f231c68
Merge pull request #1 from krlmlr/duckplyr-post-krlmlr
maelle Feb 13, 2025
49b4f8b
kill your babies
maelle Feb 13, 2025
6b84b25
.
maelle Feb 13, 2025
452d5f2
typo
maelle Feb 13, 2025
21c2b74
use suggestion without repeating backend that's in the sentence right…
maelle Feb 20, 2025
a21daa5
fix
maelle Feb 20, 2025
9022d7f
specific
maelle Feb 20, 2025
f0563c0
start tweaking
maelle Feb 20, 2025
88846a3
weave benchmark in?
maelle Feb 20, 2025
17cdfc3
just rm
maelle Feb 20, 2025
9e0e496
rm ellipsis + comment on benchmark
maelle Feb 20, 2025
a5ac341
port majority of Kirill's edits
maelle Feb 20, 2025
36a93cb
fix phrasing
maelle Feb 20, 2025
4f88bbd
hide it for real
maelle Feb 20, 2025
ad8866a
Apply suggestions from code review
maelle Feb 21, 2025
beee540
un-hide
maelle Feb 21, 2025
5fe00ac
add this edit of @krlmlr's
maelle Feb 21, 2025
a186621
one fixme
maelle Feb 21, 2025
a994038
new section
maelle Feb 21, 2025
ac86b3a
rephrase
maelle Feb 21, 2025
4d6fa0d
typo
maelle Feb 21, 2025
0582a15
make it work :sweat_smile:
maelle Feb 21, 2025
735e9d9
Recreate environment and re-render
krlmlr Feb 21, 2025
217144a
Bold face
krlmlr Feb 22, 2025
dc07389
"small results processed seamlessly with dplyr" is the main goal of p…
krlmlr Feb 21, 2025
663eda2
Declutter
krlmlr Feb 21, 2025
dd9d20d
Space
krlmlr Feb 21, 2025
0008ac8
Move
krlmlr Feb 21, 2025
ea3fda0
methods_restore() needed only later
krlmlr Feb 21, 2025
9128e16
Wrap
krlmlr Feb 21, 2025
eaee543
Wording
krlmlr Feb 21, 2025
931fd45
lineitem_tbl
krlmlr Feb 21, 2025
5bb1a84
Keep it simple
krlmlr Feb 21, 2025
78729ac
Caveat
krlmlr Feb 21, 2025
0ef4ace
Stress
krlmlr Feb 21, 2025
0f23a74
Use function from the beginning
krlmlr Feb 22, 2025
236d793
Prune
krlmlr Feb 22, 2025
5bc98a6
Section
krlmlr Feb 22, 2025
cf61e47
Explicit verbosity
krlmlr Feb 22, 2025
bfac020
Render
krlmlr Feb 22, 2025
d0ed8f9
Merge branch 'main' into duckplyr-post
krlmlr Feb 23, 2025
51931cb
Final edits
krlmlr Feb 23, 2025
1ddb0e1
Paragraph and comments
krlmlr Feb 23, 2025
85b98cd
Merge branch 'main' into duckplyr-post
krlmlr May 16, 2025
58997b1
Tweak and render
krlmlr May 16, 2025
29f8d0c
Move
krlmlr May 16, 2025
af72705
Polish
hadley Jun 19, 2025
9fbb466
typo fix
maelle Jun 19, 2025
249 changes: 249 additions & 0 deletions content/blog/duckplyr-1-0-0/index.Rmd
@@ -0,0 +1,249 @@
---
output: hugodown::hugo_document

slug: duckplyr-1-0-0
title: duckplyr fully joins the tidyverse!
date: 2025-02-11
author: Kirill Müller and Maëlle Salmon
description: >
  duckplyr 1.0.0 is on CRAN and part of the tidyverse! duckplyr is a drop-in
  replacement for dplyr, powered by DuckDB for speed.

photo:
  url: https://www.pexels.com/photo/a-mallard-duck-on-water-6918877/
  author: Kiril Gruev

# one of: "deep-dive", "learn", "package", "programming", "roundup", or "other"
categories: [package]
tags:
- duckplyr
- dplyr
- tidyverse
---

<!--
TODO:
* [x] Look over / edit the post's title in the yaml
* [x] Edit (or delete) the description; note this appears in the Twitter card
* [x] Pick category and tags (see existing with `hugodown::tidy_show_meta()`)
* [x] Find photo & update yaml metadata
* [ ] Create `thumbnail-sq.jpg`; height and width should be equal
* [ ] Create `thumbnail-wd.jpg`; width should be >5x height
* [ ] `hugodown::use_tidy_thumbnails()`
* [x] Add intro sentence, e.g. the standard tagline for the package
* [x] `usethis::use_tidy_thanks()`
-->

We're very chuffed to announce the release of [duckplyr](https://duckplyr.tidyverse.org) 1.0.0.
duckplyr is a drop-in replacement for dplyr, powered by [DuckDB](https://duckdb.org/) for speed.
It joins the ranks of dplyr backends together with [dtplyr](https://dtplyr.tidyverse.org) and [dbplyr](https://dbplyr.tidyverse.org).

You can install it from CRAN with:

```{r, eval = FALSE}
install.packages("duckplyr")
```

Review comment (Member): With tidyverse/tidyverse#346, we can also `install.packages("tidyverse")`.

Reply (Contributor Author): Right, but won't the post be published before the PR is merged and the tidyverse package is released on CRAN?

In this article, we'll introduce you to the basic concepts behind duckplyr, show how it can help you handle normal-sized as well as large data, and explain how you can help improve the package.

## A drop-in replacement for dplyr

The duckplyr package is a _drop-in replacement for dplyr_ that uses _DuckDB for speed_.

Review comment (Member): I think you need a sentence or two here as to why it's needed compared to dbplyr.

Reply (Member): Done below: https://github.com/tidyverse/tidyverse.org/pull/724/files#diff-30be2863e4e092b08c7c1ad595164bec2f424a02f175c56e32f85d3475559a94R102-R104

As with other dplyr backends such as dtplyr and dbplyr, duckplyr allows you to get faster results without learning a different syntax.
Unlike other dplyr backends, duckplyr does not require you to change existing code or learn specific idiosyncrasies.
Not only is the syntax the same, the semantics are too!

You can simply _drop_ duckplyr into your pipeline by loading it, then computations will be efficiently carried out by DuckDB.

```{r}
library(conflicted)
library(duckplyr)
conflict_prefer("filter", "dplyr", quiet = TRUE)
library("babynames")

out <- babynames |>
  filter(n > 1000) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")
class(out)
```

The very tagline of duckplyr, being a drop-in replacement for dplyr that uses DuckDB for speed, creates a tension:

- When using dplyr, we are not used to explicitly collecting results; we simply access them: the data.frames are "eager" by default.
Adding a `collect()` step by default would confuse users and make "drop-in replacement" an exaggeration.
The collection of results, called materialization, has to be automatic by default.
Therefore, _duckplyr needs eagerness_!

- The whole advantage of using DuckDB under the hood is letting DuckDB optimize computations, like dtplyr does with data.table.
Therefore, _duckplyr needs laziness_!

As a consequence, duckplyr is lazy on the inside for all DuckDB operations but eager on the outside, thanks to [ALTREP](https://duckdb.org/2024/04/02/duckplyr.html#eager-vs-lazy-materialization), a powerful R feature that among other things supports *deferred evaluation*.

> "ALTREP allows R objects to have different in-memory representations, and for custom code to be executed whenever those objects are accessed." Hannes Mühleisen.

If the duckplyr data.frame is accessed by...

- duckplyr, then the operations continue to be lazy (until a call to `collect.duckplyr_df()` for instance).
- not duckplyr (say, ggplot2, or `nrow()`), then a special callback is executed, allowing materialization of the data frame.

Therefore, duckplyr can be both *lazy* (within itself) and *not lazy* (for the outside world).
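
To make the lazy-inside, eager-outside behaviour described above concrete, here is a minimal sketch. It assumes duckplyr 1.0.0 is installed; the toy columns `x` and `y` are made up for illustration.

```r
library(duckplyr)

# A small duck frame; the mutate() step is handed to DuckDB and not run yet.
df <- duckdb_tibble(x = 1:5) |>
  mutate(y = x * 2)

# To the outside world, it is still a regular data frame...
is_df <- inherits(df, "data.frame")

# ...and base R access such as nrow() or `$` triggers materialization.
n <- nrow(df)
y <- df$y
```

No `collect()` call is needed here: the first access from outside duckplyr is what materializes the result.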

Now, the default automatic materialization can be problematic if dealing with large data: what if the materialization eats up all memory?
Therefore, the duckplyr package has a safeguard called `prudence` with three levels.

- `"lavish"`: automatically materialize _regardless of size_,

```{r}
out <- babynames |>
  as_duckdb_tibble(prudence = "lavish") |> # default value of prudence :-)
  filter(n > 1000) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")

class(out)
nrow(out)
```

- `"stingy"`: _never_ automatically materialize,

```{r, error = TRUE}
stingy <- babynames |>
  as_duckdb_tibble(prudence = "stingy") |> # like the famous duck Uncle Scrooge :-)
  filter(n > 1500) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")

class(stingy)
nrow(stingy)
```

- `"thrifty"`: automatically materialize _up to a maximum size of 1 million cells_.

```{r, error = TRUE}
thrifty <- babynames |>
  as_duckdb_tibble(prudence = "thrifty") |>
  filter(n > 1000) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")

class(thrifty)
nrow(thrifty)
```

By default, duckplyr data frames are _lavish_, but duckplyr data frames created from Parquet data (presumably large) are _thrifty_.
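
When automatic materialization is disabled, results can still be materialized explicitly. The following is a minimal sketch, assuming duckplyr 1.0.0, where `duckdb_tibble()` takes the prudence level through its `.prudence` argument; the column `x` is made up for illustration.

```r
library(duckplyr)

stingy <- duckdb_tibble(x = 1:3, .prudence = "stingy")

# Automatic materialization is disabled, so nrow() is expected to fail.
res <- try(nrow(stingy), silent = TRUE)
failed <- inherits(res, "try-error")

# Explicit materialization with collect() still works.
materialized <- collect(stingy)
n <- nrow(materialized)
```

The explicit `collect()` marks the one place where data is allowed to enter R's memory.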

## How to use duckplyr

To _replace_ dplyr with duckplyr, you can either

- Load duckplyr and then keep your pipeline as is. Calling `library(duckplyr)` overwrites dplyr methods, enabling duckplyr for the entire session no matter how data.frames are created.

```{r}
library(conflicted)
library(duckplyr)
conflict_prefer("filter", "dplyr", quiet = TRUE)
```

- Create individual "duck frames", which allows you to control their automatic materialization parameters. To do so, you can use _conversion functions_ like `duckdb_tibble()` or `as_duckdb_tibble()`, or _ingestion functions_ like `read_csv_duckdb()`.
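
As a rough sketch of the second option, the conversion and ingestion functions could be used as follows. The toy data and the temporary CSV file are made up for illustration, and the sketch assumes duckplyr 1.0.0.

```r
library(duckplyr)

toy <- data.frame(id = 1:3, value = c(10, 20, 30))

# Conversion: turn an existing data frame into a duck frame.
converted <- as_duckdb_tibble(toy)

# Ingestion: let DuckDB read a CSV file directly.
csv_path <- tempfile(fileext = ".csv")
write.csv(toy, csv_path, row.names = FALSE)
ingested <- read_csv_duckdb(csv_path)

# Both are duck frames and accept the usual dplyr verbs.
big_values <- ingested |>
  filter(value > 15) |>
  collect()
```
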

Then, the data manipulation pipeline uses the exact same syntax as a dplyr pipeline.
The duckplyr package performs the computation using DuckDB.

```{r}
library("babynames")
out <- babynames |>
  filter(n > 1000) |>
  summarize(
    .by = c(sex, year),
    babies_n = sum(n)
  ) |>
  filter(sex == "F")
```

The result can finally be materialized to memory, or computed temporarily, or computed to a file.

```{r}
# to memory
out

# to a file
csv_file <- withr::local_tempfile()
file.size(csv_file)
compute_csv(out, csv_file)
file.size(csv_file)
```

When duckplyr itself does not support specific functionality, it falls back to dplyr.
For instance, row names are not supported yet:

```{r}
mtcars |>
  summarize(
    .by = cyl,
    # Compute sd first: once `disp` is overwritten, later expressions see the new value
    sd = sd(disp, na.rm = TRUE),
    disp = mean(disp, na.rm = TRUE)
  )
```

Current limitations are documented in [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html).
You can change the verbosity of fallbacks; see [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html).

### For large data

For large data, duckplyr is a worthy alternative to dtplyr and dbplyr.

With large datasets, you want:

- input data in an efficient format, like Parquet files. You might therefore read data with `read_parquet_duckdb()`.
- efficient computation, which duckplyr provides via DuckDB's holistic optimization, without requiring any syntax other than dplyr's.
- output that does not fill up memory. You can therefore make use of these features:
    - the `prudence` parameter, to disable automatic materialization completely or to allow it only up to a certain output size;
    - computation to files using `compute_parquet()` or `compute_csv()`.
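
Put together, such a workflow might look like the sketch below. The tiny Parquet file stands in for a large dataset, and the column names `grp` and `value` are made up for illustration; the sketch assumes duckplyr 1.0.0.

```r
library(duckplyr)

# Write a small Parquet file to stand in for a large dataset.
input_path <- tempfile(fileext = ".parquet")
duckdb_tibble(grp = c("a", "b", "a"), value = c(1, 2, 3)) |>
  compute_parquet(input_path)

# Read the file as a duck frame and describe the computation.
totals <- read_parquet_duckdb(input_path) |>
  summarize(.by = grp, total = sum(value))

# Write the result to a file instead of pulling it into memory.
output_path <- tempfile(fileext = ".parquet")
on_disk <- compute_parquet(totals, output_path)

# Only the small final result is materialized explicitly.
result <- collect(on_disk) |>
  arrange(grp)
```
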

A drawback of analyzing large data with duckplyr is that fallbacks cannot compensate for its limitations, since falling back to dplyr requires loading the data into memory.
Therefore, if your pipeline encounters fallbacks, you might want to work around them by converting the duck frame into a table with `compute()` and then running SQL code through the experimental `read_sql_duckdb()` function.

```{r}
data <-
  duckdb_tibble(a = 2) |>
  mutate(b = 3)

computed_data <-
  data |>
  compute(name = "computed_data")

sql_data <-
  read_sql_duckdb("SELECT *, a * b AS c FROM computed_data")

sql_data
```

## Help us improve duckplyr!

Our goals for future development of duckplyr include:

- Enabling users to provide [custom translations](https://github.com/tidyverse/duckplyr/issues/158) of dplyr functionality;
- Making it easier to contribute code to duckplyr.

You can already help though, in three main ways:

- Please report any issues, especially regarding unknown incompatibilities. See [`vignette("limits")`](https://duckplyr.tidyverse.org/articles/limits.html).
- Contribute to the codebase after reading duckplyr's [contributing guide](https://duckplyr.tidyverse.org/CONTRIBUTING.html).
- Turn on telemetry to help us hear about the most frequent fallbacks so we can prioritize working on the corresponding missing dplyr translation. See [`vignette("telemetry")`](https://duckplyr.tidyverse.org/articles/telemetry.html) and the [`duckplyr::fallback_sitrep()`](https://duckplyr.tidyverse.org/reference/fallback.html) function.

## Acknowledgements

A big thanks to all 54 folks who filed issues, created PRs and generally helped to improve duckplyr!

[&#x0040;adamschwing](https://github.com/adamschwing), [&#x0040;andreranza](https://github.com/andreranza), [&#x0040;apalacio9502](https://github.com/apalacio9502), [&#x0040;apsteinmetz](https://github.com/apsteinmetz), [&#x0040;barracuda156](https://github.com/barracuda156), [&#x0040;beniaminogreen](https://github.com/beniaminogreen), [&#x0040;bob-rietveld](https://github.com/bob-rietveld), [&#x0040;brichards920](https://github.com/brichards920), [&#x0040;cboettig](https://github.com/cboettig), [&#x0040;davidjayjackson](https://github.com/davidjayjackson), [&#x0040;DavisVaughan](https://github.com/DavisVaughan), [&#x0040;Ed2uiz](https://github.com/Ed2uiz), [&#x0040;eitsupi](https://github.com/eitsupi), [&#x0040;era127](https://github.com/era127), [&#x0040;etiennebacher](https://github.com/etiennebacher), [&#x0040;eutwt](https://github.com/eutwt), [&#x0040;fmichonneau](https://github.com/fmichonneau), [&#x0040;github-actions[bot]](https://github.com/github-actions[bot]), [&#x0040;hadley](https://github.com/hadley), [&#x0040;hannes](https://github.com/hannes), [&#x0040;hawkfish](https://github.com/hawkfish), [&#x0040;IndrajeetPatil](https://github.com/IndrajeetPatil), [&#x0040;JanSulavik](https://github.com/JanSulavik), [&#x0040;JavOrraca](https://github.com/JavOrraca), [&#x0040;jeroen](https://github.com/jeroen), [&#x0040;jhk0530](https://github.com/jhk0530), [&#x0040;joakimlinde](https://github.com/joakimlinde), [&#x0040;JosiahParry](https://github.com/JosiahParry), [&#x0040;krlmlr](https://github.com/krlmlr), [&#x0040;larry77](https://github.com/larry77), [&#x0040;lnkuiper](https://github.com/lnkuiper), [&#x0040;lorenzwalthert](https://github.com/lorenzwalthert), [&#x0040;luisDVA](https://github.com/luisDVA), [&#x0040;maelle](https://github.com/maelle), [&#x0040;math-mcshane](https://github.com/math-mcshane), [&#x0040;meersel](https://github.com/meersel), [&#x0040;multimeric](https://github.com/multimeric), [&#x0040;mytarmail](https://github.com/mytarmail), 
[&#x0040;nicki-dese](https://github.com/nicki-dese), [&#x0040;PMassicotte](https://github.com/PMassicotte), [&#x0040;prasundutta87](https://github.com/prasundutta87), [&#x0040;rafapereirabr](https://github.com/rafapereirabr), [&#x0040;Robinlovelace](https://github.com/Robinlovelace), [&#x0040;romainfrancois](https://github.com/romainfrancois), [&#x0040;sparrow925](https://github.com/sparrow925), [&#x0040;stefanlinner](https://github.com/stefanlinner), [&#x0040;thomasp85](https://github.com/thomasp85), [&#x0040;TimTaylor](https://github.com/TimTaylor), [&#x0040;Tmonster](https://github.com/Tmonster), [&#x0040;toppyy](https://github.com/toppyy), [&#x0040;wibeasley](https://github.com/wibeasley), [&#x0040;yjunechoe](https://github.com/yjunechoe), [&#x0040;ywhcuhk](https://github.com/ywhcuhk), and [&#x0040;zhjx19](https://github.com/zhjx19).