Commit a16b3db

Generation
1 parent 2f5a61c commit a16b3db

20 files changed

+726
-76
lines changed

Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
+{
+"hash": "fbc111072154bc9dce4c5c1be7347998",
+"result": {
+"engine": "knitr",
+"markdown": "---\ntitle: \"Comparison between arrow::to_arrow() and duckplyr for writing Parquet files\"\ndescription: \"Why You Should Avoid arrow::to_arrow() with DuckDB + dplyr\"\nlang: en\ndate: 2025-07-11\ncategories: [R, duckdb, arrow]\nimage: https://duckplyr.tidyverse.org/logo.png \ndraft: false\n---\n\nA commonly recommended approach to write a Parquet file after using `dplyr::tbl` with `duckdb` is to use `arrow::to_arrow` with `arrow::write_dataset` or `arrow::write_parquet`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntbl(con, \"read_parquet('geo.parquet')\") |>\n ...\n arrow::to_arrow() |>\n arrow::write_dataset(\"my_dataset\")\n```\n:::\n\n\nWhile this syntax works, the new [duckplyr](https://duckplyr.tidyverse.org/index.html) package offers a **much more efficient** alternative:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncon <- dbConnect(duckdb())\n\ntbl(con, \"read_parquet('geo.parquet')\") |>\n ...\n duckplyr::as_duckdb_tibble() |> # <1>\n duckplyr::compute_parquet(\"my_tbl.parquet\") # <2>\n```\n:::\n\n\n1. [`duckplyr::as_duckdb_tibble`](https://duckplyr.tidyverse.org/reference/duckdb_tibble.html) converts the object returned by `tbl()` into a `duckplyr` object\n2. [`duckplyr::compute_parquet`](https://duckplyr.tidyverse.org/reference/compute_parquet.html) writes the Parquet file\n\nThese two lines achieve the same result as the Arrow version, but using `duckplyr` is **much more efficient**.\n\n## A Quick Benchmark\n\nHere are the results from benchmarking three common methods (with full reproducible code below):\n\n- `with_arrow`: using `arrow::to_arrow()` + `write_dataset()`\n- `with_duckplyr`: using `duckplyr::as_duckdb_tibble()` + `compute_parquet()`\n- `with_copy_to`: using DuckDB’s native `COPY ... TO ...` as a baseline\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Show me the benchmark code\"}\nlibrary(duckdb)\nlibrary(dplyr)\nlibrary(arrow)\nlibrary(kableExtra)\nlibrary(timemoir)\n\nif (!file.exists(\"geo.parquet\")) {\n download.file(\"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\", \"geo.parquet\")\n}\n\n# Full DuckDB method\nwith_copy_to <- function() {\n con <- dbConnect(duckdb())\n on.exit(dbDisconnect(con, shutdown = TRUE))\n\n dbExecute(con, \"COPY (FROM read_parquet('geo.parquet')) TO 'test.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)\")\n}\n\n# \"Historical\" version with Arrow\nwith_arrow <- function() {\n con <- dbConnect(duckdb())\n on.exit(dbDisconnect(con, shutdown = TRUE))\n\n tbl(con, \"read_parquet('geo.parquet')\") |>\n arrow::to_arrow() |>\n arrow::write_dataset('test', compression='zstd')\n}\n\n# Version using the new duckplyr package\nwith_duckplyr <- function() {\n con <- dbConnect(duckdb())\n on.exit(dbDisconnect(con, shutdown = TRUE))\n\n tbl(con, \"read_parquet('geo.parquet')\") |>\n duckplyr::as_duckdb_tibble() |>\n duckplyr::compute_parquet(\"my_tbl.parquet\")\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- timemoir(\n with_arrow(), \n with_copy_to(), \n with_duckplyr()\n)\n```\n\n```{.r .cell-code}\nres |>\n kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname | duration|error | start_mem| max_mem| cpu_user| cpu_sys|\n|:---------------|--------:|:-----|---------:|--------:|--------:|-------:|\n|with_arrow() | 125.123|NA | 153200| 21206428| 157.164| 50.598|\n|with_copy_to() | 28.969|NA | 158720| 11870840| 157.578| 58.525|\n|with_duckplyr() | 33.704|NA | 164088| 11933724| 128.338| 50.864|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output-display}\n![](index.en_files/figure-html/output_benchmark-1.png){width=672}\n:::\n:::\n\n\n---\n\nOn the server I use, the `duckplyr` version is **about 3.7× faster** than the `arrow` version and uses **half the memory**, performing on par with pure DuckDB (for this very simple test case).\n\n## Conclusion\n\nIf you're working with `dplyr`, stop using `to_arrow()` and switch to `duckplyr` for better performance.\n\n## Useful Links\n\n- [duckplyr documentation](https://duckplyr.tidyverse.org/articles/large.html)\n\n---\n\n::: {.callout-note collapse=true}\n## Session Info\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = \"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting value\n version R version 4.5.0 (2025-04-11)\n os Ubuntu 22.04.5 LTS\n system x86_64, linux-gnu\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz Etc/UTC\n date 2025-08-09\n pandoc 3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto 1.7.31 @ /usr/local/bin/quarto\n\n─ Packages ───────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n arrow * 20.0.0.2 2025-05-26 [1] RSPM (R 4.5.0)\n DBI * 1.2.3 2024-06-02 [1] RSPM (R 4.5.0)\n dplyr * 1.1.4 2023-11-17 [1] RSPM (R 4.5.0)\n duckdb * 1.3.0 2025-06-02 [1] RSPM (R 4.5.0)\n kableExtra * 1.4.0 2024-01-24 [1] RSPM (R 4.5.0)\n timemoir * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",
+"supporting": [],
+"filters": [
+"rmarkdown/pagebreak.lua"
+],
+"includes": {},
+"engineDependencies": {},
+"preserve": {},
+"postProcess": true
+}
+}

_freeze/posts/2025-08-09-dplyr-to-duckdb/index/execute-results/html.json

Lines changed: 15 additions & 0 deletions
Large diffs are not rendered by default.

docs/en/posts-r.xml

Lines changed: 16 additions & 16 deletions
@@ -970,30 +970,30 @@ font-style: inherit;">kable</span>()</span></code></pre></div>
 <tbody>
 <tr class="odd">
 <td style="text-align: left;">with_arrow()</td>
-<td style="text-align: right;">61.147</td>
+<td style="text-align: right;">125.123</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149664</td>
-<td style="text-align: right;">21835576</td>
-<td style="text-align: right;">76.045</td>
-<td style="text-align: right;">17.691</td>
+<td style="text-align: right;">153200</td>
+<td style="text-align: right;">21206428</td>
+<td style="text-align: right;">157.164</td>
+<td style="text-align: right;">50.598</td>
 </tr>
 <tr class="even">
 <td style="text-align: left;">with_copy_to()</td>
-<td style="text-align: right;">7.480</td>
+<td style="text-align: right;">28.969</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149104</td>
-<td style="text-align: right;">9096224</td>
-<td style="text-align: right;">66.407</td>
-<td style="text-align: right;">9.944</td>
+<td style="text-align: right;">158720</td>
+<td style="text-align: right;">11870840</td>
+<td style="text-align: right;">157.578</td>
+<td style="text-align: right;">58.525</td>
 </tr>
 <tr class="odd">
 <td style="text-align: left;">with_duckplyr()</td>
-<td style="text-align: right;">7.013</td>
+<td style="text-align: right;">33.704</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149104</td>
-<td style="text-align: right;">11818744</td>
-<td style="text-align: right;">54.990</td>
-<td style="text-align: right;">10.564</td>
+<td style="text-align: right;">164088</td>
+<td style="text-align: right;">11933724</td>
+<td style="text-align: right;">128.338</td>
+<td style="text-align: right;">50.864</td>
 </tr>
 </tbody>
 </table>
@@ -1004,7 +1004,7 @@ font-style: inherit;">plot</span>(res)</span></code></pre></div>
 <div class="cell-output-display">
 <div>
 <figure class="figure">
-<p><img src="https://nbc.github.io/en/posts/2025-07-10-to_arrow_bad/index_files/figure-html/output_benchmark-1.png" class="img-fluid figure-img" width="672"></p>
+<p><img src="https://nbc.github.io/en/posts/2025-07-10-to_arrow_bad/index.en_files/figure-html/output_benchmark-1.png" class="img-fluid figure-img" width="672"></p>
 </figure>
 </div>
 </div>

docs/en/posts.html

Lines changed: 3 additions & 3 deletions
@@ -184,7 +184,7 @@ <h1 class="title">Blog</h1>

 </header><div class="quarto-listing quarto-listing-container-default" id="listing-listing">
 <div class="list quarto-listing-default">
-<div class="quarto-post image-right" data-index="0" data-categories="UiUyQ2R1Y2tkYiUyQ2Fycm93JTJDc2YlMkNnZW9hcnJvdw==" data-listing-date-sort="1752278400000" data-listing-file-modified-sort="1754756634779" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="690">
+<div class="quarto-post image-right" data-index="0" data-categories="UiUyQ2R1Y2tkYiUyQ2Fycm93JTJDc2YlMkNnZW9hcnJvdw==" data-listing-date-sort="1752278400000" data-listing-file-modified-sort="1754840644338" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="4" data-listing-word-count-sort="690">
 <div class="thumbnail"><a href="./posts/2025-07-10-st_as_sf/index.html" class="no-external">

 <img loading="lazy" src="https://duckdb.org/images/logo-dl/DuckDB_Logo-horizontal.svg" class="thumbnail-image"></a></div>
@@ -220,7 +220,7 @@ <h3 class="no-anchor listing-title">
 </a>
 </div>
 </div>
-<div class="quarto-post image-right" data-index="1" data-categories="UiUyQ2R1Y2tkYiUyQ2Fycm93" data-listing-date-sort="1752192000000" data-listing-file-modified-sort="1754756650135" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="2" data-listing-word-count-sort="319">
+<div class="quarto-post image-right" data-index="1" data-categories="UiUyQ2R1Y2tkYiUyQ2Fycm93" data-listing-date-sort="1752192000000" data-listing-file-modified-sort="1754840660739" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="2" data-listing-word-count-sort="327">
 <div class="thumbnail"><a href="./posts/2025-07-10-to_arrow_bad/index.html" class="no-external">

 <img loading="lazy" src="https://duckplyr.tidyverse.org/logo.png" class="thumbnail-image"></a></div>
@@ -252,7 +252,7 @@ <h3 class="no-anchor listing-title">
 </a>
 </div>
 </div>
-<div class="quarto-post image-right" data-index="2" data-categories="UiUyQ2JlbmNobWFyaw==" data-listing-date-sort="1752105600000" data-listing-file-modified-sort="1754756620310" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="1" data-listing-word-count-sort="117">
+<div class="quarto-post image-right" data-index="2" data-categories="UiUyQ2JlbmNobWFyaw==" data-listing-date-sort="1752105600000" data-listing-file-modified-sort="1754840624562" data-listing-date-modified-sort="NaN" data-listing-reading-time-sort="1" data-listing-word-count-sort="117">
 <div class="thumbnail"><a href="./posts/2025-07-10-benchmark/index.html" class="no-external">

 <img loading="lazy" src="https://duckdb.org/images/logo-dl/DuckDB_Logo-horizontal.svg" class="thumbnail-image"></a></div>

docs/en/posts.xml

Lines changed: 16 additions & 16 deletions
@@ -970,30 +970,30 @@ font-style: inherit;">kable</span>()</span></code></pre></div>
 <tbody>
 <tr class="odd">
 <td style="text-align: left;">with_arrow()</td>
-<td style="text-align: right;">61.147</td>
+<td style="text-align: right;">125.123</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149664</td>
-<td style="text-align: right;">21835576</td>
-<td style="text-align: right;">76.045</td>
-<td style="text-align: right;">17.691</td>
+<td style="text-align: right;">153200</td>
+<td style="text-align: right;">21206428</td>
+<td style="text-align: right;">157.164</td>
+<td style="text-align: right;">50.598</td>
 </tr>
 <tr class="even">
 <td style="text-align: left;">with_copy_to()</td>
-<td style="text-align: right;">7.480</td>
+<td style="text-align: right;">28.969</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149104</td>
-<td style="text-align: right;">9096224</td>
-<td style="text-align: right;">66.407</td>
-<td style="text-align: right;">9.944</td>
+<td style="text-align: right;">158720</td>
+<td style="text-align: right;">11870840</td>
+<td style="text-align: right;">157.578</td>
+<td style="text-align: right;">58.525</td>
 </tr>
 <tr class="odd">
 <td style="text-align: left;">with_duckplyr()</td>
-<td style="text-align: right;">7.013</td>
+<td style="text-align: right;">33.704</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149104</td>
-<td style="text-align: right;">11818744</td>
-<td style="text-align: right;">54.990</td>
-<td style="text-align: right;">10.564</td>
+<td style="text-align: right;">164088</td>
+<td style="text-align: right;">11933724</td>
+<td style="text-align: right;">128.338</td>
+<td style="text-align: right;">50.864</td>
 </tr>
 </tbody>
 </table>
@@ -1004,7 +1004,7 @@ font-style: inherit;">plot</span>(res)</span></code></pre></div>
 <div class="cell-output-display">
 <div>
 <figure class="figure">
-<p><img src="https://nbc.github.io/en/posts/2025-07-10-to_arrow_bad/index_files/figure-html/output_benchmark-1.png" class="img-fluid figure-img" width="672"></p>
+<p><img src="https://nbc.github.io/en/posts/2025-07-10-to_arrow_bad/index.en_files/figure-html/output_benchmark-1.png" class="img-fluid figure-img" width="672"></p>
 </figure>
 </div>
 </div>

docs/en/posts/2025-07-10-to_arrow_bad/index.html

Lines changed: 16 additions & 16 deletions
@@ -290,38 +290,38 @@ <h1 class="title">Comparison between arrow::to_arrow() and duckplyr for writing
 <tbody>
 <tr class="odd">
 <td style="text-align: left;">with_arrow()</td>
-<td style="text-align: right;">61.147</td>
+<td style="text-align: right;">125.123</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149664</td>
-<td style="text-align: right;">21835576</td>
-<td style="text-align: right;">76.045</td>
-<td style="text-align: right;">17.691</td>
+<td style="text-align: right;">153200</td>
+<td style="text-align: right;">21206428</td>
+<td style="text-align: right;">157.164</td>
+<td style="text-align: right;">50.598</td>
 </tr>
 <tr class="even">
 <td style="text-align: left;">with_copy_to()</td>
-<td style="text-align: right;">7.480</td>
+<td style="text-align: right;">28.969</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149104</td>
-<td style="text-align: right;">9096224</td>
-<td style="text-align: right;">66.407</td>
-<td style="text-align: right;">9.944</td>
+<td style="text-align: right;">158720</td>
+<td style="text-align: right;">11870840</td>
+<td style="text-align: right;">157.578</td>
+<td style="text-align: right;">58.525</td>
 </tr>
 <tr class="odd">
 <td style="text-align: left;">with_duckplyr()</td>
-<td style="text-align: right;">7.013</td>
+<td style="text-align: right;">33.704</td>
 <td style="text-align: left;">NA</td>
-<td style="text-align: right;">149104</td>
-<td style="text-align: right;">11818744</td>
-<td style="text-align: right;">54.990</td>
-<td style="text-align: right;">10.564</td>
+<td style="text-align: right;">164088</td>
+<td style="text-align: right;">11933724</td>
+<td style="text-align: right;">128.338</td>
+<td style="text-align: right;">50.864</td>
 </tr>
 </tbody>
 </table>
 </div>
 <div class="sourceCode cell-code" id="cb5"><pre class="sourceCode r code-with-copy"><code class="sourceCode r"><span id="cb5-1"><a href="#cb5-1" aria-hidden="true" tabindex="-1"></a><span class="fu">plot</span>(res)</span></code><button title="Copy to Clipboard" class="code-copy-button"><i class="bi"></i></button></pre></div>
 <div class="cell-output-display">
 <div>
-<figure class="figure"><p><img src="index_files/figure-html/output_benchmark-1.png" class="img-fluid figure-img" width="672"></p>
+<figure class="figure"><p><img src="index.en_files/figure-html/output_benchmark-1.png" class="img-fluid figure-img" width="672"></p>
 </figure>
 </div>
 </div>
Binary file not shown.
