
Commit 05e07ba

Site generation
1 parent c57e463 commit 05e07ba


67 files changed: +15080 -786 lines changed

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
{
"hash": "efb84892311da939974f01581bdcaf9a",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Comparing DuckDB/Arrow Performance\"\ndescription: \"How to evaluate memory and CPU usage for long-running processes in duckdb/arrow\"\nlang: en\ndate: 2025-07-10\ncategories: [timemoir, benchmark]\nimage: https://duckdb.org/images/logo-dl/DuckDB_Logo-horizontal.svg\ndraft: false # setting this to `true` will prevent your post from appearing on your listing page until you're ready!\n---\n\nWhen it comes to comparing different approaches, the ideal scenario is to run the code in benchmarking tools, but the \"classic\" R tools are not well suited for comparing `duckdb` and/or `arrow` code:\n\n- `tictoc` only returns elapsed time\n- `bench` does not detect memory allocations from duckdb and arrow\n- ...\n\nIn my articles, I will regularly use [timemoir](https://github.com/nbc/timemoir), written specifically for this type of comparison:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(timemoir)\n\ntest_function <- function(n) {\n x <- rnorm(n); mean(x)\n}\n\nres <- timemoir(\n test_function(1.2e7),\n test_function(4e7),\n test_function(1e8)\n)\n```\n\n```{.r .cell-code}\nres |> \n kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname | duration|error | start_mem| max_mem| cpu_user| cpu_sys|\n|:----------------------|--------:|:-----|---------:|-------:|--------:|-------:|\n|test_function(1.2e+07) | 1.823|NA | 110012| 204736| 1.455| 0.137|\n|test_function(4e+07) | 4.600|NA | 109296| 423636| 3.996| 0.276|\n|test_function(1e+08) | 9.564|NA | 109232| 892384| 9.065| 0.495|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output-display}\n![](index.en_files/figure-html/timemoir-1.png){width=672}\n:::\n:::\n\n\n---\n\nThat said, these are not \"true\" rigorous benchmarks, which are well beyond the scope of this blog, but quick comparisons intended to give a rough idea of relative performance.\n\n::: {.callout-note collapse=true}\n## Session Information\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = 
\"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting value\n version R version 4.5.0 (2025-04-11)\n os Ubuntu 22.04.5 LTS\n system x86_64, linux-gnu\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz Etc/UTC\n date 2025-08-09\n pandoc 3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto 1.7.31 @ /usr/local/bin/quarto\n\n─ Packages ───────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n timemoir * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",
"supporting": [
"index.en_files"
],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
{
"hash": "7e8479c9c38cc2f9a515264ad529b122",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"From duckdb to st_to_sf\"\ndescription: \"How to convert a duckdb extraction into an `sf` object\"\nlang: en\ndate: 2025-07-12\ncategories: [duckdb, arrow, sf, geoarrow]\nimage: https://duckdb.org/images/logo-dl/DuckDB_Logo-horizontal.svg\ndraft: false # setting this to `true` will prevent your post from appearing on your listing page until you're ready!\n---\n\nUntil recently, generating an SF dataframe from a duckdb query required:\n\n1. Using `ST_AsWKB` or `ST_AsText` on the geometry column \n2. Materializing the data to transfer it to `sf::st_as_sf`\n\nWith recent versions of duckdb, the spatial extension, and the geoarrow package, you can now ask duckdb to produce data that can be directly reused by `geoarrow`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(geoarrow)\nlibrary(duckdb)\nlibrary(sf)\n\ncon <- dbConnect(duckdb())\n\nurl <- \"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\"\n\nx <- dbExecute(con, \"LOAD spatial;\")\nx <- dbExecute(con, \"LOAD httpfs;\")\nx <- dbExecute(con, \"CALL register_geoarrow_extensions()\") # <1>\n\ndplyr::tbl(con, dplyr::sql(glue::glue(\"SELECT geometry \n FROM read_parquet('{url}')\n LIMIT 5\"))) |> # <2>\n arrow::to_arrow() |> # <3>\n st_as_sf(crs=st_crs(2154)) # <4>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSimple feature collection with 5 features and 0 fields\nGeometry type: POINT\nDimension: XY\nBounding box: xmin: 3.735375 ymin: 49.38698 xmax: 3.738175 ymax: 49.39506\nProjected CRS: RGF93 v1 / Lambert-93\n geometry\n1 POINT (3.738175 49.39245)\n2 POINT (3.735375 49.38829)\n3 POINT (3.735446 49.39507)\n4 POINT (3.738132 49.38698)\n5 POINT (3.735748 49.38712)\n```\n\n\n:::\n:::\n\n\n1. Instructs duckdb spatial to add geoarrow metadata to geometry-type columns \n2. Thanks to the previous command, this line will return geometries readable by geoarrow \n3. This line converts the object into an arrow object \n4. 
geoarrow overrides the `st_as_sf` function so it can directly read the arrow object \n\n## A quick comparison\n\nAnd it’s **much** faster than all other methods:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Show me the benchmark code\"}\nlibrary(arrow)\nlibrary(duckdb)\nlibrary(sf)\nlibrary(dplyr)\nlibrary(glue)\nlibrary(timemoir)\nlibrary(geoarrow)\n\nsample_size <- 1e8\n\nif (!file.exists(\"geo.parquet\")) {\n download.file(\"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\", \"geo.parquet\")\n}\n\nwith_register_geoarrow <- function() {\n conn_ddb <- dbConnect(duckdb())\n dbExecute(conn_ddb, \"LOAD spatial;\")\n dbExecute(conn_ddb, \"CALL register_geoarrow_extensions()\")\n \n query <- dplyr::tbl(conn_ddb, sql(glue(\"SELECT * FROM read_parquet('geo.parquet') LIMIT {sample_size}\"))) |>\n arrow::to_arrow() |>\n st_as_sf(crs=st_crs(2154))\n \n dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_st_read <- function() {\n conn_ddb <- dbConnect(duckdb())\n on.exit(dbDisconnect(conn_ddb, shutdown = TRUE))\n dbExecute(conn_ddb, \"LOAD spatial;\")\n \n a <- st_read(\n conn_ddb, \n query=glue(\n \"SELECT * REPLACE(geometry.ST_ASWKB() AS geometry) FROM read_parquet('geo.parquet') \n WHERE geometry IS NOT NULL LIMIT {sample_size}\"\n ), \n geometry_column = \"geometry\") |>\n st_set_crs(2154)\n dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_get_query_aswkb <- function() {\n conn_ddb <- dbConnect(duckdb())\n on.exit(dbDisconnect(conn_ddb, shutdown = TRUE))\n dbExecute(conn_ddb, \"LOAD spatial;\")\n \n query <- dbGetQuery(\n conn_ddb, \n glue(\n \"\n SELECT * REPLACE(geometry.ST_AsWKB() AS geometry) FROM read_parquet('geo.parquet') \n WHERE geometry IS NOT NULL LIMIT {sample_size}\n \"\n )\n ) |>\n sf::st_as_sf(crs = st_crs(2154))\n dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_get_query_astxt <- function() {\n conn_ddb <- dbConnect(duckdb())\n on.exit(dbDisconnect(conn_ddb, shutdown = 
TRUE))\n dbExecute(conn_ddb, \"LOAD spatial;\")\n \n query <- dbGetQuery(\n conn_ddb, \n glue(\n \"\n SELECT * REPLACE(geometry.ST_AsText() AS geometry) FROM read_parquet('geo.parquet')\n WHERE geometry IS NOT NULL LIMIT {sample_size}\n \"\n )\n ) |>\n sf::st_as_sf(wkt = \"geometry\", crs = st_crs(2154))\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- timemoir(\n with_register_geoarrow(), \n with_st_read(),\n with_get_query_aswkb(),\n with_get_query_astxt())\n```\n\n```{.r .cell-code}\nres |>\n kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname | duration|error | start_mem| max_mem| cpu_user| cpu_sys|\n|:------------------------|--------:|:-----|---------:|--------:|--------:|-------:|\n|with_register_geoarrow() | 102.676|NA | 256252| 26895240| 85.800| 16.457|\n|with_st_read() | 552.807|NA | 257016| 25266148| 496.485| 54.894|\n|with_get_query_aswkb() | 593.295|NA | 285444| 25291172| 552.240| 77.263|\n|with_get_query_astxt() | 450.991|NA | 286640| 24853872| 426.793| 63.129|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output-display}\n![](index.en_files/figure-html/output_benchmark-1.png){width=672}\n:::\n:::\n\n\n## Some useful links\n\nThere isn’t much documentation about this command:\n\n* [A webinar from the R Consortium](https://youtu.be/tjNEoIYr_ag?t=1641) \n* [A geoarrow issue](https://github.com/duckdb/duckdb-spatial/issues/589)\n\n::: {.callout-note collapse=true}\n## Session Information\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = \"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting value\n version R version 4.5.0 (2025-04-11)\n os Ubuntu 22.04.5 LTS\n system x86_64, linux-gnu\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz Etc/UTC\n date 2025-08-09\n pandoc 3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto 1.7.31 @ /usr/local/bin/quarto\n\n─ Packages 
───────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n arrow * 20.0.0.2 2025-05-26 [1] RSPM (R 4.5.0)\n DBI * 1.2.3 2024-06-02 [1] RSPM (R 4.5.0)\n dplyr * 1.1.4 2023-11-17 [1] RSPM (R 4.5.0)\n duckdb * 1.3.0 2025-06-02 [1] RSPM (R 4.5.0)\n geoarrow * 0.3.0 2025-05-26 [1] RSPM (R 4.5.0)\n glue * 1.8.0 2024-09-30 [1] RSPM (R 4.5.0)\n sf * 1.0-21 2025-05-15 [1] RSPM (R 4.5.0)\n timemoir * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
Lines changed: 15 additions & 0 deletions
@@ -0,0 +1,15 @@
{
"hash": "eab4040606e112968db1a6e823d43ec5",
"result": {
"engine": "knitr",
"markdown": "---\ntitle: \"Comparison between arrow::to_arrow() and duckplyr for writing Parquet files\"\ndescription: \"Why You Should Avoid arrow::to_arrow() with DuckDB + dplyr\"\nlang: en\ndate: 2025-07-11\ncategories: [duckdb, arrow]\nimage: https://duckplyr.tidyverse.org/logo.png \ndraft: false\n---\n\nA commonly recommended approach to write a Parquet file after using `dplyr::tbl` with `duckdb` is to use `arrow::to_arrow` with `arrow::write_dataset` or `arrow::write_parquet`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntbl(con, \"read_parquet('geo.parquet')\") |>\n ...\n arrow::to_arrow() |>\n arrow::write_dataset(\"my_dataset\")\n```\n:::\n\n\nWhile this syntax works, the new [duckplyr](https://duckplyr.tidyverse.org/index.html) package offers a **much more efficient** alternative:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncon <- dbConnect(duckdb())\n\ntbl(con, \"read_parquet('geo.parquet')\") |>\n ...\n duckplyr::as_duckdb_tibble() |> # <1>\n duckplyr::compute_parquet(\"my_tbl.parquet\") # <2>\n```\n:::\n\n\n1. [`duckplyr::as_duckdb_tibble`](https://duckplyr.tidyverse.org/reference/duckdb_tibble.html) converts the object returned by `tbl()` into a `duckplyr` object\n2. [`duckplyr::compute_parquet`](https://duckplyr.tidyverse.org/reference/compute_parquet.html) writes the Parquet file\n\nThese two lines achieve the same result as the Arrow version, but using `duckplyr` is **much more efficient**.\n\n## A Quick Benchmark\n\nHere are the results from benchmarking three common methods (with full reproducible code below):\n\n- `with_arrow`: using `arrow::to_arrow()` + `write_dataset()`\n- `with_duckplyr`: using `duckplyr::as_duckdb_tibble()` + `compute_parquet()`\n- `with_copy_to`: using DuckDB’s native `COPY ... 
TO ...` as a baseline\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Show me the benchmark code\"}\nlibrary(duckdb)\nlibrary(dplyr)\nlibrary(arrow)\nlibrary(kableExtra)\nlibrary(timemoir)\n\nif (!file.exists(\"geo.parquet\")) {\n download.file(\"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\", \"geo.parquet\")\n}\n\n# Full DuckDB method\nwith_copy_to <- function() {\n con <- dbConnect(duckdb())\n on.exit(dbDisconnect(con, shutdown = TRUE))\n\n dbExecute(con, \"COPY (FROM read_parquet('geo.parquet')) TO 'test.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)\")\n}\n\n# \"Historical\" version with Arrow\nwith_arrow <- function() {\n con <- dbConnect(duckdb())\n on.exit(dbDisconnect(con, shutdown = TRUE))\n\n tbl(con, \"read_parquet('geo.parquet')\") |>\n arrow::to_arrow() |>\n arrow::write_dataset('test', compression='zstd')\n}\n\n# Version using the new duckplyr package\nwith_duckplyr <- function() {\n con <- dbConnect(duckdb())\n on.exit(dbDisconnect(con, shutdown = TRUE))\n\n tbl(con, \"read_parquet('geo.parquet')\") |>\n duckplyr::as_duckdb_tibble() |>\n duckplyr::compute_parquet(\"my_tbl.parquet\")\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- timemoir(\n with_arrow(), \n with_copy_to(), \n with_duckplyr()\n)\n```\n\n```{.r .cell-code}\nres |>\n kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname | duration|error | start_mem| max_mem| cpu_user| cpu_sys|\n|:---------------|--------:|:-------------------------------------|---------:|--------:|--------:|-------:|\n|with_arrow() | 132.820|NA | 155100| 21481404| 159.980| 50.707|\n|with_copy_to() | 32.043|NA | 162624| 12917092| 171.558| 70.419|\n|with_duckplyr() | NA|there is no package called 'duckplyr' | NA| NA| NA| NA|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 3 rows containing missing values or values outside the scale 
range\n(`geom_col()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index.en_files/figure-html/output_benchmark-1.png){width=672}\n:::\n:::\n\n\n---\n\nOn the server I use, the `duckplyr` version is **6× faster** than the `arrow` version and uses **half the memory**, performing on par with pure DuckDB.\n\n## Conclusion\n\nIf you're working with `dplyr`, stop using `to_arrow()` and switch to `duckplyr` for better performance.\n\n## Useful Links\n\n- [duckplyr documentation](https://duckplyr.tidyverse.org/articles/large.html)\n\n---\n\n::: {.callout-note collapse=true}\n## Session Info\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = \"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting value\n version R version 4.5.0 (2025-04-11)\n os Ubuntu 22.04.5 LTS\n system x86_64, linux-gnu\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz Etc/UTC\n date 2025-08-09\n pandoc 3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto 1.7.31 @ /usr/local/bin/quarto\n\n─ Packages ───────────────────────────────────────────────────────────────────\n package * version date (UTC) lib source\n arrow * 20.0.0.2 2025-05-26 [1] RSPM (R 4.5.0)\n DBI * 1.2.3 2024-06-02 [1] RSPM (R 4.5.0)\n dplyr * 1.1.4 2023-11-17 [1] RSPM (R 4.5.0)\n duckdb * 1.3.0 2025-06-02 [1] RSPM (R 4.5.0)\n kableExtra * 1.4.0 2024-01-24 [1] RSPM (R 4.5.0)\n timemoir * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",
"supporting": [],
"filters": [
"rmarkdown/pagebreak.lua"
],
"includes": {},
"engineDependencies": {},
"preserve": {},
"postProcess": true
}
}
