nbc
diff --git a/_freeze/posts/2025-07-10-benchmark/index.en/execute-results/html.json b/_freeze/posts/2025-07-10-benchmark/index.en/execute-results/html.json
Lines changed: 17 additions & 0 deletions
diff --git a/_freeze/posts/2025-07-10-benchmark/index.en/figure-html/timemoir-1.png b/_freeze/posts/2025-07-10-benchmark/index.en/figure-html/timemoir-1.png
26.9 KB
diff --git a/_freeze/posts/2025-07-10-st_as_sf/index.en/execute-results/html.json b/_freeze/posts/2025-07-10-st_as_sf/index.en/execute-results/html.json
Lines changed: 15 additions & 0 deletions
diff --git a/_freeze/posts/2025-07-10-st_as_sf/index.en/figure-html/output_benchmark-1.png b/_freeze/posts/2025-07-10-st_as_sf/index.en/figure-html/output_benchmark-1.png
31.6 KB
diff --git a/_freeze/posts/2025-07-10-to_arrow_bad/index.en/execute-results/html.json b/_freeze/posts/2025-07-10-to_arrow_bad/index.en/execute-results/html.json
Lines changed: 15 additions & 0 deletions
diff --git a/_freeze/posts/2025-07-10-to_arrow_bad/index.en/figure-html/output_benchmark-1.png b/_freeze/posts/2025-07-10-to_arrow_bad/index.en/figure-html/output_benchmark-1.png
29.5 KB
@@ -0,0 +1,17 @@
+{
+  "hash": "efb84892311da939974f01581bdcaf9a",
+  "result": {
+    "engine": "knitr",
+    "markdown": "---\ntitle: \"Comparing DuckDB/Arrow Performance\"\ndescription: \"How to evaluate memory and CPU usage for long-running processes in duckdb/arrow\"\nlang: en\ndate: 2025-07-10\ncategories: [timemoir, benchmark]\nimage: https://duckdb.org/images/logo-dl/DuckDB_Logo-horizontal.svg\ndraft: false # setting this to `true` will prevent your post from appearing on your listing page until you're ready!\n---\n\nWhen comparing different approaches, the ideal is to run the code under a benchmarking tool, but the \"classic\" R tools are not well suited to `duckdb` and/or `arrow` code:\n\n- `tictoc` only returns elapsed time\n- `bench` does not detect memory allocations from duckdb and arrow\n- ...\n\nIn my articles, I will regularly use [timemoir](https://github.com/nbc/timemoir), written specifically for this type of comparison:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(timemoir)\n\ntest_function <- function(n) {\n  x <- rnorm(n); mean(x)\n}\n\nres <- timemoir(\n  test_function(1.2e7),\n  test_function(4e7),\n  test_function(1e8)\n)\n```\n\n```{.r .cell-code}\nres |> \n  kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname                  | duration|error | start_mem| max_mem| cpu_user| cpu_sys|\n|:----------------------|--------:|:-----|---------:|-------:|--------:|-------:|\n|test_function(1.2e+07) |    1.823|NA    |    110012|  204736|    1.455|   0.137|\n|test_function(4e+07)   |    4.600|NA    |    109296|  423636|    3.996|   0.276|\n|test_function(1e+08)   |    9.564|NA    |    109232|  892384|    9.065|   0.495|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output-display}\n![](index.en_files/figure-html/timemoir-1.png){width=672}\n:::\n:::\n\n\n---\n\nThat said, these are not rigorous benchmarks, which are well beyond the scope of this blog, but quick comparisons intended to give a rough idea of relative performance.\n\n::: {.callout-note collapse=true}\n## Session 
Information\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = \"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting  value\n version  R version 4.5.0 (2025-04-11)\n os       Ubuntu 22.04.5 LTS\n system   x86_64, linux-gnu\n ui       X11\n language (EN)\n collate  en_US.UTF-8\n ctype    en_US.UTF-8\n tz       Etc/UTC\n date     2025-08-09\n pandoc   3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto   1.7.31 @ /usr/local/bin/quarto\n\n─ Packages ───────────────────────────────────────────────────────────────────\n package  * version    date (UTC) lib source\n timemoir * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",
+    "supporting": [
+      "index.en_files"
+    ],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}
@@ -0,0 +1,15 @@
+{
+  "hash": "7e8479c9c38cc2f9a515264ad529b122",
+  "result": {
+    "engine": "knitr",
+    "markdown": "---\ntitle: \"From duckdb to st_as_sf\"\ndescription: \"How to convert a duckdb extraction into an `sf` object\"\nlang: en\ndate: 2025-07-12\ncategories: [duckdb, arrow, sf, geoarrow]\nimage: https://duckdb.org/images/logo-dl/DuckDB_Logo-horizontal.svg\ndraft: false # setting this to `true` will prevent your post from appearing on your listing page until you're ready!\n---\n\nUntil recently, generating an `sf` data frame from a duckdb query required:\n\n1. Using `ST_AsWKB` or `ST_AsText` on the geometry column  \n2. Materializing the data to transfer it to `sf::st_as_sf`\n\nWith recent versions of duckdb, the spatial extension, and the geoarrow package, you can now ask duckdb to produce data that `geoarrow` can reuse directly:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(geoarrow)\nlibrary(duckdb)\nlibrary(sf)\n\ncon <- dbConnect(duckdb())\n\nurl <- \"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\"\n\nx <- dbExecute(con, \"LOAD spatial;\")\nx <- dbExecute(con, \"LOAD httpfs;\")\nx <- dbExecute(con, \"CALL register_geoarrow_extensions()\")        # <1>\n\ndplyr::tbl(con, dplyr::sql(glue::glue(\"SELECT geometry \n                                       FROM read_parquet('{url}')\n                                       LIMIT 5\"))) |>             # <2>\n  arrow::to_arrow() |>                                            # <3>\n  st_as_sf(crs=st_crs(2154))                                      # <4>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSimple feature collection with 5 features and 0 fields\nGeometry type: POINT\nDimension:     XY\nBounding box:  xmin: 3.735375 ymin: 49.38698 xmax: 3.738175 ymax: 49.39506\nProjected CRS: RGF93 v1 / Lambert-93\n                   geometry\n1 POINT (3.738175 49.39245)\n2 POINT (3.735375 49.38829)\n3 POINT (3.735446 49.39507)\n4 POINT (3.738132 49.38698)\n5 POINT (3.735748 49.38712)\n```\n\n\n:::\n:::\n\n\n1. 
Instructs duckdb spatial to add geoarrow metadata to geometry-type columns  \n2. Thanks to the previous command, this line will return geometries readable by geoarrow  \n3. This line converts the object into an arrow object  \n4. geoarrow provides an `st_as_sf` method for arrow objects, so the result can be read directly  \n\n## A quick comparison\n\nIn the run below, it is roughly 4 to 6 times faster than the other methods:\n\n\n::: {.cell}\n\n```{.r .cell-code  code-fold=\"true\" code-summary=\"Show me the benchmark code\"}\nlibrary(arrow)\nlibrary(duckdb)\nlibrary(sf)\nlibrary(dplyr)\nlibrary(glue)\nlibrary(timemoir)\nlibrary(geoarrow)\n\nsample_size <- 1e8\n\nif (!file.exists(\"geo.parquet\")) {\n  download.file(\"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\", \"geo.parquet\")\n}\n\nwith_register_geoarrow <- function() {\n  conn_ddb <- dbConnect(duckdb())\n  dbExecute(conn_ddb, \"LOAD spatial;\")\n  dbExecute(conn_ddb, \"CALL register_geoarrow_extensions()\")\n  \n  query <- dplyr::tbl(conn_ddb, sql(glue(\"SELECT * FROM read_parquet('geo.parquet') LIMIT {sample_size}\"))) |>\n    arrow::to_arrow() |>\n    st_as_sf(crs=st_crs(2154))\n  \n  dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_st_read <- function() {\n  conn_ddb <- dbConnect(duckdb())\n  on.exit(dbDisconnect(conn_ddb, shutdown = TRUE))\n  dbExecute(conn_ddb, \"LOAD spatial;\")\n  \n  a <- st_read(\n    conn_ddb, \n    query=glue(\n      \"SELECT * REPLACE(geometry.ST_ASWKB() AS geometry) FROM read_parquet('geo.parquet') \n      WHERE geometry IS NOT NULL LIMIT {sample_size}\"\n    ), \n    geometry_column = \"geometry\") |>\n    st_set_crs(2154)\n  dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_get_query_aswkb <- function() {\n  conn_ddb <- dbConnect(duckdb())\n  on.exit(dbDisconnect(conn_ddb, shutdown = TRUE))\n  dbExecute(conn_ddb, \"LOAD spatial;\")\n  \n  query <- dbGetQuery(\n    conn_ddb, \n    glue(\n      \"\n      SELECT * REPLACE(geometry.ST_AsWKB() AS geometry) 
FROM read_parquet('geo.parquet') \n      WHERE geometry IS NOT NULL LIMIT {sample_size}\n      \"\n    )\n  ) |>\n    sf::st_as_sf(crs = st_crs(2154))\n  dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_get_query_astxt <- function() {\n  conn_ddb <- dbConnect(duckdb())\n  on.exit(dbDisconnect(conn_ddb, shutdown = TRUE))\n  dbExecute(conn_ddb, \"LOAD spatial;\")\n  \n  query <- dbGetQuery(\n    conn_ddb, \n    glue(\n      \"\n      SELECT * REPLACE(geometry.ST_AsText() AS geometry) FROM read_parquet('geo.parquet')\n      WHERE geometry IS NOT NULL LIMIT {sample_size}\n      \"\n    )\n  ) |>\n    sf::st_as_sf(wkt = \"geometry\", crs = st_crs(2154))\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- timemoir(\n  with_register_geoarrow(), \n  with_st_read(),\n  with_get_query_aswkb(),\n  with_get_query_astxt())\n```\n\n```{.r .cell-code}\nres |>\n  kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname                    | duration|error | start_mem|  max_mem| cpu_user| cpu_sys|\n|:------------------------|--------:|:-----|---------:|--------:|--------:|-------:|\n|with_register_geoarrow() |  102.676|NA    |    256252| 26895240|   85.800|  16.457|\n|with_st_read()           |  552.807|NA    |    257016| 25266148|  496.485|  54.894|\n|with_get_query_aswkb()   |  593.295|NA    |    285444| 25291172|  552.240|  77.263|\n|with_get_query_astxt()   |  450.991|NA    |    286640| 24853872|  426.793|  63.129|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output-display}\n![](index.en_files/figure-html/output_benchmark-1.png){width=672}\n:::\n:::\n\n\n## Some useful links\n\nThere isn’t much documentation about this command:\n\n* [A webinar from the R Consortium](https://youtu.be/tjNEoIYr_ag?t=1641)  \n* [A geoarrow issue](https://github.com/duckdb/duckdb-spatial/issues/589)\n\n::: {.callout-note collapse=true}\n## Session Information\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = \"attached\")\n```\n\n::: 
{.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting  value\n version  R version 4.5.0 (2025-04-11)\n os       Ubuntu 22.04.5 LTS\n system   x86_64, linux-gnu\n ui       X11\n language (EN)\n collate  en_US.UTF-8\n ctype    en_US.UTF-8\n tz       Etc/UTC\n date     2025-08-09\n pandoc   3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto   1.7.31 @ /usr/local/bin/quarto\n\n─ Packages ───────────────────────────────────────────────────────────────────\n package  * version    date (UTC) lib source\n arrow    * 20.0.0.2   2025-05-26 [1] RSPM (R 4.5.0)\n DBI      * 1.2.3      2024-06-02 [1] RSPM (R 4.5.0)\n dplyr    * 1.1.4      2023-11-17 [1] RSPM (R 4.5.0)\n duckdb   * 1.3.0      2025-06-02 [1] RSPM (R 4.5.0)\n geoarrow * 0.3.0      2025-05-26 [1] RSPM (R 4.5.0)\n glue     * 1.8.0      2024-09-30 [1] RSPM (R 4.5.0)\n sf       * 1.0-21     2025-05-15 [1] RSPM (R 4.5.0)\n timemoir * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",
+    "supporting": [],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}
@@ -0,0 +1,15 @@
+{
+  "hash": "eab4040606e112968db1a6e823d43ec5",
+  "result": {
+    "engine": "knitr",
+    "markdown": "---\ntitle: \"Comparison between arrow::to_arrow() and duckplyr for writing Parquet files\"\ndescription: \"Why You Should Avoid arrow::to_arrow() with DuckDB + dplyr\"\nlang: en\ndate: 2025-07-11\ncategories: [duckdb, arrow]\nimage: https://duckplyr.tidyverse.org/logo.png \ndraft: false\n---\n\nA commonly recommended way to write a Parquet file after using `dplyr::tbl` with `duckdb` is to combine `arrow::to_arrow` with `arrow::write_dataset` or `arrow::write_parquet`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ntbl(con, \"read_parquet('geo.parquet')\") |>\n  ...\n  arrow::to_arrow() |>\n  arrow::write_dataset(\"my_dataset\")\n```\n:::\n\n\nWhile this syntax works, the new [duckplyr](https://duckplyr.tidyverse.org/index.html) package offers a **much more efficient** alternative:\n\n\n::: {.cell}\n\n```{.r .cell-code}\ncon <- dbConnect(duckdb())\n\ntbl(con, \"read_parquet('geo.parquet')\") |>\n  ...\n  duckplyr::as_duckdb_tibble() |> # <1>\n  duckplyr::compute_parquet(\"my_tbl.parquet\") # <2>\n```\n:::\n\n\n1. [`duckplyr::as_duckdb_tibble`](https://duckplyr.tidyverse.org/reference/duckdb_tibble.html) converts the object returned by `tbl()` into a `duckplyr` object\n2. [`duckplyr::compute_parquet`](https://duckplyr.tidyverse.org/reference/compute_parquet.html) writes the Parquet file\n\nThese two lines achieve the same result as the Arrow version.\n\n## A Quick Benchmark\n\nHere are the results from benchmarking three common methods (with full reproducible code below):\n\n- `with_arrow`: using `arrow::to_arrow()` + `write_dataset()`\n- `with_duckplyr`: using `duckplyr::as_duckdb_tibble()` + `compute_parquet()`\n- `with_copy_to`: using DuckDB’s native `COPY ... 
TO ...` as a baseline\n\n\n::: {.cell}\n\n```{.r .cell-code  code-fold=\"true\" code-summary=\"Show me the benchmark code\"}\nlibrary(duckdb)\nlibrary(dplyr)\nlibrary(arrow)\nlibrary(kableExtra)\nlibrary(timemoir)\n\nif (!file.exists(\"geo.parquet\")) {\n  download.file(\"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\", \"geo.parquet\")\n}\n\n# Full DuckDB method\nwith_copy_to <- function() {\n  con <- dbConnect(duckdb())\n  on.exit(dbDisconnect(con, shutdown = TRUE))\n\n  dbExecute(con, \"COPY (FROM read_parquet('geo.parquet')) TO 'test.parquet' (FORMAT PARQUET, COMPRESSION ZSTD)\")\n}\n\n# \"Historical\" version with Arrow\nwith_arrow <- function() {\n  con <- dbConnect(duckdb())\n  on.exit(dbDisconnect(con, shutdown = TRUE))\n\n  tbl(con, \"read_parquet('geo.parquet')\") |>\n    arrow::to_arrow() |>\n    arrow::write_dataset('test', compression='zstd')\n}\n\n# Version using the new duckplyr package\nwith_duckplyr <- function() {\n  con <- dbConnect(duckdb())\n  on.exit(dbDisconnect(con, shutdown = TRUE))\n\n  tbl(con, \"read_parquet('geo.parquet')\") |>\n    duckplyr::as_duckdb_tibble() |>\n    duckplyr::compute_parquet(\"my_tbl.parquet\")\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- timemoir(\n  with_arrow(), \n  with_copy_to(), \n  with_duckplyr()\n)\n```\n\n```{.r .cell-code}\nres |>\n  kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname           | duration|error                                 | start_mem|  max_mem| cpu_user| cpu_sys|\n|:---------------|--------:|:-------------------------------------|---------:|--------:|--------:|-------:|\n|with_arrow()    |  132.820|NA                                    |    155100| 21481404|  159.980|  50.707|\n|with_copy_to()  |   32.043|NA                                    |    162624| 12917092|  171.558|  70.419|\n|with_duckplyr() |       NA|there is no package called 'duckplyr' |        NA|       NA|       NA|      
NA|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output .cell-output-stderr}\n\n```\nWarning: Removed 3 rows containing missing values or values outside the scale range\n(`geom_col()`).\n```\n\n\n:::\n\n::: {.cell-output-display}\n![](index.en_files/figure-html/output_benchmark-1.png){width=672}\n:::\n:::\n\n\n---\n\nOn the server I use, the `duckplyr` version is **6× faster** than the `arrow` version and uses **half the memory**, performing on par with pure DuckDB. In the run shown above, however, `with_duckplyr()` failed with `there is no package called 'duckplyr'`, so the table and plot only cover `with_arrow()` and `with_copy_to()`.\n\n## Conclusion\n\nIf you're working with `dplyr`, stop using `to_arrow()` and switch to `duckplyr` for better performance.\n\n## Useful Links\n\n- [duckplyr documentation](https://duckplyr.tidyverse.org/articles/large.html)\n\n---\n\n::: {.callout-note collapse=true}\n## Session Info\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = \"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting  value\n version  R version 4.5.0 (2025-04-11)\n os       Ubuntu 22.04.5 LTS\n system   x86_64, linux-gnu\n ui       X11\n language (EN)\n collate  en_US.UTF-8\n ctype    en_US.UTF-8\n tz       Etc/UTC\n date     2025-08-09\n pandoc   3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto   1.7.31 @ /usr/local/bin/quarto\n\n─ Packages ───────────────────────────────────────────────────────────────────\n package    * version    date (UTC) lib source\n arrow      * 20.0.0.2   2025-05-26 [1] RSPM (R 4.5.0)\n DBI        * 1.2.3      2024-06-02 [1] RSPM (R 4.5.0)\n dplyr      * 1.1.4      2023-11-17 [1] RSPM (R 4.5.0)\n duckdb     * 1.3.0      2025-06-02 [1] RSPM (R 4.5.0)\n kableExtra * 1.4.0      2024-01-24 [1] RSPM (R 4.5.0)\n timemoir   * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search 
path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",
+    "supporting": [],
+    "filters": [
+      "rmarkdown/pagebreak.lua"
+    ],
+    "includes": {},
+    "engineDependencies": {},
+    "preserve": {},
+    "postProcess": true
+  }
+}