+ "markdown": "---\ntitle: \"From duckdb to st_to_sf\"\ndescription: \"How to convert a duckdb extraction into an `sf` object\"\nlang: en\ndate: 2025-07-12\ncategories: [duckdb, arrow, sf, geoarrow]\nimage: https://duckdb.org/images/logo-dl/DuckDB_Logo-horizontal.svg\ndraft: false # setting this to `true` will prevent your post from appearing on your listing page until you're ready!\n---\n\nUntil recently, generating an SF dataframe from a duckdb query required:\n\n1. Using `ST_AsWKB` or `ST_AsText` on the geometry column \n2. Materializing the data to transfer it to `sf::st_as_sf`\n\nWith recent versions of duckdb, the spatial extension, and the geoarrow package, you can now ask duckdb to produce data that can be directly reused by `geoarrow`:\n\n\n::: {.cell}\n\n```{.r .cell-code}\nlibrary(geoarrow)\nlibrary(duckdb)\nlibrary(sf)\n\ncon <- dbConnect(duckdb())\n\nurl <- \"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\"\n\nx <- dbExecute(con, \"LOAD spatial;\")\nx <- dbExecute(con, \"LOAD httpfs;\")\nx <- dbExecute(con, \"CALL register_geoarrow_extensions()\") # <1>\n\ndplyr::tbl(con, dplyr::sql(glue::glue(\"SELECT geometry \n FROM read_parquet('{url}')\n LIMIT 5\"))) |> # <2>\n arrow::to_arrow() |> # <3>\n st_as_sf(crs=st_crs(2154)) # <4>\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\nSimple feature collection with 5 features and 0 fields\nGeometry type: POINT\nDimension: XY\nBounding box: xmin: 3.735375 ymin: 49.38698 xmax: 3.738175 ymax: 49.39506\nProjected CRS: RGF93 v1 / Lambert-93\n geometry\n1 POINT (3.738175 49.39245)\n2 POINT (3.735375 49.38829)\n3 POINT (3.735446 49.39507)\n4 POINT (3.738132 49.38698)\n5 POINT (3.735748 49.38712)\n```\n\n\n:::\n:::\n\n\n1. Instructs duckdb spatial to add geoarrow metadata to geometry-type columns \n2. Thanks to the previous command, this line will return geometries readable by geoarrow \n3. This line converts the object into an arrow object \n4. 
geoarrow overrides the `st_as_sf` function so it can directly read the arrow object \n\n## A quick comparison\n\nAnd it’s **much** faster than all other methods:\n\n\n::: {.cell}\n\n```{.r .cell-code code-fold=\"true\" code-summary=\"Show me the benchmark code\"}\nlibrary(arrow)\nlibrary(duckdb)\nlibrary(sf)\nlibrary(dplyr)\nlibrary(glue)\nlibrary(timemoir)\nlibrary(geoarrow)\n\nsample_size <- 1e8\n\nif (!file.exists(\"geo.parquet\")) {\n download.file(\"https://static.data.gouv.fr/resources/sirene-geolocalise-parquet/20240107-143656/sirene2024-geo.parquet\", \"geo.parquet\")\n}\n\nwith_register_geoarrow <- function() {\n conn_ddb <- dbConnect(duckdb())\n dbExecute(conn_ddb, \"LOAD spatial;\")\n dbExecute(conn_ddb, \"CALL register_geoarrow_extensions()\")\n \n query <- dplyr::tbl(conn_ddb, sql(glue(\"SELECT * FROM read_parquet('geo.parquet') LIMIT {sample_size}\"))) |>\n arrow::to_arrow() |>\n st_as_sf(crs=st_crs(2154))\n \n dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_st_read <- function() {\n conn_ddb <- dbConnect(duckdb())\n on.exit(dbDisconnect(conn_ddb, shutdown = TRUE))\n dbExecute(conn_ddb, \"LOAD spatial;\")\n \n a <- st_read(\n conn_ddb, \n query=glue(\n \"SELECT * REPLACE(geometry.ST_ASWKB() AS geometry) FROM read_parquet('geo.parquet') \n WHERE geometry IS NOT NULL LIMIT {sample_size}\"\n ), \n geometry_column = \"geometry\") |>\n st_set_crs(2154)\n dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_get_query_aswkb <- function() {\n conn_ddb <- dbConnect(duckdb())\n on.exit(dbDisconnect(conn_ddb, shutdown = TRUE))\n dbExecute(conn_ddb, \"LOAD spatial;\")\n \n query <- dbGetQuery(\n conn_ddb, \n glue(\n \"\n SELECT * REPLACE(geometry.ST_AsWKB() AS geometry) FROM read_parquet('geo.parquet') \n WHERE geometry IS NOT NULL LIMIT {sample_size}\n \"\n )\n ) |>\n sf::st_as_sf(crs = st_crs(2154))\n dbDisconnect(conn_ddb, shutdown = TRUE)\n}\n\nwith_get_query_astxt <- function() {\n conn_ddb <- dbConnect(duckdb())\n on.exit(dbDisconnect(conn_ddb, shutdown = 
TRUE))\n dbExecute(conn_ddb, \"LOAD spatial;\")\n \n query <- dbGetQuery(\n conn_ddb, \n glue(\n \"\n SELECT * REPLACE(geometry.ST_AsText() AS geometry) FROM read_parquet('geo.parquet')\n WHERE geometry IS NOT NULL LIMIT {sample_size}\n \"\n )\n ) |>\n sf::st_as_sf(wkt = \"geometry\", crs = st_crs(2154))\n}\n```\n:::\n\n\n\n::: {.cell}\n\n```{.r .cell-code}\nres <- timemoir(\n with_register_geoarrow(), \n with_st_read(),\n with_get_query_aswkb(),\n with_get_query_astxt())\n```\n\n```{.r .cell-code}\nres |>\n kableExtra::kable()\n```\n\n::: {.cell-output-display}\n\n\n|fname | duration|error | start_mem| max_mem| cpu_user| cpu_sys|\n|:------------------------|--------:|:-----|---------:|--------:|--------:|-------:|\n|with_register_geoarrow() | 102.676|NA | 256252| 26895240| 85.800| 16.457|\n|with_st_read() | 552.807|NA | 257016| 25266148| 496.485| 54.894|\n|with_get_query_aswkb() | 593.295|NA | 285444| 25291172| 552.240| 77.263|\n|with_get_query_astxt() | 450.991|NA | 286640| 24853872| 426.793| 63.129|\n\n\n:::\n\n```{.r .cell-code}\nplot(res)\n```\n\n::: {.cell-output-display}\n{width=672}\n:::\n:::\n\n\n## Some useful links\n\nThere isn’t much documentation about this command:\n\n* [A webinar from the R Consortium](https://youtu.be/tjNEoIYr_ag?t=1641) \n* [A geoarrow issue](https://github.com/duckdb/duckdb-spatial/issues/589)\n\n::: {.callout-note collapse=true}\n## Session Information\n\n::: {.cell}\n\n```{.r .cell-code}\ndevtools::session_info(pkgs = \"attached\")\n```\n\n::: {.cell-output .cell-output-stdout}\n\n```\n─ Session info ───────────────────────────────────────────────────────────────\n setting value\n version R version 4.5.0 (2025-04-11)\n os Ubuntu 22.04.5 LTS\n system x86_64, linux-gnu\n ui X11\n language (EN)\n collate en_US.UTF-8\n ctype en_US.UTF-8\n tz Etc/UTC\n date 2025-08-09\n pandoc 3.7.0.2 @ /usr/bin/ (via rmarkdown)\n quarto 1.7.31 @ /usr/local/bin/quarto\n\n─ Packages ───────────────────────────────────────────────────────────────────\n 
package * version date (UTC) lib source\n arrow * 20.0.0.2 2025-05-26 [1] RSPM (R 4.5.0)\n DBI * 1.2.3 2024-06-02 [1] RSPM (R 4.5.0)\n dplyr * 1.1.4 2023-11-17 [1] RSPM (R 4.5.0)\n duckdb * 1.3.0 2025-06-02 [1] RSPM (R 4.5.0)\n geoarrow * 0.3.0 2025-05-26 [1] RSPM (R 4.5.0)\n glue * 1.8.0 2024-09-30 [1] RSPM (R 4.5.0)\n sf * 1.0-21 2025-05-15 [1] RSPM (R 4.5.0)\n timemoir * 0.8.0.9000 2025-08-09 [1] Github (nbc/timemoir@646734a)\n\n [1] /usr/local/lib/R/site-library\n [2] /usr/local/lib/R/library\n * ── Packages attached to the search path.\n\n──────────────────────────────────────────────────────────────────────────────\n```\n\n\n:::\n:::\n\n:::\n",