duckdb date filter before recode results in up to 100,000x speed up in spod_get and spod_convert by e-kotov · Pull Request #166 · rOpenSpain/spanishoddata

e-kotov · 2025-06-14T17:36:51Z

what was the problem

spod_get() was very slow when all (or just many) of the files were downloaded but only a few dates were requested. The cause was the order in which the DuckDB views were stacked and queried. Originally described in #159.

We first imported the data into a raw CSV table view, then created a new view that recoded the data from Spanish to English and applied other improvements, such as adding factors/ENUMs and some extra columns to make the data more usable. Only on top of that was the filtering view created. That was VERY slow because, for whatever reason, DuckDB in this case ignored the hive structure of the files and read ALL of them, even though we only needed a few.
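To make "hive structure" concrete, here is a minimal base R sketch of how partition values encoded in folder names allow files to be selected without opening them. The paths and the `parse_partitions()` helper are hypothetical and only illustrate the idea; the package's real directory names may differ, and DuckDB does this pruning internally.

```r
# Hypothetical hive-style paths (key=value folders); not the package's real layout.
paths <- sprintf("od/year=2024/month=3/day=%d/data.csv.gz", 1:5)

# Parse the key=value path segments of one path into integer partition values.
parse_partitions <- function(path) {
  segs <- strsplit(path, "/", fixed = TRUE)[[1]]
  kv <- segs[grepl("=", segs, fixed = TRUE)]
  keys <- sub("=.*$", "", kv)
  vals <- as.integer(sub("^.*=", "", kv))
  setNames(as.list(vals), keys)
}

# One row of partition values per file.
parts <- do.call(
  rbind,
  lapply(paths, function(p) as.data.frame(parse_partitions(p)))
)

# The date filter can be evaluated on folder names alone,
# so only the matching files ever need to be read.
keep <- parts$year == 2024 & parts$month == 3 & parts$day %in% c(1, 2)
selected <- paths[keep]
selected
```

When the filter sits on top of the recoded view instead, this kind of file-level selection did not happen in our case, and every file was scanned.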

what is the fix

In this PR I rewrote the filtering function. It now does the following:

It takes the db connection and checks the SQL code for the clean view.
It identifies the raw view on which the clean (recoded) view is based.
It creates a quick date filter (based on year, month, and day, as these are the folders we have in the hive-style file structure of the raw CSV data store) directly on the raw CSV view (previously it filtered the recoded clean view).
It then reapplies the recoding and other data improvements that we had in the clean view to the raw-filtered view.
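The steps above can be sketched roughly as follows. This is not the code from `spod_duckdb_filter_by_dates()`; the helper `build_partition_filter()` and the view names `od_csv_raw` and `od_csv_raw_filtered` are hypothetical and only show how a partition-aware WHERE clause on year, month, and day could be built from the requested dates.

```r
# Hypothetical helper: turn requested dates into a WHERE clause that only
# touches the hive partition columns (year, month, day).
build_partition_filter <- function(dates) {
  dates <- as.Date(dates)
  parts <- unique(data.frame(
    year = as.integer(format(dates, "%Y")),
    month = as.integer(format(dates, "%m")),
    day = as.integer(format(dates, "%d"))
  ))
  # One clause per year-month combination, listing the requested days.
  groups <- split(parts, list(parts$year, parts$month), drop = TRUE)
  clauses <- vapply(
    groups,
    function(g) {
      sprintf(
        "(year = %d AND month = %d AND day IN (%s))",
        g$year[1],
        g$month[1],
        paste(sort(g$day), collapse = ", ")
      )
    },
    character(1)
  )
  paste(unname(clauses), collapse = " OR ")
}

where_sql <- build_partition_filter(c("2024-03-01", "2024-03-02"))

# Assumed view names, for illustration only: filter the raw view first,
# then the recoding view would be recreated on top of the filtered raw view.
filtered_raw_sql <- sprintf(
  "CREATE OR REPLACE VIEW %s AS SELECT * FROM %s WHERE %s;",
  "od_csv_raw_filtered",
  "od_csv_raw",
  where_sql
)
cat(filtered_raw_sql, "\n")
```

Because the filter only mentions the partition columns and sits directly on the raw view, DuckDB can turn it into file filters, as the "how it is now" plan further down shows.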

result

Advantage: queries for the dates specified in spod_get() are now many times faster. We are talking literally orders of magnitude, up to x100,000. spod_convert() automatically takes advantage of these speed gains, as internally it relies on spod_get().

Also, we do not need the awkward approach I was thinking of initially. I planned to create temp folders with hive structure and symlink the files for the requested dates into them. That would work (I already tested it locally), but it would be a mess to manage, especially on Windows, where symlinks are basically dysfunctional and cannot be created across different drives/volumes.

details

how it was

# remotes::install_github("rOpenSpain/spanishoddata@HEAD") # current latest dev
library(spanishoddata) # or devtools::load_all() # if you have the branch checked out
library(dplyr)
library(duckdb)
spod_set_data_dir("data") # assumes you already have files in there

spod_available_data(ver = 2, check_local_files = TRUE) |>
  filter(
    type == "origin-destination",
    grepl("district", zones),
    downloaded == TRUE
  ) |>
  tally()
# there are 66 files downloaded in my example, but there may be 1000+ files if files for all available dates are downloaded

# we only want to work with 2 files
od <- spod_get("od", zones = "distr", dates = c("2024-03-01", "2024-03-02"))

# check the query of the table view we got from spod_get
od |> show_query()

# extract the connection from the tbl object
con <- od$src$con

# rerun the simple query of the filter table with limit of 10 and duckdb profiling function
dbGetQuery(con, "EXPLAIN ANALYZE SELECT * FROM od_csv_clean_filtered LIMIT 10")

And that took a lot of time... With the 66 files I have cached, the fix is already a speed-up of about x6000 (342.75 s before vs 0.0571 s after). On the full dataset, it would be literally up to x100,000.

analyzed_plan
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM od_csv_clean_filtered LIMIT 10
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 342.75s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            date           │
│            hour           │
│         id_origin         │
│       id_destination      │
│          distance         │
│      activity_origin      │
│    activity_destination   │
│   study_possible_origin   │
│ study_possible_destination│
│residence_province_ine_code│
│  residence_province_name  │
│           income          │
│            age            │
│            sex            │
│          n_trips          │
│   trips_total_length_km   │
│            year           │
│           month           │
│            day            │
│         time_slot         │
│                           │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           LIMIT           │
│    ────────────────────   │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ (((CAST(day AS INTEGER) = │
│     1) OR (CAST(day AS    │
│  INTEGER) = 2)) AND (CAST │
│ (year AS INTEGER) = 2024) │
│     AND (CAST(month AS    │
│       INTEGER) = 3))      │
│                           │
│         4096 Rows         │
│          (8.10s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│         Function:         │
│       READ_CSV_AUTO       │
│                           │
│        Projections:       │
│           fecha           │
│          periodo          │
│           origen          │
│          destino          │
│         distancia         │
│      actividad_origen     │
│     actividad_destino     │
│   estudio_origen_posible  │
│  estudio_destino_posible  │
│         residencia        │
│           renta           │
│            edad           │
│            sexo           │
│           viajes          │
│         viajes_km         │
│            day            │
│           month           │
│            year           │
│                           │
│    Total Files Read: 66   │
│                           │
│      1243829436 Rows      │
│         (2102.69s)        │
└───────────────────────────┘

how it is now

analyzed_plan
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM od_csv_clean_filtered LIMIT 10
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.0571s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            date           │
│            hour           │
│         id_origin         │
│       id_destination      │
│          distance         │
│      activity_origin      │
│    activity_destination   │
│   study_possible_origin   │
│ study_possible_destination│
│residence_province_ine_code│
│  residence_province_name  │
│           income          │
│            age            │
│            sex            │
│          n_trips          │
│   trips_total_length_km   │
│            year           │
│           month           │
│            day            │
│         time_slot         │
│                           │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      STREAMING_LIMIT      │
│    ────────────────────   │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│         Function:         │
│       READ_CSV_AUTO       │
│                           │
│        Projections:       │
│           fecha           │
│          periodo          │
│           origen          │
│          destino          │
│         distancia         │
│      actividad_origen     │
│     actividad_destino     │
│   estudio_origen_posible  │
│  estudio_destino_posible  │
│         residencia        │
│           renta           │
│            edad           │
│            sexo           │
│           viajes          │
│         viajes_km         │
│            day            │
│           month           │
│            year           │
│                           │
│       File Filters:       │
│  (year = 2024)(month = 3) │
│      (day IN (1, 2))      │
│                           │
│    Scanning Files: 2/66   │
│                           │
│         4096 Rows         │
│          (0.00s)          │
└───────────────────────────┘
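As a quick sanity check of the figures in the two plans above, the arithmetic can be reproduced in a few lines of R. The timings and file counts are copied from the plans; the projection to the full dataset is only an assumption (old runtime growing linearly with the number of files, new runtime staying constant, and 1000 files standing in for the "1000+" mentioned earlier).

```r
# Values copied from the two EXPLAIN ANALYZE outputs above.
old_seconds <- 342.75 # Total Time, filter on top of the recoded view
new_seconds <- 0.0571 # Total Time, filter directly on the raw view
speedup <- old_seconds / new_seconds

files_total <- 66
files_scanned <- 2
share_scanned <- files_scanned / files_total

# Assumption, not a measurement: old runtime scales linearly with file count
# and the new runtime does not change.
assumed_full_files <- 1000
projected_speedup <- old_seconds * (assumed_full_files / files_total) / new_seconds

round(
  c(
    speedup = speedup,
    share_scanned = share_scanned,
    projected_speedup = projected_speedup
  ),
  3
)
```

Under these assumptions the measured speed-up is about x6000 and the projection lands near x90,000, which is consistent with the "up to x100,000" wording.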


Copilot

Pull Request Overview

This PR introduces a smarter date‐filtering strategy by pushing filters down to the raw CSV views for big performance gains, and standardizes view creation with CREATE OR REPLACE VIEW.

Switched all CREATE VIEW statements in SQL to CREATE OR REPLACE VIEW
Refactored spod_duckdb_filter_by_dates() to dynamically identify raw views, build a partition‐aware WHERE clause, and recreate filtered views
Updated R code style and named arguments (con =, source_view_name =, new_view_name =, dates =) and improved metadata factor columns
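A toy illustration of the two ideas in this overview, assuming the DBI and duckdb packages are installed. All table and view names here (`toy_raw`, `toy_raw_filtered`, `toy_clean`) are made up, and reading the view definition through `duckdb_views()` is just one way to inspect a view's SQL; whether the package does it this way is not stated in the PR.

```r
library(DBI)
library(duckdb)

con <- dbConnect(duckdb()) # in-memory database

# Toy stand-in for the raw CSV view, with hive-like partition columns.
dbExecute(
  con,
  "CREATE TABLE toy_raw AS
   SELECT * FROM (VALUES (2024, 3, 1, 10), (2024, 3, 2, 20), (2024, 3, 3, 30))
   AS t(year, month, day, viajes)"
)

# The 'clean' view recodes a Spanish column name; its source is a parameter.
make_clean_view <- function(con, source) {
  dbExecute(
    con,
    sprintf(
      "CREATE OR REPLACE VIEW toy_clean AS
       SELECT year, month, day, viajes AS n_trips FROM %s",
      source
    )
  )
}

make_clean_view(con, "toy_raw")
make_clean_view(con, "toy_raw") # rerunning replaces the view instead of failing

# Filter on the raw side first, then rebuild the clean view on top of it.
dbExecute(
  con,
  "CREATE OR REPLACE VIEW toy_raw_filtered AS
   SELECT * FROM toy_raw WHERE year = 2024 AND month = 3 AND day IN (1, 2)"
)
make_clean_view(con, "toy_raw_filtered")

# Inspect the stored definition of the clean view and count its rows.
view_sql <- dbGetQuery(
  con,
  "SELECT sql FROM duckdb_views() WHERE view_name = 'toy_clean'"
)$sql
n_rows <- dbGetQuery(con, "SELECT count(*) AS n FROM toy_clean")$n

dbDisconnect(con, shutdown = TRUE)
```

With `CREATE OR REPLACE VIEW`, the same definition can be re-issued against a different source view, which is what makes the "filter raw, then recreate clean" order practical.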

Reviewed Changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.

Show a summary per file

File	Description
inst/extdata/sql-queries/*.sql	Changed `CREATE VIEW` to `CREATE OR REPLACE VIEW` for idempotent view definitions
R/get.R	Updated `spod_duckdb_filter_by_dates()` call to use named args
R/duckdb-helpers.R	Complete rewrite of `spod_duckdb_filter_by_dates()` and related helper functions, plus style cleanup
R/connect.R	Switched view creation in `spod_connect()` to `CREATE OR REPLACE VIEW`
R/available-data.R	Wrapped `case_when()` outputs in `factor()` with explicit levels
NEWS.md	Documented up to 100,000× speed improvements

Comments suppressed due to low confidence (1)

R/get.R:200

After renaming the arguments in the call to spod_duckdb_filter_by_dates (con, source_view_name, new_view_name, dates), update the function's Roxygen documentation and any examples to match these new parameter names for consistency.

con = con,


Robinlovelace · 2025-06-14T22:49:29Z

~100,000x speed-ups are vanishingly rare in any walk of life, except software development where I've seen a few such astonishing performance improvements thanks to clever code. Seems like this is a no-brainer.


e-kotov added 9 commits June 14, 2025 17:34

apply air linting to duckdb helpers file

5e3599d

update all sql files to create or replace views instead of simply creating them

434fe58

create or replace view in spod_connect

5d60d61

explicit arguments in call to spod_duckdb_filter_by_dates in spod_get

c16f22f

rewrite spod_duckdb_filter_by_dates to apply filter to the raw data view and then recreate clean recoded table on top of it for huge speed improvement

78cd933

plural in categories in available data

ef1597b

available data format improvements with factors

3085aa1

update docs for available data returned table

70d397a

update news

195a7a6

e-kotov marked this pull request as ready for review June 14, 2025 17:43

e-kotov requested review from Robinlovelace and Copilot June 14, 2025 17:43

Copilot AI reviewed Jun 14, 2025


add news about new col in metdata

363cf8b

Robinlovelace approved these changes Jun 14, 2025


e-kotov merged commit c6baf73 into main Jun 14, 2025
5 checks passed

e-kotov deleted the duckdb-date-filter-before-recode branch June 14, 2025 23:06

e-kotov mentioned this pull request Jun 14, 2025

spod_convert is slow when there are more cached files than requested #159

Closed
