duckdb date filter before recode results in up to 100,000x speed up in spod_get and spod_convert #166

Merged
e-kotov merged 10 commits into main from duckdb-date-filter-before-recode
Jun 14, 2025

@e-kotov
Member

@e-kotov e-kotov commented Jun 14, 2025

what was the problem

spod_get() was very slow if you had all (or just many) of the files downloaded but requested only a few dates. The cause was the order in which the DuckDB queries were applied. Originally described in #159.

We first imported the data into a raw CSV view, then created a new view that recoded the data from Spanish to English and applied other improvements, such as adding factors/ENUMs and some extra columns to make the data more usable. Only on top of that was the filtering view created. That was VERY slow because, for whatever reason, DuckDB in this case ignored the hive structure of the files and read ALL of them, even though we only needed a few.
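The old layering can be reproduced on toy data with a sketch like the one below. The view names od_csv_raw and od_csv_clean, the folder layout, and the three renamed columns are assumptions made for illustration; only od_csv_clean_filtered appears in this PR. The real clean view is far more involved, so this toy version may well be pruned correctly by DuckDB; the point is only to show where the filter sat and how to inspect the plan.

```r
library(DBI)

# Toy hive-style store (assumed layout): <root>/year=2024/month=3/day=<d>/data.csv
root <- file.path(tempdir(), "toy_od_store")
for (d in 1:3) {
  day_dir <- file.path(root, "year=2024", "month=3", sprintf("day=%d", d))
  dir.create(day_dir, recursive = TRUE, showWarnings = FALSE)
  write.csv(
    data.frame(origen = "A", destino = "B", viajes = d * 10),
    file.path(day_dir, "data.csv"),
    row.names = FALSE
  )
}
glob <- paste0(normalizePath(root, winslash = "/"), "/*/*/*/*.csv")

con <- dbConnect(duckdb::duckdb())

# 1. raw view directly over the CSV files; year/month/day come from the folder names
dbExecute(con, sprintf(
  "CREATE OR REPLACE VIEW od_csv_raw AS
   SELECT * FROM read_csv('%s', hive_partitioning = true)",
  glob
))

# 2. clean view that recodes the raw view (here: just renames three columns)
dbExecute(con,
  "CREATE OR REPLACE VIEW od_csv_clean AS
   SELECT origen AS id_origin, destino AS id_destination, viajes AS n_trips,
          year, month, day
   FROM od_csv_raw")

# 3. old approach: the date filter sits on top of the clean view
dbExecute(con,
  "CREATE OR REPLACE VIEW od_csv_clean_filtered AS
   SELECT * FROM od_csv_clean
   WHERE year = 2024 AND month = 3 AND day IN (1, 2)")

res <- dbGetQuery(con, "SELECT day, n_trips FROM od_csv_clean_filtered ORDER BY day")

# The plan shows whether the scan lists file filters or reads every file
plan <- dbGetQuery(con, "EXPLAIN SELECT * FROM od_csv_clean_filtered")

dbDisconnect(con, shutdown = TRUE)
```

Printing plan (or running EXPLAIN ANALYZE, as in the details further down) is the quickest way to check how many files the scan actually touches.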

what is the fix

In this PR I rewrote the filtering function. It now does the following:

  1. It takes the DB connection and reads the SQL definition of the clean view.
  2. It identifies the raw view on which the clean (recoded) view is based.
  3. It creates a quick date filter (based on year, month, and day, as these are the folders in the hive-style file structure of the raw CSV data store) directly on the raw CSV view (previously it filtered the recoded clean view).
  4. It then reapplies the recoding and other data improvements from the clean view to the raw-filtered view.
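The four steps can be sketched in base R as plain string handling. This is a simplified illustration, not the code in R/duckdb-helpers.R: the helpers find_source_view() and build_partition_filter(), the view names od_csv_raw and od_csv_clean, and the one-line clean view definition are all hypothetical. In DuckDB, a view definition can be read from the sql column of duckdb_views(); whether the package obtains it that way is an assumption.

```r
# Hypothetical helper (step 2): find the view that a view definition selects from.
find_source_view <- function(view_sql) {
  m <- regmatches(
    view_sql,
    regexec("(?i)\\bFROM\\s+([A-Za-z_][A-Za-z0-9_]*)", view_sql, perl = TRUE)
  )[[1]]
  if (length(m) < 2) stop("could not find a source view in the SQL")
  m[2]
}

# Hypothetical helper (step 3): turn dates into a WHERE clause on the
# hive partition columns year, month and day.
build_partition_filter <- function(dates) {
  dates <- sort(unique(as.Date(dates)))
  parts <- data.frame(
    year = as.integer(format(dates, "%Y")),
    month = as.integer(format(dates, "%m")),
    day = as.integer(format(dates, "%d"))
  )
  # one condition per (year, month) pair, listing the requested days
  groups <- split(parts, list(parts$year, parts$month), drop = TRUE)
  conditions <- vapply(groups, function(g) {
    sprintf(
      "(year = %d AND month = %d AND day IN (%s))",
      g$year[1], g$month[1], paste(g$day, collapse = ", ")
    )
  }, character(1))
  paste(unname(conditions), collapse = " OR ")
}

# Step 1 (assumed input): a toy definition of the clean view
clean_sql <- "CREATE VIEW od_csv_clean AS SELECT viajes AS n_trips, year, month, day FROM od_csv_raw"

raw_view <- find_source_view(clean_sql)
where_clause <- build_partition_filter(c("2024-03-01", "2024-03-02"))

# Step 3: filter the raw view directly on the partition columns
raw_filtered_sql <- sprintf(
  "CREATE OR REPLACE VIEW %s_filtered AS SELECT * FROM %s WHERE %s",
  raw_view, raw_view, where_clause
)

# Step 4: reuse the clean definition, pointed at the filtered raw view
clean_filtered_sql <- sub(
  "CREATE VIEW od_csv_clean", "CREATE OR REPLACE VIEW od_csv_clean_filtered",
  clean_sql, fixed = TRUE
)
clean_filtered_sql <- sub(
  paste0("FROM ", raw_view), paste0("FROM ", raw_view, "_filtered"),
  clean_filtered_sql, fixed = TRUE
)
```

Because the filter only touches year, month and day as plain columns of the raw view, DuckDB can match it against the folder names before reading any file contents.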

result

Advantage: queries for the dates specified in spod_get() are now many times faster. We are talking about literally orders of magnitude of improvement, up to 100,000x. spod_convert() automatically benefits from these speed gains, as internally it relies on spod_get().

Also, we do not need the awkward approach I was initially considering. I planned to create temporary folders with a hive structure and symlink the files for the requested dates into them. That would work (I already tested it locally), but it would be a mess to manage, especially on Windows, where symlinks are basically dysfunctional and cannot be created across different drives/volumes.

details

how it was

# remotes::install_github("rOpenSpain/spanishoddata@HEAD") # current latest dev
library(spanishoddata) # or devtools::load_all() # if you have the branch checked out
library(dplyr)
library(duckdb)
spod_set_data_dir("data") # assumes you already have files in there

spod_available_data(ver = 2, check_local_files = TRUE) |>
  filter(
    type == "origin-destination",
    grepl("district", zones),
    downloaded == TRUE
  ) |>
  tally()
# there are 66 files downloaded in my example, but there may be 1000+ files if files for all available dates are downloaded

# we only want to work with 2 files
od <- spod_get("od", zones = "distr", dates = c("2024-03-01", "2024-03-02"))

# check the query of the table view we got from spod_get
od |> show_query()

# extract the connection from the tbl object
con <- od$src$con

# rerun the simple query of the filter table with limit of 10 and duckdb profiling function
dbGetQuery(con, "EXPLAIN ANALYZE SELECT * FROM od_csv_clean_filtered LIMIT 10")

And that took a lot of time: 342.75 s in total, versus 0.0571 s after the fix. On the 66 files I have cached that is already a speed-up of about 6,000x; on the full dataset it would be literally up to 100,000x.

analyzed_plan
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM od_csv_clean_filtered LIMIT 10
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 342.75s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            date           │
│            hour           │
│         id_origin         │
│       id_destination      │
│          distance         │
│      activity_origin      │
│    activity_destination   │
│   study_possible_origin   │
│ study_possible_destination│
│residence_province_ine_code│
│  residence_province_name  │
│           income          │
│            age            │
│            sex            │
│          n_trips          │
│   trips_total_length_km   │
│            year           │
│           month           │
│            day            │
│         time_slot         │
│                           │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           LIMIT           │
│    ────────────────────   │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│           FILTER          │
│    ────────────────────   │
│ (((CAST(day AS INTEGER) = │
│     1) OR (CAST(day AS    │
│  INTEGER) = 2)) AND (CAST │
│ (year AS INTEGER) = 2024) │
│     AND (CAST(month AS    │
│       INTEGER) = 3))      │
│                           │
│         4096 Rows         │
│          (8.10s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│         Function:         │
│       READ_CSV_AUTO       │
│                           │
│        Projections:       │
│           fecha           │
│          periodo          │
│           origen          │
│          destino          │
│         distancia         │
│      actividad_origen     │
│     actividad_destino     │
│   estudio_origen_posible  │
│  estudio_destino_posible  │
│         residencia        │
│           renta           │
│            edad           │
│            sexo           │
│           viajes          │
│         viajes_km         │
│            day            │
│           month           │
│            year           │
│                           │
│    Total Files Read: 66   │
│                           │
│      1243829436 Rows      │
│         (2102.69s)        │
└───────────────────────────┘

how it is now

analyzed_plan
┌─────────────────────────────────────┐
│┌───────────────────────────────────┐│
││    Query Profiling Information    ││
│└───────────────────────────────────┘│
└─────────────────────────────────────┘
EXPLAIN ANALYZE SELECT * FROM od_csv_clean_filtered LIMIT 10
┌────────────────────────────────────────────────┐
│┌──────────────────────────────────────────────┐│
││              Total Time: 0.0571s             ││
│└──────────────────────────────────────────────┘│
└────────────────────────────────────────────────┘
┌───────────────────────────┐
│           QUERY           │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      EXPLAIN_ANALYZE      │
│    ────────────────────   │
│           0 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         PROJECTION        │
│    ────────────────────   │
│            date           │
│            hour           │
│         id_origin         │
│       id_destination      │
│          distance         │
│      activity_origin      │
│    activity_destination   │
│   study_possible_origin   │
│ study_possible_destination│
│residence_province_ine_code│
│  residence_province_name  │
│           income          │
│            age            │
│            sex            │
│          n_trips          │
│   trips_total_length_km   │
│            year           │
│           month           │
│            day            │
│         time_slot         │
│                           │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│      STREAMING_LIMIT      │
│    ────────────────────   │
│          10 Rows          │
│          (0.00s)          │
└─────────────┬─────────────┘
┌─────────────┴─────────────┐
│         TABLE_SCAN        │
│    ────────────────────   │
│         Function:         │
│       READ_CSV_AUTO       │
│                           │
│        Projections:       │
│           fecha           │
│          periodo          │
│           origen          │
│          destino          │
│         distancia         │
│      actividad_origen     │
│     actividad_destino     │
│   estudio_origen_posible  │
│  estudio_destino_posible  │
│         residencia        │
│           renta           │
│            edad           │
│            sexo           │
│           viajes          │
│         viajes_km         │
│            day            │
│           month           │
│            year           │
│                           │
│       File Filters:       │
│  (year = 2024)(month = 3) │
│      (day IN (1, 2))      │
│                           │
│    Scanning Files: 2/66   │
│                           │
│         4096 Rows         │
│          (0.00s)          │
└───────────────────────────┘
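
The headline figures can be recomputed from the two plans above. A small sketch, using only the numbers printed in the profiling output:

```r
# "Total Time" reported by EXPLAIN ANALYZE, in seconds
time_before <- 342.75
time_after <- 0.0571
speedup <- time_before / time_after
round(speedup)  # 6003

# Files touched by the table scan
files_before <- 66
files_after <- 2
files_after / files_before  # about 3% of the cached files are now scanned

# Rows produced by the table scan before the LIMIT is applied
rows_before <- 1243829436
rows_after <- 4096
rows_before / rows_after  # the scan now yields several hundred thousand times fewer rows
```

The speed-up grows with the number of cached files, because the old plan's cost scaled with everything on disk while the new one scales only with the requested dates.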

@e-kotov e-kotov marked this pull request as ready for review June 14, 2025 17:43
@e-kotov e-kotov requested review from Robinlovelace and Copilot June 14, 2025 17:43
Contributor

Copilot AI left a comment

Pull Request Overview

This PR introduces a smarter date‐filtering strategy by pushing filters down to the raw CSV views for big performance gains, and standardizes view creation with CREATE OR REPLACE VIEW.

  • Switched all CREATE VIEW statements in SQL to CREATE OR REPLACE VIEW
  • Refactored spod_duckdb_filter_by_dates() to dynamically identify raw views, build a partition‐aware WHERE clause, and recreate filtered views
  • Updated R code style and named arguments (con =, source_view_name =, new_view_name =, dates =) and improved metadata factor columns
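
Why CREATE OR REPLACE VIEW matters for re-runs can be seen in a minimal sketch, assuming the DBI and duckdb packages are installed; the view name v is a placeholder:

```r
library(DBI)

con <- dbConnect(duckdb::duckdb())

dbExecute(con, "CREATE VIEW v AS SELECT 1 AS x")

# A second plain CREATE VIEW with the same name fails because v already exists
second_try <- tryCatch(
  {
    dbExecute(con, "CREATE VIEW v AS SELECT 2 AS x")
    "ok"
  },
  error = function(e) "error"
)

# CREATE OR REPLACE VIEW succeeds and swaps in the new definition
dbExecute(con, "CREATE OR REPLACE VIEW v AS SELECT 2 AS x")
x <- dbGetQuery(con, "SELECT x FROM v")$x

dbDisconnect(con, shutdown = TRUE)
```

This is what lets the filtering function rebuild its views on an existing connection without first checking for, or dropping, earlier versions.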

Reviewed Changes

Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.

| File | Description |
| --- | --- |
| inst/extdata/sql-queries/*.sql | Changed CREATE VIEW to CREATE OR REPLACE VIEW for idempotent view definitions |
| R/get.R | Updated spod_duckdb_filter_by_dates() call to use named args |
| R/duckdb-helpers.R | Complete rewrite of spod_duckdb_filter_by_dates() and related helper functions, plus style cleanup |
| R/connect.R | Switched view creation in spod_connect() to CREATE OR REPLACE VIEW |
| R/available-data.R | Wrapped case_when() outputs in factor() with explicit levels |
| NEWS.md | Documented up to 100,000× speed improvements |

Comments suppressed due to low confidence (1)

R/get.R:200

  • After renaming the arguments in the call to spod_duckdb_filter_by_dates (con, source_view_name, new_view_name, dates), update the function's Roxygen documentation and any examples to match these new parameter names for consistency.
con = con,

@Robinlovelace
Collaborator

~100,000x speed-ups are vanishingly rare in any walk of life, except software development where I've seen a few such astonishing performance improvements thanks to clever code. Seems like this is a no-brainer.

@e-kotov e-kotov merged commit c6baf73 into main Jun 14, 2025
5 checks passed
@e-kotov e-kotov deleted the duckdb-date-filter-before-recode branch June 14, 2025 23:06