Conversation
…iew and then recreate clean recoded table on top of it for huge speed improvement
Contributor
Pull Request Overview
This PR introduces a smarter date‐filtering strategy by pushing filters down to the raw CSV views for big performance gains, and standardizes view creation with CREATE OR REPLACE VIEW.
- Switched all `CREATE VIEW` statements in SQL to `CREATE OR REPLACE VIEW`
- Refactored `spod_duckdb_filter_by_dates()` to dynamically identify raw views, build a partition-aware WHERE clause, and recreate filtered views
- Updated R code style and named arguments (`con =`, `source_view_name =`, `new_view_name =`, `dates =`) and improved metadata factor columns
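The first bullet matters because plain `CREATE VIEW` errors if the view already exists, so re-running the setup code against a cached database would fail. A minimal sketch of the difference (hypothetical view name, not the package's actual one):

```r
# CREATE OR REPLACE VIEW makes the view definition idempotent:
# running the same setup code twice is safe.
con <- DBI::dbConnect(duckdb::duckdb())
DBI::dbExecute(con, "CREATE OR REPLACE VIEW v AS SELECT 1 AS x")
DBI::dbExecute(con, "CREATE OR REPLACE VIEW v AS SELECT 2 AS x")  # no error
DBI::dbDisconnect(con, shutdown = TRUE)
```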
Reviewed Changes
Copilot reviewed 47 out of 47 changed files in this pull request and generated 2 comments.
| File | Description |
|---|---|
| inst/extdata/sql-queries/*.sql | Changed CREATE VIEW to CREATE OR REPLACE VIEW for idempotent view definitions |
| R/get.R | Updated spod_duckdb_filter_by_dates() call to use named args |
| R/duckdb-helpers.R | Complete rewrite of spod_duckdb_filter_by_dates() and related helper functions, plus style cleanup |
| R/connect.R | Switched view creation in spod_connect() to CREATE OR REPLACE VIEW |
| R/available-data.R | Wrapped case_when() outputs in factor() with explicit levels |
| NEWS.md | Documented up to 100,000× speed improvements |
Comments suppressed due to low confidence (1)
R/get.R:200
- After renaming the arguments in the call to `spod_duckdb_filter_by_dates(con, source_view_name, new_view_name, dates)`, update the function's Roxygen documentation and any examples to match the new parameter names for consistency.

  `con = con,`
Collaborator
~100,000× speed-ups are vanishingly rare in any walk of life, except software development, where I've seen a few such astonishing performance improvements thanks to clever code. Seems like this is a no-brainer.
Robinlovelace approved these changes on Jun 14, 2025.
what was the problem
`spod_get()` was super slow if you had all (or just many of) the files downloaded but only requested a few dates. This was because of the sequence in which the DuckDB queries were executed; originally described in #159. We first imported the data into a raw CSV table view, then created a new view that recoded the data from Spanish to English and applied other improvements, such as adding factors/ENUMs and some extra columns to make the data more usable. Only on top of that was the filtering view created. That was VERY slow: for whatever reason, DuckDB in this case ignored the hive structure of the files and read ALL of them, even though we only needed a few.
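Schematically, the old view chain looked something like this (a simplified, hypothetical reconstruction; view names and columns are illustrative, not the package's actual ones):

```r
# raw CSV view over the hive-partitioned files
DBI::dbExecute(con, "
  CREATE OR REPLACE VIEW raw AS
  SELECT * FROM read_csv_auto(
    'data/year=*/month=*/day=*/*.csv.gz',
    hive_partitioning = true)")

# recoding layer built on top of the raw view
DBI::dbExecute(con, "
  CREATE OR REPLACE VIEW recoded AS
  SELECT *, make_date(year, month, day) AS full_date FROM raw")

# The date filter was applied last, on top of the recoding layer.
# DuckDB did not prune the hive partitions through that layer,
# so it read ALL files even for a single requested date.
DBI::dbExecute(con, "
  CREATE OR REPLACE VIEW filtered AS
  SELECT * FROM recoded WHERE full_date = DATE '2022-01-01'")
```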
what is the fix
In this PR I rewrote the filtering function: it now identifies the raw CSV views, builds a partition-aware WHERE clause from the requested dates, and recreates the filtered views directly on top of the raw data, so the recoding layer sits on top of an already-filtered view.
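The core idea can be sketched roughly as follows (the function and argument names follow the PR description, but the body is my simplified reconstruction, not the actual package code):

```r
# Sketch: filter the *raw* view on the hive partition columns first,
# so DuckDB can prune files, then build recoding on top of the result.
spod_duckdb_filter_by_dates_sketch <- function(
  con,
  source_view_name,
  new_view_name,
  dates  # a vector of Date objects
) {
  # build a partition-aware WHERE clause from the requested dates
  where_clause <- paste0(
    "(year = ", format(dates, "%Y"),
    " AND month = ", as.integer(format(dates, "%m")),
    " AND day = ", as.integer(format(dates, "%d")), ")",
    collapse = " OR "
  )
  DBI::dbExecute(con, paste0(
    "CREATE OR REPLACE VIEW ", new_view_name, " AS ",
    "SELECT * FROM ", source_view_name,
    " WHERE ", where_clause
  ))
}
```

Because the predicate references the hive partition columns (`year`, `month`, `day`) directly on the raw view, DuckDB can skip every file outside the requested dates instead of scanning the whole dataset.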
result
Advantage: the queries for dates specified in `spod_get()` are now many times faster; we are talking literally orders-of-magnitude improvements, up to ×100,000. `spod_convert()` automatically takes advantage of these speed gains, as internally it relies on `spod_get()`. Also, we no longer need the awkward approach I was initially considering: creating temp folders with a hive structure and symlinking the files for the requested dates into them. That would work (I already tested it locally), but it would be a mess to manage, especially on Windows, where symlinks are basically dysfunctional and cannot be created across different drives/volumes.
details
how it was
And that took a lot of time... On the 66 files I have cached, it is already a speed-up of ×6,000; on the full dataset it would be literally up to ×100,000.
how it is now