Description
A very useful feature would be the ability to open a reader that accesses many files (millions or more) lazily, scanning/reading them only when needed, typically after some filtering has been applied. One way to achieve this is to leverage file naming patterns, since they often encode the same keys used for filtering in `sel()`. The `sel()` method could then implement so-called predicate pushdown (simply put: apply any filters you have as close to the data source as possible). During `sel()`, if a filtering key is found in the file pattern, the selection for that key is "pushed" down to the file selection so that only the relevant files are scanned.
A similar mechanism is implemented for Hive partitioning by libraries such as polars (or DuckDB), e.g. https://pola.rs/posts/predicate-pushdown-query-optimizer/, and it is quite powerful.
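The core idea can be sketched in a few lines of plain Python. The helpers below (`render_pattern`, `pushdown_filter`) are hypothetical names, not part of earthkit-data: they substitute known filter values into a `{key}`-style pattern to build a glob, then keep only the file paths matching that glob, so unmatched files are never opened.

```python
import datetime
import fnmatch
import re


def render_pattern(pattern, **values):
    """Substitute known filter values into a '{key}' file pattern,
    replacing any keys without a value by '*' so the result is a glob.

    A '{key:date(FMT)}' spec formats date/datetime values with
    strftime(FMT); other values are converted with str(). This is a
    simplified, hypothetical helper, not earthkit-data's actual parser.
    """
    def sub(m):
        key, spec = m.group(1), m.group(2)
        if key not in values:
            return "*"
        v = values[key]
        if spec and spec.startswith("date(") and isinstance(v, datetime.date):
            return v.strftime(spec[5:-1])  # strip the 'date(' ... ')' wrapper
        return str(v)

    return re.sub(r"\{(\w+)(?::([^}]*))?\}", sub, pattern)


def pushdown_filter(paths, pattern, **filters):
    """Predicate pushdown on file names: keep only the paths matching
    the glob obtained by substituting the filter values into the pattern."""
    glob = render_pattern(pattern, **filters)
    return [p for p in paths if fnmatch.fnmatch(p, glob)]


paths = [
    "path/to/data-2020-05-02-12.grib",
    "path/to/data-2020-05-02-24.grib",
    "path/to/data-2020-05-03-12.grib",
]
selected = pushdown_filter(
    paths,
    "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib",
    date=datetime.datetime(2020, 5, 2),
    step=12,
)
print(selected)  # only the single matching file would be scanned
```

Filter keys that do not appear in the pattern simply leave a `*` in the glob, so they fall through to the usual per-file filtering after scanning.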
In earthkit-data, this would look something like:
```python
fds = earthkit.data.from_source(
    "file-pattern",
    "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib",
)  # nothing is scanned yet; these are potentially millions of files

# Scanning happens now: first check whether some of the filtering keys are
# part of the file pattern, in which case only files matching the filter are
# used; then the remaining filtering keys are applied.
fds_sel = fds.sel(date=datetime.datetime(2020, 5, 2), step=12, param="2t")
```

Would this be relatively straightforward to implement? Fundamentally the logic is very simple, but I don't know the codebase well, so it would be great to hear your thoughts and maybe get some pointers for starting work on this.
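The first step of such a `sel()` would be deciding which filters can be pushed down at all. A minimal sketch, assuming a hypothetical `split_filters` helper (not part of earthkit-data): keys that appear in the file pattern are pushdown candidates, while the rest must be applied after the matching files have been scanned.

```python
import re


def split_filters(pattern, filters):
    """Split sel()-style filters into those whose keys appear in the
    file pattern (candidates for pushdown to file selection) and those
    that can only be applied after scanning the files' contents.

    Hypothetical helper illustrating the proposed logic."""
    pattern_keys = set(re.findall(r"\{(\w+)(?::[^}]*)?\}", pattern))
    pushdown = {k: v for k, v in filters.items() if k in pattern_keys}
    post_scan = {k: v for k, v in filters.items() if k not in pattern_keys}
    return pushdown, post_scan


pushdown, post_scan = split_filters(
    "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib",
    {"date": "2020-05-02", "step": 12, "param": "2t"},
)
# 'date' and 'step' are encoded in the file names, so they can restrict
# which files are opened; 'param' only exists inside the GRIB messages,
# so it is applied after the matching files are scanned.
```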