Defer file-pattern source scan and use `sel` for predicate pushdown


A very useful feature would be the ability to open a reader that accesses many (millions or more) files lazily and only scans/reads them when needed, typically after some filtering has been applied. One way to achieve this is to leverage file naming patterns, since they often encode values for the same keys that are used for filtering in `sel()`. The `sel()` method could then be modified to implement so-called predicate pushdown (simply put: apply any filters you have as close to the data source as possible). During `sel()`, if a filtering key is found in the file pattern, the selection for that key is "pushed" down to the file selection so that only relevant files are scanned.
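To make the "pushing" step concrete, here is a rough sketch of how the keys passed to `sel()` could be split into those that appear in the file pattern and those that still need to be applied to the scanned data. It is plain Python and does not use any earthkit-data internals; `split_selection` is a hypothetical helper name, and I am assuming the pattern's placeholders can be read with `string.Formatter`, which may not match how the file-pattern source actually parses them.

```python
from string import Formatter


def split_selection(pattern, selection):
    """Split sel() keys into (pushed, residual).

    pushed: keys that appear as placeholders in the file pattern and can
            therefore be used to pick files before anything is scanned.
    residual: all other keys, to be applied to the data after scanning.
    """
    # Formatter().parse yields (literal, field_name, format_spec, conversion);
    # field_name is None for trailing literal text.
    pattern_keys = {field for _, field, _, _ in Formatter().parse(pattern) if field}
    pushed = {k: v for k, v in selection.items() if k in pattern_keys}
    residual = {k: v for k, v in selection.items() if k not in pattern_keys}
    return pushed, residual


pushed, residual = split_selection(
    "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib",
    {"date": "2020-05-02", "step": 12, "param": "2t"},
)
```

Here `date` and `step` would be handled at the file level, while `param` would still be applied to the fields inside the selected files.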

A similar approach is implemented by libraries like `polars` (or DuckDB) for Hive partitioning, e.g. https://pola.rs/posts/predicate-pushdown-query-optimizer/, and it is quite powerful.

In earthkit-data, this would look something like:
```python
import datetime

import earthkit.data

fds = earthkit.data.from_source(
    "file-pattern",
    "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib",
)  # proposed behaviour: nothing is scanned yet, these are potentially millions of files

# Scanning happens now. Filtering keys that are part of the file pattern are checked first,
# so only files matching them are used; the remaining filtering keys are then applied as usual.
fds_sel = fds.sel(date=datetime.datetime(2020, 5, 2), step=12, param="2t")
```
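The file-selection part could be sketched roughly as below: known keys are substituted into the pattern and any key missing from the selection becomes a wildcard, so only the resulting paths would ever need to be opened. This is only an illustration in plain Python, not how earthkit-data does it. `render_value` and `to_glob` are hypothetical names, and the handling of the `date(...)` format spec (and of `step` as a plain string) is my assumption about how the pattern is meant to be rendered.

```python
import datetime
import glob
import re
from string import Formatter


def render_value(value, spec):
    """Render one selection value according to the placeholder's format spec."""
    match = re.fullmatch(r"date\((.*)\)", spec or "")
    if match:  # assumed meaning of e.g. "date(%Y-%m-%d)"
        return value.strftime(match.group(1))
    return str(value)


def to_glob(pattern, selection):
    """Substitute selected keys into the pattern; unselected keys become '*'."""
    parts = []
    for literal, field, spec, _ in Formatter().parse(pattern):
        parts.append(glob.escape(literal))  # keep literal text from acting as a wildcard
        if field is None:
            continue  # trailing literal text, no placeholder
        if field in selection:
            parts.append(render_value(selection[field], spec))
        else:
            parts.append("*")
    return "".join(parts)


pattern = "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib"
selection = {"date": datetime.datetime(2020, 5, 2), "step": 12, "param": "2t"}

# Keys not present in the pattern (here "param") are simply ignored at this stage.
print(to_glob(pattern, selection))     # path/to/data-2020-05-02-12.grib
print(to_glob(pattern, {"step": 12}))  # path/to/data-*-12.grib
```

When every key in the pattern is selected, the result is a single concrete path and no directory listing is needed at all; otherwise the expression can be passed to something like `glob.iglob` so that only matching files are considered.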

Would this be relatively straightforward to implement? Fundamentally the logic is very simple, but I don't know the codebase very well, so it would be great to hear your thoughts and maybe get some pointers for getting started on this.


Defer file-pattern source scan and use `sel` for predicate pushdown #637
