
Defer file-pattern source scan and use sel for predicate pushdown #637

@frazane


A very useful feature would be the ability to open a reader that accesses many (millions or more) files lazily and only scans/reads them when needed, typically after some filtering has been applied. One way to achieve this is to exploit file naming patterns, which often encode the values of the same keys that are used for filtering in sel(). The sel() method could then be modified to implement so-called predicate pushdown (simply put: apply any filters you have as close to the data source as possible). During sel(), if a filtering key is found in the file pattern, the selection for that key is "pushed" down to the file selection so that only the relevant files are scanned.
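To make the idea concrete, here is a minimal, standard-library-only sketch of the pushdown step. It is not earthkit-data code: the helper names `split_selection` and `to_glob` are hypothetical, and the handling of the `{name:date(fmt)}` placeholder is a simplified reading of the pattern syntax. It splits a selection into keys that appear in the file pattern and keys that do not, then turns the pattern into a glob in which only the unselected keys remain wildcards.

```python
import datetime
import fnmatch
import glob
import re

# Matches "{name}" or "{name:spec}" placeholders (simplified assumption).
PLACEHOLDER = re.compile(r"\{(\w+)(?::([^{}]*))?\}")
# Matches a "date(<strftime format>)" spec, as used in the example pattern.
DATE_SPEC = re.compile(r"date\((.*)\)")


def split_selection(pattern, selection):
    """Split a selection into (pushed, residual) keys.

    pushed: keys present in the file pattern, usable to pick files.
    residual: keys that still have to be applied to the file contents.
    """
    keys = {m.group(1) for m in PLACEHOLDER.finditer(pattern)}
    pushed = {k: v for k, v in selection.items() if k in keys}
    residual = {k: v for k, v in selection.items() if k not in keys}
    return pushed, residual


def to_glob(pattern, pushed):
    """Render the pattern as a glob: pushed keys become literals, others '*'."""
    parts, pos = [], 0
    for m in PLACEHOLDER.finditer(pattern):
        parts.append(glob.escape(pattern[pos:m.start()]))
        name, spec = m.group(1), m.group(2) or ""
        if name in pushed:
            value = pushed[name]
            date_spec = DATE_SPEC.fullmatch(spec)
            text = value.strftime(date_spec.group(1)) if date_spec else str(value)
            parts.append(glob.escape(text))
        else:
            parts.append("*")
        pos = m.end()
    parts.append(glob.escape(pattern[pos:]))
    return "".join(parts)


PATTERN = "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib"
selection = {"date": datetime.datetime(2020, 5, 2), "step": 12, "param": "2t"}
pushed, residual = split_selection(PATTERN, selection)

candidates = [
    "path/to/data-2020-05-02-12.grib",
    "path/to/data-2020-05-02-24.grib",
    "path/to/data-2020-05-03-12.grib",
]
# Only files matching the pushed keys would need to be scanned.
to_scan = fnmatch.filter(candidates, to_glob(PATTERN, pushed))
```

With millions of files, the same glob could be handed to the filesystem (or the pattern expanded directly into paths), so the full file list never has to be materialised. The `residual` keys would then be applied as a normal sel() on the files that remain.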

A similar technique is implemented by libraries such as polars and duckdb for Hive partitioning (see e.g. https://pola.rs/posts/predicate-pushdown-query-optimizer/), and it is quite powerful.
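For comparison, Hive-style partitioning encodes the keys as `key=value` directory names, which makes pruning even simpler. The sketch below is plain Python written for illustration, not the polars or duckdb implementation; the function names and the example paths are made up.

```python
from pathlib import PurePosixPath


def hive_keys(path):
    """Extract key=value pairs from the directory components of a path."""
    keys = {}
    for part in PurePosixPath(path).parts[:-1]:
        name, sep, value = part.partition("=")
        if sep:
            keys[name] = value
    return keys


def prune(paths, **filters):
    """Keep paths whose partition keys agree with the filters.

    Filters on keys that are not partition keys are ignored here; they
    would have to be applied later, when the file contents are read.
    """
    kept = []
    for path in paths:
        keys = hive_keys(path)
        if all(keys[k] == str(v) for k, v in filters.items() if k in keys):
            kept.append(path)
    return kept


paths = [
    "data/date=2020-05-02/step=12/a.grib",
    "data/date=2020-05-02/step=24/a.grib",
    "data/date=2020-05-03/step=12/a.grib",
]
selected = prune(paths, date="2020-05-02", step=12, param="2t")
```

The file-pattern case is the same idea, with the keys taken from the filename template instead of from directory names.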

In earthkit-data, this would look something like:

import datetime

import earthkit.data

fds = earthkit.data.from_source(
    "file-pattern",
    "path/to/data-{date:date(%Y-%m-%d)}-{step}.grib"
)  # nothing is scanned yet; these are potentially millions of files

# Scanning happens here. Filtering keys that are part of the file pattern (date, step)
# are used first, so that only matching files are scanned; the remaining filtering
# keys (param) are then applied as well.
fds_sel = fds.sel(date=datetime.datetime(2020, 5, 2), step=12, param="2t")
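The deferred part could be prototyped outside the library along the following lines. This is only a sketch under stated assumptions: `LazyFilePattern` and its `files()` method are hypothetical names, the pattern uses plain `{name}` placeholders with string values, and `list_files` stands in for whatever actually enumerates the files. Unlike the proposal above, it defers the listing until `files()` is called, so that several sel() calls can be chained cheaply.

```python
import fnmatch
import string


class _Wildcard(dict):
    """Mapping that renders any key without a selected value as '*'."""

    def __missing__(self, key):
        return "*"


class LazyFilePattern:
    def __init__(self, pattern, list_files, filters=None):
        self.pattern = pattern
        self._list_files = list_files  # callable; only invoked in files()
        self._filters = dict(filters or {})

    def sel(self, **kwargs):
        # No I/O here: just record the filters on a new lazy object.
        merged = {**self._filters, **kwargs}
        return LazyFilePattern(self.pattern, self._list_files, merged)

    def files(self):
        # Keys named in the pattern can be pushed down to file selection.
        keys = {
            name
            for _, name, _, _ in string.Formatter().parse(self.pattern)
            if name
        }
        pushed = _Wildcard(
            {k: str(v) for k, v in self._filters.items() if k in keys}
        )
        return fnmatch.filter(self._list_files(), self.pattern.format_map(pushed))


calls = []


def fake_listing():
    # Stand-in for an expensive directory scan; records each invocation.
    calls.append(1)
    return [
        "data-2020-05-02-12.grib",
        "data-2020-05-02-24.grib",
        "data-2020-05-03-12.grib",
    ]


fds = LazyFilePattern("data-{date}-{step}.grib", fake_listing)
subset = fds.sel(date="2020-05-02", param="2t")  # still nothing scanned
calls_before = len(calls)
matching = subset.files()  # the listing happens only now
```

Keys such as `param` that do not appear in the pattern are simply carried along in this sketch; a real implementation would apply them to the fields of the selected files.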

Would this be relatively straightforward to implement? The logic is fundamentally very simple, but I don't know the codebase well, so it would be great to hear your thoughts and get some pointers on where to start working on this.

Labels: enhancement (New feature or request)
