Skip to content

Filtering with OR condition causes all files to be scanned #157

@eeroel

Description

@eeroel

Hi,

I have a case where data is clustered so that each file has a different value for the (integer) column foo I want to filter on. If I query with a simple filter where foo = 1, the file skipping works OK and only one file will be scanned. However, if I query for multiple values with an OR filter where foo = 1 OR foo = 2, this causes a full scan of all files. This seems to affect OR only, as where foo > 9 and foo < 12 also skips files. Here's a reproducible example, requires polars for writing the table:

import duckdb
import polars as pl

for file in range(50):
    df = pl.DataFrame(
        [
            pl.Series("foo", [file], dtype=pl.Int64),
        ]
    )
    df.write_delta("test_table4", mode="append")


duckdb.execute("force install delta from core_nightly; UPDATE EXTENSIONS; load delta;").fetchall()
duckdb.execute("ATTACH './test_table4' AS my_delta_table (TYPE delta)")

# this scans only a subset of files as expected
res = duckdb.execute("""
    explain analyze SELECT COUNT(*) from my_delta_table
    where foo = 10
""").fetchall()
print(res[0][1])

# this scans all files
res = duckdb.execute("""
    explain analyze SELECT COUNT(*) from my_delta_table
    where foo = 10 or foo = 11
""").fetchall()
print(res[0][1])

# this skips files
res = duckdb.execute("""
    explain analyze SELECT COUNT(*) from my_delta_table
    where foo > 9 and foo < 12
""").fetchall()
print(res[0][1])

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions