-
Notifications
You must be signed in to change notification settings - Fork 38
Open
Description
Hi,
I have a case where data is clustered so that each file has a different value for the (integer) column foo I want to filter on. If I query with a simple filter where foo = 1, the file skipping works OK and only one file will be scanned. However, if I query for multiple values with an OR filter where foo = 1 OR foo = 2, this causes a full scan of all files. This seems to affect OR only, as where foo > 9 and foo < 12 also skips files. Here's a reproducible example, requires polars for writing the table:
import duckdb
import polars as pl
for file in range(50):
df = pl.DataFrame(
[
pl.Series("foo", [file], dtype=pl.Int64),
]
)
df.write_delta("test_table4", mode="append")
duckdb.execute("force install delta from core_nightly; UPDATE EXTENSIONS; load delta;").fetchall()
duckdb.execute("ATTACH './test_table4' AS my_delta_table (TYPE delta)")
# this scans only a subset of files as expected
res = duckdb.execute("""
explain analyze SELECT COUNT(*) from my_delta_table
where foo = 10
""").fetchall()
print(res[0][1])
# this scans all files
res = duckdb.execute("""
explain analyze SELECT COUNT(*) from my_delta_table
where foo = 10 or foo = 11
""").fetchall()
print(res[0][1])
# this skips files
res = duckdb.execute("""
explain analyze SELECT COUNT(*) from my_delta_table
where foo > 9 and foo < 12
""").fetchall()
print(res[0][1])
Reactions are currently unavailable
Metadata
Metadata
Assignees
Labels
No labels