Howdy and thanks for the experimentation! [Disclaimer - I am not on the core team!] Do you think that, once DuckLake supports the Variant data type, some of these use cases could be handled with that?
Hello DuckLake team! 🦆
I’ve been experimenting with adding file pruning based on JSON key statistics (min/max + null/missing counts per extracted key path), and it turned out to be (obviously) super effective for our use-case with queries that filter on json_extract_string(...).
The idea is: while writing files, compute per-file extra stats for JSON columns, store them in ducklake_file_column_stats.extra_stats, then during scan, recognize predicates like:
json_extract_string(data, '$.name') = 'zach'
json_extract_string(data, '$.address.city') >= 'M'
to prune as many files as possible before reading them.
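To make the idea concrete, here is a minimal Python sketch of the two halves described above: collecting min/max and missing counts per key path at write time, then using them to decide whether a file can be skipped. This is not DuckLake's code or the actual patch. The names `KeyStats`, `extract_string`, `compute_stats`, and `may_match` are hypothetical, the path parser only handles simple `$.a.b` paths, and string comparison is assumed to be plain binary ordering. How the stats would be serialized into extra_stats is not shown.

```python
import json
from dataclasses import dataclass
from typing import Any, Dict, Iterable, List, Optional


@dataclass
class KeyStats:
    """Hypothetical per-file stats for one JSON key path."""
    min_value: Optional[str] = None
    max_value: Optional[str] = None
    missing_count: int = 0  # rows where the path is absent or null


def extract_string(doc: Any, path: str) -> Optional[str]:
    """Simplified stand-in for json_extract_string; supports '$.a.b' paths only."""
    node = doc
    for part in path.lstrip("$").strip(".").split("."):
        if not isinstance(node, dict) or part not in node:
            return None
        node = node[part]
    if node is None:
        return None
    return node if isinstance(node, str) else json.dumps(node)


def compute_stats(rows: Iterable[str], paths: List[str]) -> Dict[str, KeyStats]:
    """Write-time step: scan one file's JSON rows and collect stats per key path."""
    stats = {p: KeyStats() for p in paths}
    for raw in rows:
        doc = json.loads(raw)
        for p in paths:
            value = extract_string(doc, p)
            s = stats[p]
            if value is None:
                s.missing_count += 1
                continue
            s.min_value = value if s.min_value is None else min(s.min_value, value)
            s.max_value = value if s.max_value is None else max(s.max_value, value)
    return stats


def may_match(stats: Optional[KeyStats], op: str, literal: str) -> bool:
    """Scan-time step: return False only if the stats prove no row can match."""
    if stats is None:
        return True  # no stats recorded for this path: the file must be read
    if stats.min_value is None:
        return False  # path missing/null in every row, so the comparison is never true
    if op == "=":
        return stats.min_value <= literal <= stats.max_value
    if op == ">=":
        return stats.max_value >= literal
    if op == "<=":
        return stats.min_value <= literal
    return True  # unrecognized operator: stay conservative and keep the file


# Two made-up files, each a list of JSON rows.
files = {
    "a.parquet": ['{"name": "alice", "address": {"city": "Boston"}}',
                  '{"name": "bob", "address": {"city": "Austin"}}'],
    "b.parquet": ['{"name": "zach", "address": {"city": "Miami"}}',
                  '{"name": "yara"}'],
}
paths = ["$.name", "$.address.city"]
file_stats = {f: compute_stats(rows, paths) for f, rows in files.items()}

# json_extract_string(data, '$.name') = 'zach'
kept_name = [f for f, s in file_stats.items() if may_match(s.get("$.name"), "=", "zach")]
# json_extract_string(data, '$.address.city') >= 'M'
kept_city = [f for f, s in file_stats.items() if may_match(s.get("$.address.city"), ">=", "M")]
```

The pruning check is deliberately one-sided: it may keep a file that contains no matching rows, but it should never drop a file that does, which is why missing stats and unrecognized operators both fall back to reading the file.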
This raised a bigger question for me: should DuckLake have a small, generic "extra stats plugin" hook so features like this can live in dedicated extensions (or out-of-tree) without needing to merge controversial heuristics into core DuckLake?
Concretely, I’m imagining something like:
This feels analogous in spirit to how DuckLake can interop with extensions like spatial (type + behavior), but here the behavior is scan-time pruning rather than a new type.
As an alternative, we could also consider modeling this as a new DuckLake type (or a DuckLake-level “logical type” concept) rather than “just” stats. Writes would encode the data into an existing physical representation (likely still Parquet VARCHAR/STRUCT/BLOB) but with a stable DuckLake-level type string in metadata.
Questions for @pdet and the maintainer team:
Is ducklake_file_column_stats.extra_stats considered a stable place to store these kinds of per-file auxiliary stats, or would you prefer a different metadata surface?
If this direction sounds reasonable, I’m happy to:
Would love your thoughts!