Extension system for custom types and/or custom extra_stats-based pruning #702

redox · 2026-01-17T10:57:05Z

Hello DuckLake team! 🦆

I’ve been experimenting with adding file pruning based on JSON key statistics (min/max plus null/missing counts per extracted key path), and it turned out to be (obviously) super effective for our use case, where queries filter on json_extract_string(...).

The idea is: while writing files, compute per-file extra stats for JSON columns, store them in ducklake_file_column_stats.extra_stats, then during scan, recognize predicates like:

json_extract_string(data, '$.name') = 'zach'
or json_extract_string(data, '$.address.city') >= 'M'
and also handle missing JSON keys

to prune as many files as possible before reading them.
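As a rough illustration of the idea above, here is a minimal Python sketch of per-file key stats and the skip decision. It is not the actual implementation: the stats layout, the helper names (`extract`, `collect_key_stats`, `can_skip`) and the key path given as a list of keys are all assumptions made for this example, and it only prunes on string values, refusing to prune if it sees anything else.

```python
import json

MISSING = object()  # sentinel: the key path is absent from the document

def extract(doc, path):
    """Follow a list of keys into a parsed JSON value; return MISSING if absent."""
    cur = doc
    for key in path:
        if not isinstance(cur, dict) or key not in cur:
            return MISSING
        cur = cur[key]
    return cur

def collect_key_stats(rows, path):
    """Hypothetical per-file stats for one key path: min/max plus null/missing counts."""
    stats = {"min": None, "max": None, "null_count": 0,
             "missing_count": 0, "has_non_string": False}
    for raw in rows:
        value = extract(json.loads(raw), path)
        if value is MISSING:
            stats["missing_count"] += 1
        elif value is None:
            stats["null_count"] += 1
        elif not isinstance(value, str):
            # Be conservative: min/max below only describe string values.
            stats["has_non_string"] = True
        else:
            if stats["min"] is None or value < stats["min"]:
                stats["min"] = value
            if stats["max"] is None or value > stats["max"]:
                stats["max"] = value
    return stats

def can_skip(stats, op, literal):
    """Return True only when no row in the file can satisfy `key <op> literal`."""
    if stats is None or stats["has_non_string"]:
        return False  # no usable stats: the file must be read
    if stats["min"] is None:
        return True   # every row is null or missing, so the comparison never matches
    if op == "=":
        return literal < stats["min"] or literal > stats["max"]
    if op == ">=":
        return stats["max"] < literal
    if op == "<=":
        return stats["min"] > literal
    return False      # unrecognized operator: do not prune

rows = ['{"name": "alice"}', '{"name": "bob"}', '{"other": 1}', '{"name": null}']
name_stats = collect_key_stats(rows, ["name"])
skip_file = can_skip(name_stats, "=", "zach")  # "zach" sorts after the file's max
```

Rows where the key is missing or null never satisfy a comparison, so they only need to be counted, not included in min/max.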

This raised a bigger question for me: should DuckLake have a small, generic "extra stats plugin" hook so features like this can live in dedicated extensions (or out-of-tree) without needing to merge controversial heuristics into core DuckLake?

Concretely, I’m imagining something like:

a way for an extension to register an extra-stats producer for certain column types (e.g., JSON) during stats collection/write
and a way to register a file-pruner that, given bound filters + loaded extra_stats, can decide to skip files (similar stage as existing file pruning)
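To make the shape of such a hook a bit more concrete, here is a small Python sketch of a registry with the two registration points described above. Everything in it is hypothetical: `ExtraStatsRegistry`, `register_producer`, `register_pruner` and the callback signatures are invented for illustration and do not correspond to any existing DuckLake API. It also assumes the filters passed to the pruner stage are ANDed together.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List

# Hypothetical callback shapes (not an existing DuckLake interface).
StatsProducer = Callable[[List[Any]], Dict[str, Any]]  # column values -> extra stats
FilePruner = Callable[[Dict[str, Any], Any], bool]     # (extra_stats, filter) -> can skip?

@dataclass
class ExtraStatsRegistry:
    producers: Dict[str, List[StatsProducer]] = field(default_factory=dict)
    pruners: List[FilePruner] = field(default_factory=list)

    def register_producer(self, column_type: str, producer: StatsProducer) -> None:
        """Attach a stats producer to a column type such as "JSON"."""
        self.producers.setdefault(column_type.upper(), []).append(producer)

    def register_pruner(self, pruner: FilePruner) -> None:
        """Attach a pruner; it must return False for filters it does not understand."""
        self.pruners.append(pruner)

    def collect(self, column_type: str, values: List[Any]) -> Dict[str, Any]:
        """Write path: merge the output of every producer registered for this type."""
        merged: Dict[str, Any] = {}
        for producer in self.producers.get(column_type.upper(), []):
            merged.update(producer(values))
        return merged

    def should_skip(self, extra_stats: Dict[str, Any], filters: List[Any]) -> bool:
        """Scan path: skip the file if any pruner rules out any one of the ANDed filters."""
        return any(pruner(extra_stats, flt)
                   for pruner in self.pruners
                   for flt in filters)

# Toy usage: a producer that records a row count, and a pruner that skips empty files.
registry = ExtraStatsRegistry()
registry.register_producer("JSON", lambda values: {"row_count": len(values)})
registry.register_pruner(lambda stats, flt: stats.get("row_count") == 0)
```

The point of the sketch is that core only needs to know when to call the producers and pruners; what the stats contain and which predicates they can rule out stays entirely inside the extension.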

This feels analogous in spirit to how DuckLake can interop with extensions like spatial (type + behavior), but here the behavior is scan-time pruning rather than a new type.

As an alternative, we could also consider modeling this as a new DuckLake type (or a DuckLake-level “logical type” concept) rather than “just” stats. Writes would encode the data into an existing physical representation (likely still Parquet VARCHAR/STRUCT/BLOB) but with a stable DuckLake-level type string in metadata.

Questions for @pdet and the maintainer team:

Do you think a new DuckLake type mapping to a BLOB (or JSON) could be better, even if it’s more invasive?
Would you be open to a minimal upstream change that adds a stable hook point for "extra stats producers + pruners"?
Is ducklake_file_column_stats.extra_stats considered a stable place to store these kinds of per-file auxiliary stats, or would you prefer a different metadata surface?
Any guidance on where you’d want such a hook to live, and what constraints you’d want?

If this direction sounds reasonable, I’m happy to:

open a PR that only introduces the generic hook
maintain a separate extension implementing JSON key stats + pruning, with tests

Would love your thoughts!

Alex-Monahan · 2026-01-17T20:17:36Z

Howdy and thanks for the experimentation! [Disclaimer - I am not on the core team!]

Do you think that when DuckLake supports the Variant data type that some of these use cases could be handled with that?

1 reply

redox Jan 17, 2026
Author

> Do you think that when DuckLake supports the Variant data type that some of these use cases could be handled with that?

I'm also excited about the upcoming Variant type, but I wasn't sure whether the plan was to implement key-level stats at the DuckLake level right away. Variant brings quite a few features compared to JSON, but it could have the same pruning issues as JSON if it doesn't come with key-level stats.

Curious to hear what the core team has in mind!
