Part of #4889
To better support writing nested data, this change updates the built-in
`StructLayout` in a backwards-compatible way.
- `StructStrategy` now shreds struct fields into their own `StructLayout`
recursively.
- `StructLayout` now supports nullable structs. It does this by writing
a new child layout containing the validity buffer for nullable arrays.
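
As a rough illustration of the two points above, here is a toy model of recursive shredding. The `Column` and `Layout` types and the `shred` and `leaf_paths` functions are hypothetical names invented for this sketch and are not the Vortex API; how the validity child is actually positioned and encoded is not shown here.

```rust
// Toy model only: these types are hypothetical, not the Vortex API.
#[derive(Debug, Clone, PartialEq)]
enum Column {
    Ints(Vec<i64>),
    Strs(Vec<String>),
    Struct {
        fields: Vec<(String, Column)>,
        // `None` means the struct is non-nullable.
        validity: Option<Vec<bool>>,
    },
}

#[derive(Debug, PartialEq)]
enum Layout {
    // A leaf column written as-is.
    Flat(Column),
    // The struct-level validity, written as its own child.
    Validity(Vec<bool>),
    // One child layout per field, plus an optional validity child.
    Struct {
        validity: Option<Box<Layout>>,
        children: Vec<(String, Layout)>,
    },
}

// Recursively shred struct columns so every nested struct gets its own
// struct layout and every leaf field gets its own child layout.
fn shred(column: Column) -> Layout {
    match column {
        Column::Struct { fields, validity } => Layout::Struct {
            validity: validity.map(|bits| Box::new(Layout::Validity(bits))),
            children: fields
                .into_iter()
                .map(|(name, child)| (name, shred(child)))
                .collect(),
        },
        leaf => Layout::Flat(leaf),
    }
}

// Collect the dotted paths of all leaf fields, i.e. the columns that can
// be addressed individually once the struct has been shredded.
fn leaf_paths(layout: &Layout, prefix: &str, out: &mut Vec<String>) {
    match layout {
        Layout::Struct { children, .. } => {
            for (name, child) in children {
                let path = if prefix.is_empty() {
                    name.clone()
                } else {
                    format!("{prefix}.{name}")
                };
                leaf_paths(child, &path, out);
            }
        }
        _ => out.push(prefix.to_string()),
    }
}
```

In this model, a non-nullable struct simply has no validity child, which is one way the change can stay backwards compatible with layouts that were written without one.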
I evaluated this on the `RealNest` dataset, which contains a copy of ~200k
GitHub pull request webhook events. Nested struct layout reduces file
size by about 10% compared to the previous strategy, and also makes
pushdown into the nested columns possible.
Some open questions:
* The validity child requires some extra handling. It seems like the
validity handling is very dependent on the expression being pushed down.
For example, if I'm doing a simple projection of a child field, then
adding the validity to the result is a simple masking operation. If I'm
pushing down an `UNNEST` or something else that increases the result
size, it is hard to map the validity buffer onto the `projection_eval`
result.
* I collect all validity chunks into a single buffer at write time. The
idea is that it's better to access the struct validity as a single
unit, since it is much smaller than the data. Assuming an 8MB target
segment size, this lets us comfortably fit ~64M rows (at one validity
bit per row) into a single segment. An alternative is to bring back
roaring, or enable some other boolean compressors.
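
To make the two open questions concrete, here is a minimal sketch that assumes validity is stored as an uncompressed bitmap at one bit per row, packed least-significant-bit first. All function names are hypothetical and are not the Vortex API. It shows the segment arithmetic, the concatenation of per-chunk validity into one buffer, and the simple masking case for a projected child field. It deliberately does not attempt the `UNNEST` case, where the result length no longer matches the validity length.

```rust
// Sketch only: hypothetical helpers, not the Vortex API.

// Rows whose validity fits in a segment at one bit per row.
fn rows_per_segment(segment_bytes: u64) -> u64 {
    segment_bytes * 8
}

// Concatenate per-chunk validity into a single packed bitmap
// (least-significant bit first). Returns the bytes and the row count.
fn pack_validity(chunks: &[Vec<bool>]) -> (Vec<u8>, usize) {
    let len: usize = chunks.iter().map(Vec::len).sum();
    let mut bytes = vec![0u8; (len + 7) / 8];
    for (row, valid) in chunks.iter().flatten().enumerate() {
        if *valid {
            bytes[row / 8] |= 1 << (row % 8);
        }
    }
    (bytes, len)
}

// Read one validity bit from the packed bitmap.
fn is_valid(bytes: &[u8], row: usize) -> bool {
    ((bytes[row / 8] >> (row % 8)) & 1) == 1
}

// Simple projection case: the projected child has one value per struct
// row, so the parent validity can be applied as a positional mask.
// Assumes `values` has no more entries than the bitmap has rows.
fn mask_projection<T: Clone>(values: &[T], validity: &[u8]) -> Vec<Option<T>> {
    values
        .iter()
        .enumerate()
        .map(|(row, value)| is_valid(validity, row).then(|| value.clone()))
        .collect()
}
```

Taking 8MB as 8 * 1024 * 1024 bytes, `rows_per_segment` gives 67,108,864 rows, which is where the ~64M figure comes from under these assumptions.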
---------
Signed-off-by: Andrew Duffy <[email protected]>