-
Notifications
You must be signed in to change notification settings - Fork 36
Description
Currently date time are treated just like integers:
| pub type LiquidDate32Array = LiquidPrimitiveArray<Date32Type>; |
According to Arrow spec, Date32 is just number of days since UNIX epoch (1970-01-01).
It's absolutely correct and efficient to just treat them as integers and will automatically benefit from the bitpacking and integer squeeze.
However, here's a better approach.
Observation:
From public_bi benchmark, we find a lot of date time usage is in the form of EXTRACT(YEAR from datetime): https://github.com/search?q=repo%3Acwida%2Fpublic_bi_benchmark+extract&type=code&p=1
It makes no sense to cache Month and Day if only Year is used.
Solution:
When a Date32 column is about to be evicted, we analyze its most frequently accessed filed, i.e., one of YEAR, MONTH, DAY; and squeeze the array to only keep that field.
For example, if we know only YEAR is used, then we can squeeze the YEAR out of the original Date32, and only cache that part.
Concrete steps:
- Measure how many times Date32 is been used in public_bi dataset. Find out how many times: (1) the entire date time is used, (2) the field YEAR/MONTH/DAY is used.
- Implement a
SqueezedDate32Array, where it contains the squeezed YEAR, MONTH or DAY. We will bit pack this array using the same bitpacking we did in our primitive arrays. - Decide which field to squeeze (YEAR, MONTH or DAY) based on metadata? where should we track metadata, what granularity should we do this?
- Integrate with the datafusion. This is the tricky part. See below.
How to make DataFusion pushdown the EXTRACT expression?
Currently it doesn't. I believe there's an issue for it: apache/datafusion#14993
It's not super clear to me how EXTRACT expression can be pushed down to data source, and how long it takes to implement it.
BUT, we can do this optimization even before datafusion has it.
First, write a optimizer rule to analyze the plan, and find out that the date32 field is only used to extract YEAR/MONTH/DAY, then we annotate this field and tell liquid cache that we only need that field.
Second, let's say we only care about YEAR field, and liquid cache knows this and we happened to cached this squeezed YEAR. Then we convert back from squeezed YEAR to full date32, with month and day all 0s. This is safe because we know later it will be extracted, and only YEAR is used.
@alamb pointed out that ClickHouse has a similar (but slightly different imo) approach for Date32 (https://www.vldb.org/pvldb/vol17/p3731-schulze.pdf)
Second, some functions can compute the preimage of a given function result. This is used to replace comparisons of constants with function calls on the key columns by comparing the key column value with the preimage. For example, toYear(k) = 2024 can be replaced by k >= 2024-01-01 && k < 2025-01-01.