Commit 677f3ee

Compact Compression (#3765)
Added a CompactCompression strategy that generally uses Zstd for binary/string data and Pco for numeric data types.

Size comparison, using the NYC Taxi dataset and running e.g. `cargo run --release -p vortex-tui convert fhvhv_tripdata_2023-04.parquet --strategy compact` to produce files:

```
433M fhvhv_tripdata_2023-04_btrblocks.vortex
334M fhvhv_tripdata_2023-04_compact_8192.vortex
321M fhvhv_tripdata_2023-04_compact_inf.vortex
469M fhvhv_tripdata_2023-04.parquet (zstd compressed)
```

Here the two compact strategies use up to 8192 values per page versus "inf", i.e. as many as possible. Using 8192 (the default I put in the code) slightly increases size but allows for faster access into slices and can (in the non-null, non-list case) line up with statistics nicely for potential pushdown filters.

NOT HAPPENING IN THIS PR (leaving this to future work):
* compression for bool arrays
* compression for decimal arrays
* dict encoding for variable-length types

Other changes this incurred:
* Added --strategy arg to convert command (and simplified flags stuff a bit)
* Added VarBinView support for Zstd encoding
* Added unit tests for all the new functionality

Fixes #3611.

---------

Signed-off-by: mwlon <m.w.loncaric@gmail.com>
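To illustrate the page-size tradeoff described in the message, here is a minimal sketch in plain Rust of how fixed-size pages can line up with per-page min/max statistics so that a filter skips whole pages. It is not Vortex code: `PageStats`, `page_stats`, and `pages_to_scan` are hypothetical names, and it assumes non-null `i64` values with no actual compression involved.

```rust
/// Hypothetical per-page summary (not a Vortex type).
#[derive(Debug, Clone, PartialEq)]
pub struct PageStats {
    /// Index of the first value covered by this page.
    pub start: usize,
    /// Number of values in this page (the last page may be shorter).
    pub len: usize,
    pub min: i64,
    pub max: i64,
}

/// Split `values` into pages of at most `values_per_page` values and
/// record min/max for each page. Passing `usize::MAX` models the "inf"
/// setting: everything lands in a single page.
pub fn page_stats(values: &[i64], values_per_page: usize) -> Vec<PageStats> {
    assert!(values_per_page > 0, "page size must be positive");
    values
        .chunks(values_per_page)
        .enumerate()
        .map(|(i, page)| PageStats {
            // For i == 0 this is 0 even when values_per_page is usize::MAX.
            start: i * values_per_page,
            len: page.len(),
            min: *page.iter().min().expect("chunks are never empty"),
            max: *page.iter().max().expect("chunks are never empty"),
        })
        .collect()
}

/// Indices of pages that may contain a value `>= threshold`.
/// Pages whose max is below the threshold can be skipped without decoding.
pub fn pages_to_scan(stats: &[PageStats], threshold: i64) -> Vec<usize> {
    stats
        .iter()
        .enumerate()
        .filter(|(_, s)| s.max >= threshold)
        .map(|(i, _)| i)
        .collect()
}
```

With 20,000 sorted values and 8192 values per page, a filter such as `x >= 16_000` only needs to touch the last two of three pages; with a single "inf" page the whole column has to be decoded, which is the access-cost side of the size difference reported above.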
1 parent 6564035 commit 677f3ee

File tree

14 files changed

+762
-98
lines changed

Cargo.lock

Lines changed: 5 additions & 0 deletions
